Memory-based Deep Reinforcement Learning for Obstacle Avoidance in UAV with Limited Environment Knowledge

11/08/2018 ∙ by Abhik Singla, et al. ∙ Indian Institute of Science

This paper presents our method for enabling a UAV quadrotor, equipped with a monocular camera, to autonomously avoid collisions with obstacles in unstructured and unknown indoor environments. When compared to obstacle avoidance in ground vehicular robots, UAV navigation brings in additional challenges because the UAV motion is no longer constrained to a well-defined indoor ground or street environment. Horizontal structures in indoor and outdoor environments like decorative items, furnishings, ceiling fans, sign-boards, tree branches etc., also become relevant obstacles, unlike for ground vehicular robots. Thus, methods of obstacle avoidance developed for ground robots are clearly inadequate for UAV navigation. Current control methods using monocular images for UAV obstacle avoidance are heavily dependent on environment information. These controllers do not fully retain and utilize the extensively available information about the ambient environment for decision making. We propose a deep reinforcement learning based method for UAV obstacle avoidance (OA) and autonomous exploration which is capable of doing exactly that. The crucial idea in our method is the concept of partial observability and how UAVs can retain relevant information about the environment structure to make better future navigation decisions. Our OA technique uses recurrent neural networks with temporal attention and provides better results compared to prior works in terms of distance covered during navigation without collisions. In addition, our technique has a high inference rate (a key factor in robotic applications) and is energy-efficient as it minimizes oscillatory motion of the UAV and reduces power wastage.




Supplementary Material

For supplementary video see: The project’s code is available at

I Introduction

Unmanned aerial vehicles (UAVs) or “drones” are cyber-physical systems that can be operated either by remote control (using a mobile application on a smartphone over a wireless channel) or autonomously using onboard computers. Ranging from crop [1] and infrastructure monitoring [2], rescue operations and disaster management [3], to more popular uses like goods delivery and filming [4, 5], UAVs are increasingly finding their application in diverse scenarios. Owing to their small size and light weight, UAVs can penetrate into constricted spaces or effortlessly glide over pre-specified geographical areas, the majority of which may possibly be beyond the reach of humans. However, UAVs still lack some elementary capabilities which impede their widespread use. One such example is the ability to avoid obstacles. Avoiding obstacles is a non-trivial task because the obstacles might be so positioned that avoiding them requires delicate and dexterous movements. To be able to avoid obstacles, the UAV must be able to perceive the distance between itself and the obstacles along with other visual cues such as the shape of the obstacle and its height. This crucial visual information enables a UAV to infer traversable spaces and obstacles (see Fig. 1 for an illustration).

Fig. 1: A UAV encountering stationary as well as moving obstacles in an indoor environment. Here, the walking human being is a moving obstacle, whose direction and future intent of motion cannot be predicted.

Classical approaches for inferring visual geometry include techniques like Simultaneous Localization and Mapping (SLAM) and Structure from Motion (SfM). These techniques use measurements from sensors like Kinect [6], Light Detection and Ranging (LIDAR), Sound Navigation and Ranging (SONAR), optical flow, stereo and monocular cameras for computation. SLAM algorithms utilize measurements from a single sensor [7] or a combination of sensors [8] to build or update a map of the environment surrounding the UAV while simultaneously using the same map to estimate the UAV’s position. The SfM approaches use measurements from sensors like optical flow [9] and/or a moving monocular camera [10] to determine the depth map and the 3D structure. SLAM and SfM approaches require the UAV to compute a path and then navigate through it. The UAV needs to repeatedly hover, compute the depth map and then find a suitable path. Thus, path planning on the fly is not easy in SLAM and SfM approaches. This also means that SLAM and SfM approaches cannot be used for real-time obstacle avoidance based on the visual information gathered about the surroundings. [11] proposes a SLAM technique which computes a path on the fly. However, such an enhancement does not avoid dynamic and non-stationary obstacles whose movements cannot be predicted. Another disadvantage of using SLAM and SfM methods is that these do not detect untextured walls. Untextured walls normally arise in indoor environments and hence being able to distinguish textures on walls is crucial to obstacle avoidance.

Kinect, LIDAR, SONAR, optical flow and stereo camera sensors are widely used for depth estimation (see [12, 13]) and hence can potentially be used for obstacle avoidance as well, without resorting to computation-intensive approaches like SLAM and SfM. However, these sophisticated sensors are expensive and add unnecessary burden to the UAV in terms of weight as well as power consumption. Moreover, optical flow and stereo cameras are not suited for long-range obstacle avoidance. The monocular camera, on the other hand, is essential for every UAV application, as it gives visual information. It is a low-cost sensor which provides RGB images of the UAV’s ambient environment. In comparison to the heavy-weight sensors mentioned earlier, a monocular camera is light-weight. The question then is whether we can use a monocular camera for depth estimation as well, and plausibly for obstacle avoidance.

Extracting the range information (i.e., the distance between the sensor and the various objects in front of the sensor) from monocular RGB images is a challenging problem, simply because the camera captures only the 2-D information of the surrounding environment. Some recent works [14]-[23] address the issue of depth prediction using monocular camera RGB images by leveraging deep learning techniques. Supervised and semi-supervised learning approaches collect huge amounts of data consisting of the monocular images and the corresponding depth maps to train a deep learning model. Such models are based on convolutional neural networks (CNN) or their variants (residual networks [14]). Given a single image, the deep network outputs the predicted depth map from the monocular image. The approaches proposed in [14]-[18], however, do not tackle the vital problem of UAV obstacle avoidance and navigation, which is the problem that we address in this paper.

Varied obstacle avoidance techniques in conjunction with depth prediction are proposed in [19]-[23]. [19] proposes a behavior arbitration scheme to obtain the yaw and pitch angles for the UAV to avoid an obstacle and for navigation in general. Trajectory planning using obstacle bounding boxes and depth estimation is explored in [20]. This work designs a CNN architecture that jointly estimates depth and obstacle bounding boxes. The extracted information is then utilized in the RRT-Connect planner to plan trajectories between a start and end point. [21] proposes two different CNN architectures - one for depth and surface normal estimation and the other for trajectory prediction. Both the CNNs use a 3D cost function for training and evaluation. [22] follows an unconventional approach, wherein the authors collect a dataset of UAV crashes. This dataset is labeled and then input to a CNN model. Given an image obtained from the monocular camera, the network predicts how the UAV should move in the next instant to avoid a crash.

UAV navigation in the presence of obstacles is inherently a sequential decision making problem under uncertainty. This is because an action taken at an instant affects the path of the UAV in the future instants too. Hence, it is appropriate to design obstacle avoidance in UAVs as a reinforcement learning (RL) problem. CAD2RL [23] proposes a Deep RL (DRL) method for obstacle avoidance in indoor flight. This work trains a UAV for navigation using simulated 3D hallway environments. For this, a large number of 3D hallway environment images with different lighting, wall textures and furniture placement are generated, and a deep Q-network learns the UAV movement policy on these images. However, this work requires a substantial amount of data concerning the images of hallway environments and is not efficient. Moreover, the method proposed in [23] is not intuitive. It does not attempt to mimic how humans learn to avoid obstacles. The basic information which helps the human brain to navigate is the depth information (owing to binocular vision) and not the RGB information.

Our work adds a new dimension to the existing work on UAV obstacle avoidance. We are motivated by how humans decide what to do next in a given scenario. Humans have limited or partial access to the environment, but are still able to solve challenging problems in their daily lives. All this is possible because the human brain has memory, which is key to summarizing and storing relevant information for tackling problems. This memory is capable of effectively storing and recalling relevant information gathered over time in order to take the next suitable decision in every scenario. UAV obstacle avoidance and navigation present a similar problem of partial observability which requires a notion of memory. For example, while navigating, a UAV may fly towards a corner. When it is approaching the corner, the depth map might indicate more space in the front when compared to the sides. The lack of temporal information, coupled with the limited field of vision of the monocular camera, causes the UAV to move ahead towards the corner and crash into the wall. Such scenarios are very common in UAV navigation and hence require a controller which can utilize the relevant past information. Our aim is to design a UAV control algorithm which has the capability to combine information obtained over a period of time in order to make better navigation and obstacle avoidance decisions.

We propose a deep RL method which enables the UAV controller to collect and store relevant observations gathered over time. This method is based on a recurrent neural network (RNN) architecture with an additional function called Temporal Attention. Using this architecture, the UAV controller learns a control policy to avoid obstacles.

I-A Organization of the Paper

The next section describes the method we have developed for UAV obstacle avoidance. Section III gives the details of experimental settings and the simulation environments used for highlighting the performance of our method. Sections IV and V describe the results on a number of simulation settings and also bring out the advantages as well as limitations of our approach. Section VI concludes the paper and points out future improvements for our method.

II The Method

The objective of our work is to find a suitable policy (a sequence of actions given states of the environment) for UAV navigation that avoids obstacles (both stationary and mobile). We propose a general method which can find such suitable policies. Our method can be integrated with a high-level planner which is supplied with an overall path objective, a start position and a goal position.

II-A Problem Definition

In order to safely navigate without colliding against obstacles in an indoor or outdoor environment, the UAV needs to be aware of the state of the environment. The state of the environment is a tuple of properties of the environment which characterize it and aid the UAV in navigation. Once the state is known, the UAV selects an appropriate action. The action the UAV chooses affects the visual information available to the UAV. In the obstacle avoidance problem, this means the UAV chooses to move in some particular direction leading to a change in its position, orientation and visual feedback. The UAV gets to observe more obstacles or perhaps more free space in front depending on this change in position and/or orientation. As noted in Section I, the UAV needs to choose an action depending on the state at every instant when it navigates through the environment. Further, each action taken affects future states and hence future decisions of the UAV. Based on the action taken, the realization of the next state is probabilistic, implying that navigation by avoiding obstacles is a sequential decision making problem in the face of uncertainty.

Prior works [23, 24] assume that the monocular image of the environment is a good indicator of the state of the environment. However, since the UAV’s monocular camera has a limited field of vision, we believe that the UAV controller cannot infer the full state of the environment solely based on the RGB image. Instead, the UAV controller only has an estimate of the state and this estimate is formally known as an observation. In the method we propose, the input to our model is a monocular RGB image without depth or other sensory inputs, whereas the observation $o$ is the predicted depth map obtained from the monocular image. Based on these assumptions, we model the UAV obstacle avoidance and navigation problem in the framework of partially observable Markov decision processes (POMDPs).

We propose a POMDP model $(S, A, P, R, \Omega, O, \gamma)$ for the obstacle avoidance problem. Here $S$ is the set of states of the environment, referred to as the “state space”, while $A$ is the set of feasible actions, referred to as the “action space”. $P$ is the transition probability function that models the evolution of states based on the actions chosen and is defined as $P : S \times A \times S \to [0, 1]$. $R$ is the reinforcement or reward function, defined as $R : S \times A \to \mathbb{R}$. The reward function serves as a feedback signal to the UAV for the action chosen. For instance, in a state $s$, if the UAV selects an action $a$ which steers it away from an obstacle, the reward for that state-action pair is positive, implying that the action is beneficial in the state $s$, while picking an action which results in a collision will naturally yield a negative reward. $\Omega$ is the set of observations, and an observation $o \in \Omega$ is an estimate of the true state $s$. $O : S \times A \times \Omega \to [0, 1]$ is a conditional probability distribution over $\Omega$, while $\gamma \in [0, 1)$ is the discount factor. At each time $t$, the environment state is $s_t \in S$. The UAV takes an action $a_t \in A$ which causes the environment to transition to state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$. Based on this transition, the UAV receives an observation $o_{t+1} \in \Omega$ which depends on $s_{t+1}$ with probability $O(o_{t+1} \mid s_{t+1}, a_t)$. The aim is to solve the obstacle avoidance problem, which translates to the task of finding an optimal policy $\pi^*$. By determining an optimal policy, the UAV controller is able to select an action at each time step that maximizes the expected sum of discounted rewards, which is denoted as $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$.
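
As a small illustration of the objective above, the following sketch computes the discounted sum of rewards for one finite episode. The helper name `discounted_return` and the reward values are illustrative assumptions and are not taken from the paper; only the discount factor 0.99 matches the value listed in Table I.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for a finite list of per-step rewards."""
    total = 0.0
    # Work backwards through the episode: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: a few small positive rewards, then a collision penalty.
episode_rewards = [1.0, 1.0, 1.0, -10.0]
print(discounted_return(episode_rewards))
```

Because later rewards are multiplied by higher powers of the discount factor, a collision penalty far in the future weighs less than one that is imminent.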

II-B Model

We need to define the sets $S$, $A$, $\Omega$ and the functions $P$, $R$, $O$ in order to find an optimal policy. The input to our model is the monocular RGB image, without any depth information. Our model extracts the depth map from the RGB image, which acts as the observation for the UAV controller. The depth map predicted from the RGB image indicates the distance between the objects and the UAV. Given an observation, the feasible actions ($A$) available for the UAV are “go straight”, “turn right” and “turn left”. The reward function is designed using the depth information and its exact analytical form is explained in Section III. In order to determine the functions $P$ and $O$, we must be aware of the structure of the environment and the motion dynamics of the UAV. In practice, these are impossible to know, but the UAV must be capable of navigating in unknown, unstructured environments in the presence of other factors like wind, turbulence etc. Thus, we propose a reinforcement learning technique to find an optimal policy for UAV navigation. Reinforcement learning is a model-free, learning-based approach to solve (PO)MDPs when the model information via $P$ (and $O$) is not available.

When model information is unavailable, one of the well-known approaches learns an optimal policy using Q-values. The Q-value $Q^\pi(s, a)$ corresponding to the policy $\pi$ is defined as the expected sum of discounted rewards obtained by taking the action $a$ in state $s$ and following the policy $\pi$ thereafter. The optimal Q-values are defined as $Q^*(s, a) = \max_\pi Q^\pi(s, a)$. Once the optimal Q-values in a state $s$ are obtained, the optimal action is picked by finding $\arg\max_a Q^*(s, a)$. So, the optimal policy can be computed by finding the optimal action for every state. Q-learning [27] is a model-free iterative algorithm to learn the optimal Q-value of every state-action pair. The Q-value update of any such pair $(s, a)$, with learning rate $\alpha$, observed reward $r$ and next state $s'$, is given below:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \qquad (1)$$
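
The tabular Q-learning update described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `q_learning_update`, the learning rate, the `done` flag for terminal transitions and the toy table size are all hypothetical and not taken from the paper.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrap from the best next action unless the episode has ended.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error


# Hypothetical toy problem: 4 states and 3 actions (straight, left, right).
Q = np.zeros((4, 3))
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
print(Q[0])
```

The table `Q` needs one entry per state-action pair, which is exactly what becomes impractical when states are images.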


However, this algorithm suffers from the curse of dimensionality. This is because iteratively learning the Q-values for a huge state space requires maintaining and updating Q-values for all unique state-action pairs, which turns out to be computationally infeasible. Deep Q-Networks (DQN) [28] solve this issue by utilizing a neural network parametrized by weights ($\theta$) to approximate the Q-value (denoted as $Q(s, a \mid \theta)$) for a given state input. Experience replay, in which experience tuples $(s_t, a_t, r_t, s_{t+1})$ are stored in a replay memory ($\mathcal{D}$), improves the stability of the algorithm. During training, mini-batches of experience are sampled uniformly from $\mathcal{D}$ and input to the network to calculate the Bellman residual as the loss, given by

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a' \mid \theta^-) - Q(s, a \mid \theta) \right)^2 \right] \qquad (2)$$

Here, $\theta^-$ represents the weights of the target network, which is an older copy of the network weights lagging behind by a few iterations. To achieve a better approximation, the weights $\theta$ are updated using mini-batch gradient descent.
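
To make the Bellman-residual loss concrete, here is a minimal NumPy sketch that evaluates it on one mini-batch. The two callables stand in for the online and target networks; the linear "networks", the batch contents and the `done` mask for terminal transitions are assumptions made for illustration and are not part of the paper's description.

```python
import numpy as np


def dqn_loss(q_online, q_target, batch, gamma=0.99):
    """Mean squared Bellman residual over a mini-batch.

    q_online / q_target map a batch of states (N, d) to Q-values of shape
    (N, num_actions); they stand in for the online and target networks.
    """
    s, a, r, s_next, done = batch
    # The target is computed with the lagging network and held constant.
    max_next = q_target(s_next).max(axis=1)
    y = r + gamma * (1.0 - done) * max_next
    # Q-value of the action actually taken in each sampled transition.
    q_sa = q_online(s)[np.arange(len(a)), a]
    return np.mean((y - q_sa) ** 2)


rng = np.random.default_rng(0)
W_online = rng.normal(size=(5, 3))   # hypothetical linear "network"
W_target = W_online.copy()           # older copy of the weights
batch = (
    rng.normal(size=(32, 5)),        # states
    rng.integers(0, 3, size=32),     # actions
    rng.normal(size=32),             # rewards
    rng.normal(size=(32, 5)),        # next states
    np.zeros(32),                    # done flags (assumed, 1.0 = terminal)
)
print(dqn_loss(lambda x: x @ W_online, lambda x: x @ W_target, batch))
```

In a real training loop the gradient of this loss would be taken with respect to the online weights only, and the target weights would be refreshed periodically (Table I lists an update frequency of 400).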

Since an observation $o$ received in a POMDP is only a representative of the underlying environment state $s$, $Q(o, a \mid \theta) \neq Q(s, a \mid \theta)$ holds. However, adding recurrency to DQN integrates the observations over time to better estimate the underlying state, thereby narrowing the gap between $Q(o, a \mid \theta)$ and $Q(s, a \mid \theta)$ [29]. Hence, we present a memory-augmented convolutional neural network architecture to approximate the Q-values from the observations. The performance of the proposed architecture for UAV obstacle avoidance is analyzed in Section IV.

II-C Deep Recurrent Q-Network with Temporal Attention

The architecture for approximating Q-values is based on a deep recurrent Q-network with attention. This approach essentially keeps track of the past few observations. In the UAV obstacle avoidance application, we keep track of the depth maps obtained from the RGB images. The recurrent network possesses the ability to learn temporal dependencies by using information from an arbitrarily long sequence of observations, while the temporal attention weighs each of the recent observations based on its importance in decision-making.

At time $t$, the proposed model utilizes a sequence of $L$ recent observations $o_{t-(L-1)}, \ldots, o_t$. Each observation is a depth map which is processed by the convolutional layers of the network, followed by a fully connected layer augmented with an LSTM [31] recurrent network layer. The DRQN model with LSTM estimates the Q-value $Q(o_t, h_{t-1}, a_t \mid \theta)$, where $h_{t-1}$ is the hidden state of the recurrent network, determined as $h_t = \mathrm{LSTM}(h_{t-1}, o_t)$. The hidden state represents the information gathered over time.

Fig. 2: Control Network: Architecture of Deep Recurrent Q-Network with Temporal Attention. The number of filters, stride and output size are mentioned for each convolutional layer.

Following the LSTM layer, we propose the use of Temporal Attention [30] in our model for evaluating the informativeness of each observation in the sequence. Temporal attention optimizes a weight vector with values depicting the importance of observations at the previous instants. This increases the training speed and provides better generalizability over the training dataset. Let $(v_{t-(L-1)}, \ldots, v_t)$ be the vector of feature vectors obtained from the convolutional layers over the past $L$ observations. For each $i \in \{0, \ldots, L-1\}$, $v_{t-i}$ is a feature vector in $\mathbb{R}^{d_v}$. The vector of weights $(w_{t-(L-1)}, \ldots, w_t)$ for the feature vectors is computed using the obtained hidden state values and the feature vectors, given by:

$$w_{t-i} = u^\top \phi\left( W_h h_{t-i} + W_v v_{t-i} + b \right) \qquad (3)$$

in which $W_h \in \mathbb{R}^{d_a \times d_h}$, $W_v \in \mathbb{R}^{d_a \times d_v}$, $b \in \mathbb{R}^{d_a}$ and $u \in \mathbb{R}^{d_a}$ are all learnable parameters and $i \in \{0, \ldots, L-1\}$. In (3), $\phi(\cdot)$ is an activation function which is computed for every element of its vector argument. Here, we assume that $d_h$ is the size of an RNN hidden state, $d_v$ is the encoding size of the CNN and $d_a$ is the attention matrix size. The activation function is applied pointwise on the vector obtained from $W_h h_{t-i} + W_v v_{t-i} + b$.

These weights are normalized using the softmax function:

$$\beta_{t-i} = \frac{\exp(w_{t-i})}{\sum_{j=0}^{L-1} \exp(w_{t-j})}$$

Further, to predict the Q-values, a context vector $c_t$ is computed using the above calculated softmax weights and hidden states as:

$$c_t = \sum_{i=0}^{L-1} \beta_{t-i} \, h_{t-i}$$

The obtained context vector is input to a single fully connected layer with ReLU activation functions that outputs the approximated Q-value for each action. The complete model is trained by minimizing a loss function as described in [29]. The proposed model using temporal attention is illustrated in Fig. 2.
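
The attention step described above (score each time step, normalize with a softmax, form a weighted sum of hidden states) can be sketched in NumPy as follows. The parameter names `W_h`, `W_v`, `b`, `u`, the choice of `tanh` as the activation and all sizes are assumptions made for illustration; the paper's exact parameterization may differ.

```python
import numpy as np


def temporal_attention(H, V, W_h, W_v, b, u):
    """H: (L, d_h) hidden states, V: (L, d_v) CNN features for L time steps.

    Returns (attn, context) with attn of shape (L,) and context of shape (d_h,).
    """
    # One unnormalized score per time step; tanh is an assumed activation.
    scores = np.tanh(H @ W_h.T + V @ W_v.T + b) @ u
    # Numerically stable softmax over the L time steps.
    e = np.exp(scores - scores.max())
    attn = e / e.sum()
    # Context vector: attention-weighted sum of the hidden states.
    context = attn @ H
    return attn, context


rng = np.random.default_rng(0)
L, d_h, d_v, d_a = 5, 8, 16, 4   # hypothetical sizes
H = rng.normal(size=(L, d_h))
V = rng.normal(size=(L, d_v))
attn, c = temporal_attention(
    H, V,
    W_h=rng.normal(size=(d_a, d_h)),
    W_v=rng.normal(size=(d_a, d_v)),
    b=np.zeros(d_a),
    u=rng.normal(size=d_a),
)
print(attn.sum(), c.shape)
```

In the full model the context vector would then be passed to the final fully connected layer that outputs one Q-value per action.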

II-D Obtaining depth maps from RGB images

The UAV’s on-board sensor is limited to providing monocular RGB image data. Effective depth prediction from an RGB image is essential when operating in the physical world. Learning a mapping for image translation $G : x \to y$, given image pairs $(x, y)$, is a challenging task in the computer vision community. In this work, we propose the use of a conditional generative adversarial network (cGAN) for this image-to-image translation. This approach uses two separate ConvNets (called the Generator and the Discriminator) with BatchNorm layers and ReLU activation layers. The Generator (G) ConvNet is an encoder-decoder structure with skip connections, designed to generate realistic fake images taking $x$ and a noise vector $z$ as inputs. The Discriminator (D) network classifies randomly picked images as fake or real with a cross-entropy loss. Let $\theta_D$ and $\theta_G$ represent the weights of the Discriminator and Generator networks respectively. The Generator is expected to produce images close to the ground truth, while the Discriminator is supposed to distinguish between fake images and real images. Hence, in a sense, the objectives of these two networks are opposed to each other. The loss function defined below reflects these objectives:

$$\mathcal{L}_{cGAN}(\theta_G, \theta_D) = \mathbb{E}_{x, y}\left[\log D(x, y)\right] + \mathbb{E}_{x, z}\left[\log\left(1 - D(x, G(x, z))\right)\right]$$

In the above equation, the variable $x$ is the RGB image and $y$ is its true depth map. The depth map generated by G is denoted as $G(x, z)$. $D(x, y)$ and $D(x, G(x, z))$ are the probabilities of the image belonging to the real class. Training a cGAN involves a few steps. Initially, the discriminator is trained on real and fake depth images with the correct labels for a few epochs. Following this, the generator is trained using the real/fake predictions from the trained discriminator as its objective. This procedure is repeated for a few epochs until the generated fake depth maps are difficult to distinguish from the real depth maps. The cGAN architecture is illustrated in Fig. 3. The approach also incorporates an $L_1$ loss to generate better near ground truth images:

$$\mathcal{L}_{L_1}(\theta_G) = \mathbb{E}_{x, y, z}\left[\lVert y - G(x, z) \rVert_1\right]$$

Hence the final objective of the model can be analytically represented as

$$\theta_G^* = \arg\min_{\theta_G} \max_{\theta_D} \; \mathcal{L}_{cGAN}(\theta_G, \theta_D) + \lambda \, \mathcal{L}_{L_1}(\theta_G)$$

where $\lambda$ is an adjustable hyper-parameter. In contrast to previous methods ([14]-[17]), our approach learns a loss function adaptable to the input data, making it domain independent and suitable for our problem of intermediate depth prediction for obstacle avoidance.
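
The combined adversarial and reconstruction objective described above can be illustrated numerically with the sketch below. It only evaluates the loss terms from given discriminator outputs and depth maps; the function name `cgan_losses`, the default `lam` value and the small `eps` added inside the logarithms are assumptions for illustration, not values reported in the paper.

```python
import numpy as np


def cgan_losses(d_real, d_fake, y_true, y_fake, lam=100.0, eps=1e-8):
    """Evaluate discriminator and generator losses for one batch.

    d_real: D(x, y) for real pairs, d_fake: D(x, G(x, z)) for generated
    pairs, both probabilities in (0, 1). y_true / y_fake are the
    ground-truth and generated depth maps.
    """
    # Adversarial term: the discriminator wants it large, the generator small.
    adv = np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))
    # Reconstruction term: mean absolute error between depth maps.
    l1 = np.mean(np.abs(y_true - y_fake))
    d_loss = -adv
    # Only the second adversarial term depends on the generator.
    g_loss = np.mean(np.log(1.0 - d_fake + eps)) + lam * l1
    return d_loss, g_loss, l1


rng = np.random.default_rng(0)
y_true = rng.uniform(size=(4, 8, 8))                   # hypothetical depth maps
y_fake = y_true + 0.1 * rng.normal(size=y_true.shape)  # imperfect generator output
d_loss, g_loss, l1 = cgan_losses(
    d_real=np.full(4, 0.9), d_fake=np.full(4, 0.2), y_true=y_true, y_fake=y_fake
)
print(d_loss, g_loss, l1)
```

A larger `lam` pushes the generator towards pixel-wise agreement with the ground-truth depth, while a smaller value leaves more weight on fooling the discriminator.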

Fig. 3: Depth Network: Conditional GAN architecture

II-E Remarks

  1. It must be noted that the depth maps generated from cGANs as described above still provide limited information with respect to the visual geometry of the environment surrounding the UAV (a similar problem when monocular camera images are used). This issue of partial information was highlighted in Section I. The limited information obtained in stages from cGAN can be stored and collected to make a better navigation decision. The task of using all the relevant partial information obtained in the past is done by the LSTM network architecture as described earlier in this Section.

  2. The deep RL method we propose in this section learns optimal Q-values and the optimal policy for the obstacle avoidance task. There are also other policy improvement approaches for learning a good policy. Recently proposed methods like the Asynchronous Advantage Actor-Critic (A3C) [36], deep deterministic policy gradient (DDPG) [37] and dueling network architecture for double deep Q-networks (D3QN) [38] can also be used with our proposed method. For using these methods, one has to change the loss function (2) for the network architecture. Our method involving temporal attention can be easily integrated with A3C, DDPG and D3QN. However, in this paper, our objective is to highlight the need for using LSTM architecture for partially observable scenarios in UAV obstacle avoidance.

III Experimental Setup

III-A Depth Network Settings

The proposed conditional GAN is initially trained on a total of RGB-D image pairs collected from the Gazebo [39] simulated environments, each having different characteristics. We have a total of 22 different simulated indoor environments, of which a few are inspired by [19] while the rest are self-designed. The environments consist of broad and narrow hallways, and small and large enclosed areas with floorings ranging from asphalt to artificial turf. The simulated environments also contain structured and unstructured obstacles like humans, traffic cones, tables etc., placed at random positions and with random orientations. The walls and obstacles, with diverse shapes, textures and colours, provide abundant visual information for effective learning. Fig. 4 shows example snapshots of the environments.

Fig. 4: Screenshots of the designed environments in Gazebo. We cover a large range of colors, textures, sizes and shapes for obstacles and walls.

The RGB-D image pairs are collected using a Kinect sensor mounted on the flying drone in simulation, covering all possible viewpoints. Further, the dataset is augmented off-line by random flipping, adding random jitter and random alteration of the brightness, saturation, contrast and sharpness. The network is trained on the entire collected dataset for epochs in batches of size on an NVIDIA Titan X machine. We require the depth network (trained on the simulated images) to predict depth from unseen real-world images. Predicting depth from simulated images and from real-world images are similar tasks. Thus, it is intuitive to leverage the low-level features learned during training on one task for a different, yet similar task. This is exactly the basic idea behind fine-tuning the depth architecture. Once a neural network has been trained on simulated images, the lower layers of the neural network are frozen (so that the features learned are kept intact). Then, using the real images, one can just retrain the output layer. By freezing the lower layers, we are using the same features learned earlier to predict depth on the real-world images. The major benefit of this approach is that the network works effectively on similar tasks without the need for training from scratch and also requires substantially less data. In our problem, the network is fine-tuned using and augmented pairs from RGBD-human-explore data [33] and NYU2 dataset [32], respectively.

III-B Control Network and Simulation Settings

For RL algorithms to learn an effective collision avoidance policy, the UAV learning agent must have enough experience of undesirable events like collisions. Training a learning algorithm on a fragile drone in a physical environment is expensive and hence the performance of DRL algorithms is usually demonstrated on simulated environments. In this work, we build and test our UAV collision avoidance algorithms on the aforementioned simulated environments. Our method initially trains the UAV by starting off with simple hallway environments free of obstacles. Gradually, the environment complexity is increased by narrowing down the pathways, enclosing the free space and increasing the density of obstacles. The proposed control network is trained to learn the observation-action value over a sequence of the last $L$ observations (depth images received from the simulated Kinect sensor aboard the UAV) corresponding to the three actions “go straight”, “turn left” and “turn right”, respectively. The agent receives a reward after each step and the reward function is defined as

where is the distance to the nearest obstacle at the decision making instant, is the radius of the drone which is set to m and is the threshold distance which is set to m. The reward function shown above penalizes the action of the controller when it is at a distance less than from the obstacle. If the agent collides, the episode ends with a penalty of . Otherwise, the episode continues until it reaches the maximum number (1000) of steps and terminates with no penalty. The agent also receives an additional reward if it chooses the “go straight” action. The bias for the “go straight” action helps the UAV to always move forward and turn only when there are obstacles in its clear view. Additionally, to cope with the exploration-exploitation tradeoff, a linearly annealed exploration policy is utilized during training, with an initially chosen value of $\epsilon = 1$ that eventually drops to a final value of 0.05. The network hyper-parameter values are as shown in Table I.
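
The linear annealing of the exploration rate from 1 to 0.05 can be sketched as below. The start and end values come from the text; the number of annealing steps, the function names and the use of an epsilon-greedy action rule on top of the schedule are assumptions made for illustration.

```python
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.05, anneal_steps=100_000):
    """Linearly interpolate epsilon from eps_start to eps_end, then hold."""
    # anneal_steps is a hypothetical value; the paper does not state it here.
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Hypothetical Q-values for (go straight, turn left, turn right).
q = [0.4, 0.9, 0.1]
for step in (0, 50_000, 100_000):
    eps = linear_epsilon(step)
    print(step, round(eps, 3), epsilon_greedy(q, eps))
```

Early in training almost every action is random, which gives the agent the collision experience it needs; late in training the learned Q-values drive nearly all decisions.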

| Entity | Value |
|---|---|
| Discount factor ($\gamma$) | 0.99 |
| Mini-batch size | 32 |
| Learning rate | 0.0001 |
| Target network update frequency | 400 |
| Input observation size | 84 × 84 |
| Conv1 layer filter size | 8 × 8 |
| Conv2 layer filter size | 4 × 4 |
| Conv3 layer filter size | 3 × 3 |

TABLE I: Hyper-parameter values of proposed control network

For the proposed control network to be applicable to robotic applications, the learned policy should be effectively transferable to real physical systems. However, this is highly challenging because of the huge gap in the visual information available in the real and simulated worlds. Moreover, the depth maps produced by the proposed depth network are noisier than the depth images obtained from the simulated Kinect sensor. To overcome this, we degrade the sensor images with Gaussian blurring, random jitter and superpixel replacement (replacement probability 0.5) at the time of training. This additional noise is crucial for non-linear function approximators like neural networks to learn and generalize well, making them robust and transferable to real-world systems.

IV Experimental Results

IV-A Depth network performance on monocular RGB images

The depth network is trained as mentioned in the previous section. Once trained, we evaluate the performance of the depth network on two measures: the inference speed and the depth prediction quality. The inference rate of deep learning models is critical in robotic applications, especially for effective collision avoidance in flying robots. We tested our model on a machine with an NVIDIA GeForce GTX 1050 mobile GPU, 8 GB RAM and an Intel Core i7 processor, and observed an average inference rate of 20 Hz, which is sufficient for our application. In addition, we also implemented a depth network previously used in robotic applications [14] and noted an inference rate of 1.4 Hz on the same machine configuration.

To assess the depth prediction quality of the cGAN architecture, we evaluate the network on unseen simulated data and the fine-tuned data (real-world images) ( and samples respectively). For evaluation, we compute the $L_1$ and cGAN loss, which has been demonstrated to be a better loss function to generate near ground truth images [34]. Table II depicts the network performance in various scenarios.

| Training set | Testing set | Training loss ($L_1$) | Training loss (cGAN) | Testing loss ($L_1$) | Testing loss (cGAN) |
|---|---|---|---|---|---|
| simulated | simulated | 0.106 | 0.666 | 1.114 | 0.711 |
| simulated (same training as previous case) | real-world | 0.106 | 0.666 | 2.779 | 0.738 |
| simulated + real-world | real-world | 0.135 | 0.692 | 1.792 | 0.695 |

TABLE II: Depth network’s quantitative analysis

The first row reports the training and testing loss on manually collected data (data collection is explained in Section III-A). The second row reports the training loss on our simulated dataset, while the testing loss is computed on a mix of images from the NYU2 [32] and RGBD-human-explore [33] datasets. The third row corresponds to the case where the network was trained on the simulated data and then fine-tuned. The results in the third row show that a network trained in this way generalizes well to real-world data. Fig. 5 shows some samples of the depth maps generated by the cGAN network. The sample images were taken at the Department of Computer Science and Automation, Indian Institute of Science (IISc), and contain humans (acting as obstacles) and hallways with varying illumination, colour and texture that the network has never seen before. The quantitative and qualitative evaluations indicate that the proposed model substantially improves the data cycle rate, which is essential in robotic applications, and can be transferred effectively to real-world systems.

Fig. 5: Example of depth maps generated by the proposed network (trained on simulated data) for completely unseen real world data with variable illumination, color and texture (Red: far, Blue: near)

IV-B Control network evaluation

We evaluate the performance of the proposed control network, i.e., the Deep Recurrent Q-Network with Temporal Attention, and compare it with the previously proposed DQN baseline [23]. We also implement two other policies: random and straight. The random policy picks an action with equal probability for each observation, while the straight policy always picks the “go straight” action. The performance metric is the average number of steps taken until collision with an obstacle. Both the DQN and our proposed model are trained in 12 different simulated indoor environments comprising hallways and rooms with obstacles of varying structures and sizes. Some snapshots of these environments were illustrated in the earlier section. Figures 6 and 7 show the learning curves during training for both algorithms in three different environments. These graphs depict the number of steps the UAV takes until collision. Fig. 7 also shows the performance of DRQN for one such environment.
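The two baseline policies and the evaluation metric can be sketched as below. The environment here is a deliberately trivial, hypothetical stand-in for the simulator (an obstacle a fixed number of steps ahead, with turning resetting the free distance), and the three-element action list is an assumption; only the policy definitions and the steps-until-collision metric follow the text.

```python
import random

# Assumed action set; the text names only "go straight" and "turn left".
ACTIONS = ["go straight", "turn left", "turn right"]

class ToyEnv:
    """Hypothetical stand-in for the simulator.

    An obstacle sits `gap` steps ahead; any turn points the UAV at a
    fresh obstacle-free stretch of the same length.
    """
    def __init__(self, gap=5, max_steps=1000):
        self.gap, self.max_steps = gap, max_steps

    def reset(self):
        self.dist, self.t = self.gap, 0
        return self.dist

    def step(self, action):
        self.t += 1
        self.dist = self.dist - 1 if action == "go straight" else self.gap
        collided = self.dist <= 0
        return self.dist, collided or self.t >= self.max_steps

def straight_policy(obs, rng):
    """Always pick the "go straight" action."""
    return "go straight"

def random_policy(obs, rng):
    """Pick each action with equal probability."""
    return rng.choice(ACTIONS)

def steps_until_collision(env, policy, rng):
    """Run one episode and count steps until it ends (collision or step cap)."""
    obs, steps, done = env.reset(), 0, False
    while not done:
        obs, done = env.step(policy(obs, rng))
        steps += 1
    return steps

def average_steps(env, policy, episodes=200, seed=0):
    """Average episode length over a number of episodes."""
    rng = random.Random(seed)
    total = sum(steps_until_collision(env, policy, rng) for _ in range(episodes))
    return total / episodes

straight_avg = average_steps(ToyEnv(), straight_policy)
random_avg = average_steps(ToyEnv(), random_policy)
```

In this toy setting the straight policy always collides after exactly `gap` steps, while the random policy survives longer because any turn resets the free distance.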

Fig. 6: Training learning curve of the proposed network and DQN for two different environment settings: (a) An open area with scattered static obstacles of varying sizes and structures (b) Maze like environment with narrow pathways and no scattered obstacles
Fig. 7: Training learning curve of the proposed network and DQN for an environment consisting of an enclosed area with scattered static obstacles of varying sizes and structures

As can be observed, partial observability of the environment hinders the performance of DQN in the obstacle avoidance problem. The graphs also show that augmenting a memory network with attention is beneficial: it retains crucial information gathered over time, which gives an additional boost to learning compared to the no-attention counterpart.

IV-B1 Testing in Simulated environments

The trained models are tested on six randomly selected simulated environments out of the twelve environments used for training. The network takes the noisy depth map and outputs the UAV control signal, which is expected to navigate the UAV safely within the environment for a longer duration. Of the six environments used for testing, three comprise enclosed areas with randomly scattered static obstacles of varying sizes and structures (named Env-1, Env-2 and Env-3 in Table III). The fourth environment (Env-4) is a maze-like structure with narrow pathways and no scattered obstacles. The fifth environment (Env-5) is a small enclosed area with poles in between. The sixth environment (Env-6) simulates a cafe and has 7 human actors walking randomly inside it. The actors are not programmed to avoid the moving UAV and their movement paths are completely random. For this cafe-like environment, the model is initially trained with 3 human actors (randomly moving, not designed to avoid the UAV), but tested with 7 moving actors. We analyze the model performance for 200 episodes in each environment; Table III indicates the average number of steps the UAV takes until collision as well as the standard error. From Table III, it can be seen that with our approach the UAV flies for the largest number of time instants until collision.
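Summary statistics of this kind can be computed from the per-episode step counts as sketched below. The helper name `summarize` and the sample values are hypothetical. Both the standard deviation and the standard error are returned, since the text above mentions the standard error while the caption of Table III mentions the standard deviation.

```python
import math
import statistics

def summarize(episode_lengths):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    n = len(episode_lengths)
    mean = statistics.mean(episode_lengths)
    std = statistics.stdev(episode_lengths)  # sample estimate, divides by n - 1
    return mean, std, std / math.sqrt(n)

# Hypothetical step counts from five episodes, for illustration only.
mean, std, sem = summarize([1, 2, 3, 4, 5])
```

The standard error shrinks with the number of episodes, whereas the standard deviation describes the spread of individual episodes, so the two should not be compared directly.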

IV-B2 Results

Fig. 8: Snapshots of UAV avoiding randomly moving human actors. The yellow arrows show the path the UAV selects in order to avoid the obstacle.

A snapshot of the testing setup is shown in Fig. 8, depicting the learned UAV model maneuvering in Env-6 and effectively avoiding the randomly moving human actors inside a cafe. The proposed DRL model also achieves an inference rate of 60 Hz on an NVIDIA GeForce GTX 1050 mobile GPU, which is essential for robotic applications.

Fig. 9 illustrates the weights attributed to a sequence of images from the recent past that are used to determine the UAV’s next move. The images show that, in an environment with non-stationary obstacles, predicting the direction of the next step from only the most recent observation (for instance Frame (i) in Fig. 9) is a complicated task. Memory enables the agent to infer the direction of a moving obstacle (such as a human actor walking right) and thereby perform an appropriate action (“turn left”) to avoid collision. It is important to note that our proposed algorithm outperforms DQN in the different environments. The advantages of the policy learnt by our method are: (i) the UAV smoothly follows a path while avoiding static obstacles and (ii) in the presence of dynamic obstacles which obstruct the UAV’s view, the UAV skillfully chooses actions to avoid collisions with the dynamic obstacles as well. Video results from these experiments can be seen at

A UAV is a power-constrained system. A navigation and obstacle avoidance method must therefore be designed so that it uses the available battery power judiciously. We say that a UAV wobbles when it takes a long sequence of consecutive left and right turns which do not lead to displacement in its position. The UAV does not cover any distance when it wobbles, yet power is still consumed in this sequence of right-left movements. Our method minimizes this motion without displacement, which naturally leads to a reduction in power wastage. To test for energy efficiency, we designed a simulation environment and tested both the proposed method and the previously proposed D3QN algorithm [24] in it. The simulated environment consists of a straight hallway with two turns in between. The navigation task considered is episodic, wherein the UAV starts at a pre-specified initial position. An achievable destination point after the second turn is also specified and the episode terminates when the UAV reaches this destination point. Based on the drop in the battery level and the distance covered, we compute the energy consumption per meter for both methods using the power rating of the battery. We observed that for this simulated environment, the average energy consumption over several runs is Wh/m for our approach and Wh/m for D3QN. This shows that our method achieves a lower energy consumption per unit distance traveled than the D3QN method.
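The two quantities discussed above, energy per meter and wobbling, can be made concrete with a small sketch. Everything here is an assumption for illustration: battery levels are taken as fractions of a capacity given in Wh, a wobble is taken to be a run of strictly alternating left and right turns of at least `min_run` actions, and the numbers in the usage lines are hypothetical rather than values from the experiments.

```python
def energy_per_meter(level_start, level_end, battery_wh, distance_m):
    """Energy used per meter (Wh/m), with battery levels as fractions in [0, 1]."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    return (level_start - level_end) * battery_wh / distance_m

def wobble_steps(actions, min_run=4):
    """Count actions inside runs of alternating left/right turns of length >= min_run."""
    total, run, prev = 0, 0, None
    for action in list(actions) + [None]:  # trailing sentinel flushes the last run
        is_turn = action in ("turn left", "turn right")
        if is_turn and (run == 0 or action != prev):
            run += 1  # the alternating run continues (or starts)
        else:
            if run >= min_run:
                total += run
            run = 1 if is_turn else 0  # a repeated turn starts a new run
        prev = action
    return total

# Hypothetical numbers: a 10% drop of a 50 Wh battery over 20 m.
cost = energy_per_meter(1.0, 0.9, 50.0, 20.0)
trace = ["turn left", "turn right", "turn left", "turn right", "go straight"]
wasted = wobble_steps(trace)
```

A trajectory with fewer wobble steps covers the same distance with fewer actions, which is what lowers the energy per meter.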

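Policy       | Env-1     | Env-2     | Env-3     | Env-4     | Env-5     | Env-6
Straight     | 61 ± 16   | 58 ± 14   | 76 ± 23   | 65 ± 12   | 42 ± 12   | 27 ± 9
Random       | 125 ± 84  | 176 ± 121 | 113 ± 83  | 162 ± 88  | 121 ± 76  | 42 ± 19
DQN          | 207 ± 103 | 229 ± 95  | 286 ± 142 | 634 ± 241 | 384 ± 126 | 162 ± 83
D3QN         | 248 ± 109 | 271 ± 104 | 297 ± 133 | 658 ± 253 | 414 ± 146 | 177 ± 85
Our Approach | 323 ± 134 | 342 ± 131 | 326 ± 156 | 764 ± 273 | 652 ± 243 | 247 ± 77
TABLE III: Results indicating the average number of steps taken by the UAV (along with standard deviation) until collision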

Fig. 9: Temporal Attention weights over the most recent =10 observations
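Attention weights of the kind shown in Fig. 9 can be thought of as a softmax over per-frame scores, followed by a weighted sum of the recurrent outputs. The sketch below is a simplified illustration, not the network used in the paper: it assumes the scoring function is a single dot product with a vector `w`, which would be learned in practice and is random here, and the hidden size of 16 is arbitrary.

```python
import numpy as np

def temporal_attention(hidden_states, w):
    """Softmax attention over time.

    hidden_states: array of shape (L, d), recurrent outputs for the last
        L observations.
    w: array of shape (d,), scoring vector (an assumed, simplified scorer).
    Returns the attention weights (L,) and the context vector (d,).
    """
    scores = hidden_states @ w
    scores = scores - scores.max()  # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()
    context = weights @ hidden_states  # weighted sum of the L hidden states
    return weights, context

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 16))  # 10 most recent observations, hidden size 16
w = rng.normal(size=16)
weights, context = temporal_attention(h, w)
```

The weights are non-negative and sum to one, so frames with larger weights contribute more to the context vector from which the action values would be computed.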

V Discussion

Our method has multiple advantages as well as some limitations that we list below:

  • Our use of a cGAN architecture for depth prediction in autonomous aerial systems is novel. Notably, the proposed network is trained on simulated data with only a little fine-tuning on the NYU2 and RGBD-human-explore datasets. The results indicate that the model generalizes well and is suitable for real-world applications. As demonstrated by our results, the high inference rate and transferability of the approach make it a suitable candidate for intelligent robotic applications.

  • We show in our experiments that augmenting DRL with memory networks and temporal attention helps the agent retain vital information gathered from past observations. This helps the agent make better-informed decisions. This learning ability allows the autonomous agent to maneuver safely in environments without prior knowledge of the surroundings, as well as in environments with moving obstacles. Furthermore, the agent is able to move deftly near corners (see the supplementary video), which has been found to be a challenging task for previously proposed controllers ([19], [23]).

  • The reward function is designed by considering the energy constraints on aerial systems and the time factor in navigation tasks. The bias towards the “go straight” action in the reward function ensures that the UAV maintains its course except when avoiding obstacles in its field of view. In addition, compared to the D3QN approach, the proposed controller gives smoother trajectories and minimizes UAV wobbling, which would otherwise waste a lot of energy and is highly undesirable in UAV applications. Our control method minimizes this power wastage and yields considerable power savings. The bias towards the “go straight” action might be problematic at intersections, where the UAV has to turn right or left. However, we would like to emphasize that our proposed method handles only obstacle avoidance and can be easily integrated with a high-level path planner that computes the path from the start to the goal position.

  • Although the proposed depth prediction network learnt to predict depth maps from unseen physical-world images, the results are noisy. The control network trained with manually added noise generalizes and adapts to this noise. However, there is scope for improvement as far as the depth network is concerned. Training the depth network on visually high-fidelity simulated data could yield smoother depth predictions.

VI Conclusions and Future Work

In this paper, we design and analyze the performance of a Deep Recurrent Q-Network with Temporal Attention, which is used by a deep RL robotic controller for effective obstacle avoidance of a UAV in cluttered and unseen environments. The proposed method first uses the cGAN network to predict the depth map from a monocular RGB image, which is then used to decide the optimal action. The method addresses the problem of partial observability in obstacle avoidance by retaining the crucial information over a long sequence of observations. Experimental results over various settings exhibit significant improvements over the Deep Q-Network (DQN) and D3QN algorithms. A potential future direction for our work would be to improve the visual quality of the images generated by the cGAN architecture. In GAN architectures, the discriminator block captures the class-specific content from images without imposing constraints on the visual quality of the generated images. The cGAN architecture can be made to generate good quality images by suitably modifying the loss function. Some similarity indices which guarantee structural integrity (e.g., multiscale structural similarity, MS-SSIM) can be used for this purpose (see [40]). Another future enhancement would be to use different GAN architectures for depth prediction (see [41, 42]).

The proposed obstacle avoidance method is seen to work well in indoor environments (see Section IV). However, we would also like to test its performance in real outdoor environments. An exciting possible line of research is to learn concise abstractions of history in recurrent networks that are sufficient for optimal decision-making. We would also like to incorporate scene prediction [43] to learn better navigation controls for avoiding obstacles. Regret minimization is another criterion used in RL. Though it has been explored for games like VizDoom and Minecraft [44], it has not been explored in robotics. It will be interesting to see which policies guarantee low regret in UAV obstacle avoidance and how such policies can be interpreted.