There has been an increasing interest in the area of active perception over the last decade or two. A recent DARPA challenge  poses a problem of exploration of unknown, unstructured, large-scale and cluttered underground environments. Like many other sub-problems from the challenge, the path planning problem suffers from uncertain and incomplete sensor information due to noise and limited field of view respectively. Moreover, the computation time and the effectiveness are the essential attributes of a planning method. Mapping-based techniques perform well to generate a path according to a global mission strategy. Commonly used mapping tools like OctoMap  and Voxblox  have been popular recently, in generating a map representation of an environment using 3D pointclouds. However, mapping is a slow processes. For large environments, planning over such global information contributes to additional computation time overhead. Moreover, the coarseness of the map being used for planning over large scale environments may not be suitable to detect intricacies in the environment. A fundamental requirement for a mapping-based planner to be effective, is a precise position feedback of the robot. In practice, such estimates are expensive to obtain, especially locally, in terms of both computation and payload. Therefore, there is a high chance that an agile robot, such as an MAV, does not map small and thin obstacles. Such obstacles, like hanging cables and pipes, are inherent to many complex indoor environments like caves, mines and warehouses. Hence, there is a need to use faster and higher resolution raw sensor measurements to guide a robot. Moreover, a full 3D solution is required to navigate through uncertain and complicated structures. The method proposed in this paper, utilizes raw sensor measurements to generate robot actions without relying on a map representation of an environment.
Researches have shown in past that living organisms  like insects [10, 3] rely on instantaneous optic flow signals to provide visual cues for reactive obstacle avoidance. One component of an optic flow signal is the sense of how far objects are located in an environment. Depth cameras provide this information in the form of depth images relative to their sensors. A depth image stream can have varying noise levels depending on the type of sensors, the calculation methods involved and abrupt reflections from the objects in the environments. This makes it challenging to obtain useful information to feed into most motion planners that demonstrate well in the perfect observability settings.
Fragoso et al.  proposed a way of keeping only a local egocentric map of the environment in the form of a cylinder that updates in real-time through temporal fusion of limited field-of-view visual information. Some other works ,  and  purpose inflation of depth images using C-Space expansion to facilitate path generation inside a disparity space. Later work by Dubey et al. 
highlights the importance of considering small obstacles like wires, in autonomous robot navigation, by explicitly detecting them using convolutional neural networks. Authors in and  demonstrated closed-loop RRT and LQR based techniques, respectively, for path planning inside a depth image space. Reactive controllers, especially the ones based on Artificial Potential Field (APF) , and bio-inspired [4, 14, 19] methods have demonstrated speed and light computational burden.
This paper proposes a solution for the obstacle avoidance problem that 1) generates robot actions directly from an imperfect depth image stream hence providing an end-to-end solution, 2) has a fast update rate to cater for the needs of agile robots, 3) is computationally feasible for robots with limited on-board resources such as MAVs, 4) adapted for robot maneuvers in 3D to make it useful for arbitrarily structured environments. The perception solution to make such control feasible is the foremost contribution of the paper. A particle filter technique is proposed to probabilistically identify and track regions corresponding to potential obstacles, inside a sequence of depth images. At each time instant, the repulsive potential applied by each candidate for such a region, is weighted according to its probability to compute a total repulsive potential. To close the loop, a user-defined goal point generates an attractive potential and finally the robot is commanded to follow the negative gradient of the resultant potential. Although not uniquely defined in the literature, the term reactive obstacle avoidance typically refers to the methods that require only local environment information to generate steering commands away from obstacles and towards the goal, instead of planning paths inside a global map. Our method slightly deviates from the traditional context of reactive obstacle avoidance in that it uses particle filter on a history of depth images in order to estimate the local environment information. However, the APF controller in-loop only requires the local information and no map information is available or saved during flight, hence, the method proposed in this paper falls under the umbrella of reactive methods for obstacle avoidance. Fig. 2 shows the simplified process flow, various components of which are explained in the following sections.
The rest of the paper is organized as follows. Section II highlights the problem formulation followed by the description of the filter design, in Section III. Section IV provides a description of how the control loop is closed with the perception information. Section V presents the results and analysis from physical experiments. Finally, Section VI concludes the discussion.
Ii Problem Setup
Problem 1: Let be a set of all collision free robot configurations, be the goal region and be a set of available robot actions. Given a history of depth images , at any time instant , choose an action such that and , at some finite final time .
We aim to solve a well-defined problem of moving a robot from an initial to a final point while avoiding collisions. In the APF approach, the robot behaves as a point charge. The goal point generates an attractive potential for the robot which vanishes as the robot gets closer to the goal. The proximity obstacles are treated as a collection of point charges to generate a repulsive potential to the robot. The robot maneuvers towards the goal, following the negative gradient of the net potential. If is the vector from the robot’s body frame origin to the goal and is the obstacle’s position in the robot’s body frame, the attractive and repulsive potential functions, and , are given by,
Here, defines the sensing horizon over which the repulsive potential is effective. A quadratic potential well is a typical choice for the attractive potential function since it leads to a control law that is linear and all other potentials approximate to quadratic for small perturbations in . However, the conic potential  provides a constant control law except at the goal . This behavior is more suitable around the obstacles. The conic attractive function is, therefore, often used close to obstacles and is switched to the parabolic function in close proximity of the goal. In (1), refers to the radius around the goal where the conic potential function switches to the parabolic potential function. In order to ensure continuity, the conic and quadratic potential gains are set to and respectively.
An accurate knowledge of an obstacle’s location relative to the robot, , is an important variable for the effectiveness of the APF method. Unlike an RGB image, a single channel depth image encodes this information up to an allowable camera resolution. However, depth cameras may have various types of imperfections, making it difficult to rely on the raw output, especially when a system is very sensitive to an incorrect feedback. The following section explains the proposed observer design to estimate such information from a sequence of depth images. Fig. 1 shows a depiction of the APF-PF problem.
In a recent effort to simulate a No-Depth-Return (NDP) type of noise, authors in  referred to depth imaging as an area in robotic perception that suffers from a large simulation-reality gap due to scene-dependent noise. This sort of noise is exceptionally difficult to model hence the literature provides little insight into this topic. We base our model on the noise characteristics that adversely affect the detection of potential obstacles in a depth image. These considerations include the density of the clusters, their distance from the camera and their ability to appear, disappear and move during the consecutive image frames. The goal of this problem is to detect candidates for the occupied region inside a depth image. Particle filtering is a sampling-based approach to state estimation that can handle large state spaces.
The particle filter problem parameters can be defined as a tuple , where and define the set of all possible states and observations respectively. In our case, is a set of perception states which refers to the occupied regions inside a depth image. This encodes the information about without requiring the robot and the obstacles to be mapped in a global frame to generate the repulsive fields. The probability of a state , at any time instant , to transition to a state , at the next time instant, is defined as . Similarly, is the probability of an observation , given a transition to a state . A particle filter outputs a belief
at every iteration which refers to the probability distribution over all possible states given an observation history. This encodes the relevant information from the entire sequence of observations without needing to explicitly store it.
The following subsections explain various components of the perception problem.
Iii-a State Space Transition Models
Any point in an image can correspond to an obstacle in a completely unknown environment and, hence, can potentially be a state. However, at this resolution, the state space size can be large. For a 640480 image, it is 307200 times the number of possible depths at which an obstacle can be located. This requires the flexibility to discretize such information. We discretize the input image in a 2.5D fashion. This translates to discretization over pixels in the plane parallel to the camera and over depth in the plane perpendicular to the camera. This allows us to make detections directly inside a depth image space in the camera frame, , without projecting the whole image to a 3D space for processing. Moreover, the discretization reduces the state space size up to a user defined refinement depending on the amount of compute available. This results in a voxel grid where the 2.5D coordinate of each voxel belongs to the state space. When projected to 3D, it forms a pyramid in case of a pinhole camera model, with apex corresponding to the optical center of the camera. This, hence, does not require explicit knowledge of the camera model during the perception process. The 2.5D voxel grid covers the entire sensing horizon (field-of-view). Fig. 3 shows such a discretization process, along with the relevant frames of reference. The full state space can be written as,
where are the pixel discretization steps along the width and height of an image and refers to the discretization step along the depth. Similarly, and are the number of discretization steps along the respective axes. In order for the belief to sum to 1 at each time instant, the state space model also includes a state, , referring to obstacle-free sensing horizon. The probability of this state, in the belief vector, informs about the likelihood of there being no obstacle in the field-of-view. In the state space model, this state refers to the voxel grid boundary. The idea is that when there is no obstacle inside the sensing horizon, there is always an obstacle surrounding the voxel grid almost surely, and with non-zero probability of transitioning to the voxel grid.
The state transition probability is defined as the product of two factors:
models the relative motion of the obstacle using a normal distribution centered at the current state,, with depending on the expected speed of the environment obstacles relative to the robot. This means that an occupied voxel is more likely to transition to a nearby voxel than to a farther one. The states that are close to the voxel grid boundary are highly likely to transition in and out of the field-of-view. The second term, , biases transitions toward the vehicle. Specifically, this bias is defined with a normal distribution centered at zero, , where is the depth associated with the state . The bias helps the perception module to make a more conservative prediction of an obstacle’s behavior i.e., coming towards the robot with high probability. Moreover, the closer proximity obstacles have a significantly higher contribution to the repulsive potential than the farther ones. The bias towards approaching obstacles plays an important role in ensuring a safe maneuver for a fast moving robot.
Iii-B Observation Model
An observation is given by,
In the expression above, represents the number of points contained in the voxel . The observation model refers to the probability of an observation, , after transitioning to a state i.e., . For any state inside the voxel grid, , this can be written as,
Dense cluster of points inside a depth image serves as an important feature to determine that a region is a valid projection of a physical environment object. Therefore, in order to gain confidence on a voxel, the number of points contained within it is observed. Hence, we can conveniently exploit independence of variables. Equation 4 can then be reduced to,
where is the number of points contained in the state .
Using Bayes’s rule on the term , and assuming no prior information about for any voxel ,
Here is the maximum number of points that can be contained in a voxel and depends on the discretization intervals. The tuning parameter, , determines the number of points required to gain confidence on any state inside the voxel grid. Low values result in conservative detections where large number of points are required to be certain about a state.
The probability that there is no obstacle inside the voxel grid depends on the number of points in the most populated voxel. This is designed so that, for the system to believe that there is an obstacle in the field-of-view, a significantly large cluster of points needs to exist in atleast one of the voxels. This models the behavior of the outliers in a depth image and helps filter out sparse noise very efficiently. A larger value ofhelps reject greater number of outliers per voxel. The coexistence of the models defined by (6) and (7) is shown in Fig. 4. By appropriately varying and , a variety of different filtering behaviors can be achieved.
Iii-C Particle Filter Algorithm
The particle filter algorithm is executed on the state and observation models keeping in view the applicability of the approach to systems that require fast and real-time feedback. The transition probabilities, , depend on the image and discretization parameters which are kept constant throughout the process. Therefore, the first step is to calculate a transition probability lookup table for all possible states. At the start of the run, the belief is initialized with equal number of particles for every state. Every time an image is received, it is first discretized and each voxel is assigned the number of points contained within it. All the particles in the belief vector are then propagated forward in time by randomly drawing, for each particle, a next state from the distribution defined by . Each propagated particle is then assigned a weight based on the observation model (6), (7). Finally, the particles are drawn at random according to the distribution of weights.  The probability distribution over the entire state space given the history of depth images, can be computed from the belief vector at any time step. Algorithm 1 summarizes the estimation process. In the particle filter formulation, the belief about a state is represented by the number of particles associated with that state. Probability of a state, , at any time instant , can therefore, be obtained by,
where is a vector keeping track of the number of particles with length equal to the total number of states.
Iv APF Control
The mapping between a pixel inside a depth image to its 3D coordinates in the camera’s optical frame, , is defined by the camera’s projection model. For a pinhole camera model, its governed by the following set of equations,
where the intrinsics are defined by the camera’s focal lengths, , , and the location of its optical axis, , in pixels. The coordinates of a voxel at time instant are referred to as , with the projected 3D point being . Using (1), (2), (9), and a belief, , the net potential, , on the robot is found as,
where, is the transformation from the camera frame, , to the robot’s body frame, .
V Implementation and Results
For the ease of integration on an actual robot, we program the perception and control stack in Robot Operating System (ROS) C++ environment. The first subsection examines filter’s performance to detect a thin cable in the presence of varying levels of added noise. The second subsection shows results from the flight testing on a real quadrotor UAV robot. For both of the tests, an Intel Realsense D435  camera is used to obtain the depth information. The image resolution is set to 640480 pixels.
V-a Thin Cable Test
Thin and small obstacles pose challenge to flying robots since they are extremely difficult to be mapped. Hanging cables and wires serve as one of the worst case scenarios for this category of obstacles. An extension cord, 8 mm in diameter, is chosen to put filter’s performance to test while keeping the robot static. Fig. 5(a) shows the RGB image of the hanging cord as seen by the forward facing camera on the stationary robot. Figures 5(b),(c) show two snapshots of the depth image when projected to the robot’s body frame coordinates. Each 3D point in the figures corresponds to a pixel in the depth image. The maximum number of points an object can occupy in a 640480 image is 307,200. However, the camera only manages to see less than 300 points on the cord, diminishing to far less than that in many frames.
In three different tests, we added gaussian noise of standard deviation 0.2 m to the depth associated with every pixel, every other pixel and every 10pixel in every input image. A fourth test is performed as a control without any noise. The discretization intervals are chosen to be . The state transition and observation models’ covariance parameters are set to . Additionally, 20,000 particles are used for particle filter estimation. In the APF-PF framework, the repulsive potential applied by each depth image voxel is weighted according to its probability. The probabilities, and hence the influence of the repulsive fields generated by different voxels, are relative. Therefore, observing the most probable voxel provides a useful insight into the perception performance while keeping the illustration legible. Fig. 6 shows the depth of the most probable voxel transitioning over time for all four noise levels. The size of each vertex in the plot is set proportional to the belief of the corresponding voxel. When the noise is added to every pixel on the cable, there is no information in the image and hence the detections do not necessarily correspond to any physical object in the scene. A bias can be seen towards closer distances due to term in (3). As the noise gets less dense, the detections appear to happen where the cord is located, with increasing confidence as the density of noise decreases. Like any other observer, there is a performance trade-off if an object of interest resembles the expected sensor noise, as is the case with the cord. Therefore, even with no added noise, the valid detections get less certain during some time instances. This is also because in some frames, for instance Fig. 5(b), the structure of the cord disappears to very less number of points before reappearing again in the depth image. The filter does reasonably well in keeping track of the object during those instances. Nevertheless, if the objects of interest in the depth image are more observable or if the noise is more sparse, and can be varied to get more accurate detections.
V-B Physical UAV Experiments
We performed several flight tests in a high bay area approximately 17 m long and 8 m wide. Results from all of the flight tests are shown in this section. The quadrotor UAV (Fig. 7) is built out of a 540 mm airframe equipped with a forward facing Intel Realsense D435 depth camera. The localization estimates on the UAV are provided by a 3D Ouster LiDAR [21, 12] along with a PX4-based flight controller . The Ouster LiDAR being used has a maximum range of approximately 100 m, a resolution of 102464 with horizontal and vertical field-of-views of 360 and 45 degrees, respectively. This makes it very well-suited for localization, but the sparseness of its depth image and the performance degradation at close ranges motivates us to use the Intel Realsense for small obstacle avoidance. Since, we are relying on one camera for these experiments, its utility is increased by adding the non-holonomic constraints on the robot in the plane of its body frame, . The steering command is generated so that the UAV faces the negative gradient of potential, , defined in .
where , , and are the forward and vertical velocities and the yaw rate in , respectively. The maximum velocities are regulated by , , and . Additionally, , , and are the , and components of . The APF method inherently assumes a holonomic vehicle model. However, by setting sufficiently larger than , the vehicle’s translation closely approximates the holonomic behavior by making yaw to converge much faster than any translation component. In order to avoid the forward velocity to be negative, is lower bounded by zero.
The covariance parameters, and , for the particle filter, are set to 170 and 60 respectively, while keeping all other parameters the same. The control parameters are used as, , , , , , . Fig. 8 shows the setup with randomly placed obstacles of different reflectivity and of diameters, 270 mm, 90 mm and 22 mm in the course. Each UAV flight path starts from one of the four regions A, B, C or D and ends in one of the three regions B, C or D. The attractor point corresponding to each region is located at its center. The combined duration of the flight tests is approximately 8 min with more than 150 m of distance traveled by the UAV safely. The UAV performs 3 A to B (A-B), 9 B to C (B-C), 8 C to D (C-D), 1 D to C (D-C) and 6 D to B (D-B) flights. The paths followed by the UAV for all the flights, as reported by the LiDAR odometry, with the locations of the obstacles and attractors are shown in Fig. 10. For the purpose of further analysis, Fig. 9 shows the performance of the proposed method over one randomly chosen A-B-C-D run. If the APF method was applied directly on the raw sensor measurements, the area of lowest depth would make the most contribution to the repulsive field. In that case, even low levels of noise can prove highly compromising to system’s safety especially if such an area corresponds to noise. The plot shows the depth of the most probable voxel compared against the depth of the closest point inside the raw depth image, at each time instant. The magnitude of the repulsive force is overlaid on the plot. In order to highlight the frequency of raw sensor noise, we hand-labelled the sections in the plot where the UAV had a physical object inside its field-of-view. These sections are shaded green in the figure. At various instances throughout the flight path, the UAV perceived noise that did not belong to any physical object. The repulsive force magnitude is a direct measure of the vehicle’s response to those observations. Lastly, the computation time for the complete software stack, over several thousand frames, is shown in Fig. 11.
The paper proposed a perception technique to effectively use an APF-based motion planner on a raw depth image stream. Unlike popular map-based planners, the method is well suited for real-time maneuvers in agile robots especially when precisely building a map is not feasible. The technique is shown to work effectively to perceive and avoid small and thin obstacles from a sequence of noisy depth inputs. The noise level, density and frequency varies with lighting conditions, reflectively of surrounding objects, and other environment artifacts. Moreover, depth from stereo is governed by different optics and computation methods than that of a Time-of-Flight (TOF) technology. The method proposed in this paper provides a flexible probabilistic framework to filter out various noise types by carefully tuning the relevant parameters. Finally, the robustness and efficiency of the perception and control stack is demonstrated through physical UAV experiments.
The video of all the flight tests is posted at https://youtu.be/oB-WNkrNwR8.
This work was supported through the DARPA Subterranean Challenge, cooperative agreement number HR0011-18-2-0043.
-  (2019) Real-time quadrotor navigation through planning in depth space in unstructured environments. In 2019 International Conference on Unmanned Aircraft Systems (ICUAS), pp. 1467–1476. Cited by: §I.
-  (1983) Control of manufacturing processes and robotic systems, chapter impedance control as a framework for implementing obstacle avoidance in a manipulator. Boston. Cited by: §II.
-  (2002) Neural networks in the cockpit of the fly. Journal of Comparative Physiology A 188 (6), pp. 419–437. Cited by: §I.
-  (2009) Implementation of wide-field integration of optic flow for autonomous quadrotor navigation. Autonomous robots 27 (3), pp. 189. Cited by: §I.
-  (2020)(Website) External Links: Cited by: §I.
-  (2020)(Website) External Links: Cited by: §V-B.
-  (2017) Droan—disparity-space representation for obstacle avoidance. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1324–1330. Cited by: §I.
-  (2018) DROAN-disparity-space representation for obstacle avoidance: enabling wire mapping & avoidance. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6311–6318. Cited by: §I.
-  (2018) Dynamically feasible motion planning for micro air vehicles using an egocylinder. In Field and Service Robotics, pp. 433–447. Cited by: §I.
-  (2001) Fly flight: a model for the neural control of complex behavior. Neuron 32 (3), pp. 385–388. Cited by: §I.
-  (2000) New potential functions for mobile robot path planning. IEEE Transactions on robotics and automation 16 (5), pp. 615–620. Cited by: §II.
-  (2016) Real-time loop closure in 2d lidar slam. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1271–1278. Cited by: §V-B.
-  (2013) OctoMap: an efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots. Note: Software available at http://octomap.github.com External Links: Cited by: §I.
-  (2009) Bioinspired visuomotor convergence. IEEE Transactions on Robotics 26 (1), pp. 121–130. Cited by: §I.
-  (2020)(Website) External Links: Cited by: §V.
-  (1986) Real-time obstacle avoidance for manipulators and mobile robots. In Autonomous robot vehicles, pp. 396–404. Cited by: §I.
-  (2015) Decision making under uncertainty: theory and application. MIT press. Cited by: §III-C.
-  (2014) Stereo vision-based obstacle avoidance for micro air vehicles using disparity space. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3242–3249. Cited by: §I.
-  (2020) Reactive control and metric-topological planning for exploration. In 2020 IEEE International Conference on Robotics and Automation, Cited by: §I.
-  (2017) Voxblox: incremental 3d euclidean signed distance fields for on-board mav planning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §I.
-  (2020)(Website) External Links: Cited by: §V-B.
-  (2009) Efficient c-space and cost function updates in 3d for unmanned aerial vehicles. In 2009 IEEE International Conference on Robotics and Automation, pp. 2049–2054. Cited by: §I.
-  (2019) A supervised approach to predicting noise in depth images. In 2019 International Conference on Robotics and Automation (ICRA), pp. 796–802. Cited by: §III.
-  (1990) Manipulator control with superquadric artificial potential functions: theory and experiments. IEEE Transactions on Systems, Man, and Cybernetics 20 (6), pp. 1423–1436. Cited by: §II.
-  (2001) Optic flow is used to control human walking. Nature neuroscience 4 (2), pp. 213–216. Cited by: §I.