GPU Accelerated Voxel Grid Generation for Fast MAV Exploration

12/25/2021
by Charbel Toumieh, et al.

Voxel grids are a minimal and efficient environment representation used for robot motion planning in numerous tasks. Many state-of-the-art planning algorithms use voxel grids composed of free, occupied, and unknown voxels. In this paper we propose a new GPU-accelerated algorithm for partitioning space into a voxel grid with occupied, free, and unknown voxels. The proposed approach is low-latency and suitable for high-speed navigation.


I Introduction

Many sensors (RGB-D cameras, stereo matching, …) output dense point clouds as measurements, which must be processed and turned into an environment model/representation for motion planning. Fast 3D environment modelling is crucial for real-time high-speed motion planning and exploration.

Many state-of-the art techniques use voxel grids for planning [ryll2019efficient] [tordesillas2020faster] [tordesillas2018real] [toumieh2020planning]. Some of them use the grid for Safe Corridor generation [liu2017planning] [toumieh2020convex], while others use it for direct collision checking. They need the grid to be partitioned into occupied, free and unknown voxels. The fast generation of such grids is the main objective of this paper.

We will first present the related work as well as our main contributions. Then, we will outline the different steps of the method and show the simulation results. Finally, we will compare it with mit-acl-mapping (https://gitlab.com/mit-acl/lab/acl-mapping) used in [ryll2019efficient]. Our method is implemented on both CPU and GPU, and we discuss the performance difference throughout the paper.

I-A Related Work

Numerous 3D space representations exist such as signed distance fields [oleynikova2017voxblox] [han2019fiesta], octrees [hornung2013octomap] [duberg2020ufomap], and voxel grids [ryll2019efficient].

Signed distance field grids store, in every voxel, the distance and gradient to the closest obstacle, which is essential for gradient-based planning methods. The advantage of voxel grids over signed distance fields is processing time, since they don't need to calculate the distance of every voxel to the closest obstacle.

Octrees are a tree data structure where each node has eight children. Octree based representations such as Octomap [hornung2013octomap] are an efficient probabilistic 3D representation of the environment. The advantage of voxel grids over octrees is the constant voxel access (read/write) time.

In [hermann2014unified], the authors use a General-Purpose Graphics Processing Unit (GPGPU) to populate the voxel grid. The authors raycast every measurement point in a Bresenham fashion [bresenham1965algorithm] and update the voxels according to the sensor model.

In [ryll2019efficient], the authors use [bresenham1965algorithm] to trace a ray for every pixel of the image and free the traversed voxels between the camera center and the pixel depth. This approach has a high computational cost and becomes intractable for medium/high-resolution RGB-D cameras.

I-B Contribution

In this work we propose a new implementation of a voxelization algorithm on the GPU that minimizes computation time as much as possible. It takes as input a dense point cloud and outputs a local voxel grid centered at the robot with occupied, free, and unknown cells. However, we do not use a probabilistic approach, as it reduces GPU performance in our implementation (it requires atomic operations).

II Nomenclature

We define the various abbreviations and variables used throughout this paper in Table I. All distances and coordinates are in meters, except for voxel coordinates, which are integers.

local voxel grid
measurement voxel grid
grid size in the x direction
grid size in the y direction
grid size in the z direction
voxel cube side length
camera field of view in the x direction
camera field of view in the y direction
IR depth sensor range
the cleared range in number of voxels
number of voxels covered by the field of view at the maximum range in the x direction
number of voxels covered by the field of view at the maximum range in the y direction
x voxel coordinate in the grid frame
y voxel coordinate in the grid frame
z voxel coordinate in the grid frame
transform from camera to world frame
transform from world to grid frame
obstacle point in the camera frame
obstacle point in the voxel grid frame
camera position in the voxel grid frame
number of voxels to inflate
direction of the ray to trace
starting point of the ray to trace
maximum ray traversal distance
Table I: Nomenclature
Figure 1: We show a 2D example of the grid, with the origin set such that the robot/camera position is at the center of the grid.

III GPU architecture

In this section we briefly describe the GPU architecture/nomenclature and how the GPU executes instructions. This will help make sense of the performance benchmarks in the simulation section. We will describe NVIDIA's Turing architecture, which our GPU (RTX 2060) is built on [cuda_c].

The basic building block of a GPU is the Streaming Multiprocessor or SM. Each SM in the Turing architecture contains 64 cores (2 groups of 32 cores) and has its own L1 shared memory. In every group of 32 cores, all cores execute the same instructions simultaneously, but with different data (Single Instruction Multiple Threads - SIMT). Thus we get 32 threads (called a warp) all doing the same thing at the same time.

The warps (groups of 32 threads) are grouped into blocks and every block can be executed on one SM only. A group of blocks is called a grid.

Throughout the paper we will measure the efficiency of our implementations through active threads per warp (how many of the 32 cores are active - the higher the better), and the number of warp cycles per executed instruction (the lower the better).

Active threads per warp also depend on predication when we have if - else conditional statements. With predication, the GPU evaluates both sides of the conditional statement and then discards one of the results, based on the value of the boolean branch condition in each thread. Predication is in general effective for small branches.

The GPU code is written in CUDA [nickolls2008scalable].

IV The method

The objective is to generate a voxel grid representation of the world around the robot. The method takes a dense point cloud (from stereo matching) or a depth image (from an IR depth sensor) in the camera frame, and populates a voxel grid with occupied, free, and unknown cells.

The method consists of updating a sliding local voxel grid, centered on the robot's position, with a measurement grid that represents the latest measurement. It is divided into 5 steps (functional blocks):

  1. Voxel grid setup.

  2. Populate occupied voxels.

  3. Ray-trace to free voxels in camera field of view.

  4. Update the local voxel grid with the measurement voxel grid.

  5. Shift the local grid to be centered at the robot’s position.

A diagram showing a global view of the execution pipeline of steps 1-4 is shown in Fig. 2.

Figure 2: We show the execution pipeline of steps 1-4 for one iteration, after receiving a measurement (point cloud). In the presented case, a dynamic obstacle moves away from the camera while the robot stays static. Free voxels are white, occupied voxels are red, unknown voxels are grey, and unknown-and-traced voxels are black. The camera field of view is shown in green and the point cloud as red circles.

In the voxel grid setup, we initialize the local voxel grid and the measurement voxel grid. The local grid is initialized only once at the start of the algorithm, and is updated at every measurement with the measurement voxel grid, which is reset before each measurement. We then set the voxels that contain obstacles to occupied in the measurement voxel grid (populate occupied voxels).

An efficient ray-tracing method is used in the third step (ray-trace to free voxels in the camera field of view). Instead of tracing each pixel, we leverage the voxel grid structure to considerably reduce the number of rays to trace, which accelerates the overall processing time. It is inspired by [Yguel2006EfficientGC] and [oleynikova2017voxblox], which use ray bundling.

We then update the local voxel grid with the measurement grid, whose voxels we populated with occupied voxels and freed with ray-tracing. Finally, we shift the local grid so that it stays centered at the robot position as the robot moves. We assume a regular voxel grid, but the method can be generalized to irregular grids. Each step, as well as its GPU implementation, is discussed in what follows.

IV-A Voxel grid setup

We have 2 grids: the local voxel grid, which is the representation of all previous measurements, and the measurement voxel grid, which is the representation of the latest measurement. Both grids are of the same size. It is possible to use only one grid (the local grid). However, this would require clearing all the voxels in the camera field of view (using ray-tracing) before applying steps (2-4). We found empirically that this results in a higher computation time than using 2 grids and merging them together (due to the high computational cost of ray-tracing).

The origin of the local voxel grid is initialized such that the initial robot/camera position is at the center of the grid. The orientation is the same as the ENU (East North Up) frame (Fig. 1). The grid is then shifted as the robot moves. All voxels are initialized as unknown, and are updated when the measurement voxel grid is merged with the local voxel grid. The voxels of the measurement voxel grid are set to unknown before every single measurement. They are then changed by steps 2 and 3.

Figure 3: We show a 2D example of the rays (in green) we trace using our method. The field of view of the camera is limited by the red lines.
(a) Simple ray-tracing
(b) Our ray-tracing
Figure 4: We show the points of a dense pointcloud in red, the unknown voxels in grey, occupied voxels in red, free voxels in white and unknown and traced in black. Our method (b) traces less rays with a close end result to (a). This speeds up computation time.

IV-B Populate occupied voxels

In this step, we set all the voxels in the measurement grid that contain obstacle points to occupied using Alg. 1.

This is done by first transforming the points from the camera frame to the voxel grid's origin frame (line 2), then dividing the coordinates by the voxel size to get the integer coordinates of the occupied voxel (lines 3-5). For every measurement/camera pose, the transformation matrix is the same: the product of the world-to-grid and the camera-to-world transforms. It is calculated on the CPU and passed as an argument to the GPU implementation.

Furthermore, certain planning methods require inflating the obstacles by a number of voxels. This can be implemented in this step (lines 6-9).

On the GPU, the frame transformation is done in parallel, as well as setting the corresponding voxels to "occupied". If multiple threads in a warp want to concurrently set the same voxel to "occupied", only one of them succeeds while the others are discarded [cuda_c].

1: function AddOcc(p_cam)
2:     p_grid ← T_world→grid · T_cam→world · p_cam
3:     x ← ⌊p_grid.x / voxel_size⌋
4:     y ← ⌊p_grid.y / voxel_size⌋
5:     z ← ⌊p_grid.z / voxel_size⌋
6:     for i ← x − n_inflate to x + n_inflate do
7:         for j ← y − n_inflate to y + n_inflate do
8:             for k ← z − n_inflate to z + n_inflate do
9:                 measurement_grid[i, j, k] ← occupied
Algorithm 1 Populate occupied voxels

IV-C Ray-trace to free voxels

In this step, we free all the voxels in the measurement grid between the camera center and each occupied voxel: if any of them contained an obstacle, it would have been detected in the dense pointcloud representation.

Instead of tracing every point that is in a dense pointcloud, or every depth pixel in a depth image, we adopt another approach that significantly decreases the number of rays to trace (Alg. 2).

First, we determine the maximum distance we want to clear (e.g. the range of the IR depth sensor) and calculate the number of voxels covered by it. This is done by dividing this distance by the voxel size.

Then, we determine the number of voxels covered in the x direction using the field/angle of view of the camera in the x direction (Fig. 3).

We also determine the number of voxels covered in the y direction using the field/angle of view of the camera in the y direction.

Note that all three voxel counts are rounded to the closest integer.

Using [amanatides1987fast], we trace all the rays that start from the camera center, with each ray having as direction one of the vectors going from the camera center to the voxels on the far boundary of the cleared volume. We subtract 1 from the x and y voxel counts since by construction they are odd numbers (Fig. 3). Since the camera frame is not always aligned with the voxel grid frame, we have to transform the ray direction vector from the camera frame to the voxel grid origin frame (line 4).

The voxels that the ray traverses before hitting an occupied voxel are set to free. All the voxels traced after hitting the occupied voxel are set to unknown and traced, so that they are differentiated from the unknown voxels outside the sensor field of view during the update (lines 8-14). Each ray is stopped when it traverses the predefined maximum distance (Fig. 4).

This ray-tracing method assumes that all the obstacles in the camera field of view and within a certain depth are detected, which is ensured with RGB-D cameras or stereo-matching using state-of-the-art methods [tankovich2020hitnet].

On the GPU, we may have concurrent writes by some threads. This results in undefined behavior on the GPU, as some voxels may have different values written to them concurrently by different rays. This only affects voxels on the borders of a region occluded by an obstacle, and may not affect the overall performance depending on the application.

1: function RayTrace(ray_dir, ray_start, max_dist)
2:     ray_dir ← ray_dir / ‖ray_dir‖
3:     hit_occupied ← false
4:     ray_dir ← T_cam→grid · ray_dir
5:     voxel ← voxel containing ray_start
6:     dist ← 0
7:     initialize the traversal state using [amanatides1987fast]
8:     while dist < max_dist do
9:         move voxel by a voxel using [amanatides1987fast] and set dist to the new traced voxel
10:        if measurement_grid[voxel] = occupied then
11:            hit_occupied ← true
12:        else
13:            if hit_occupied then measurement_grid[voxel] ← unknown and traced
14:            else measurement_grid[voxel] ← free
Algorithm 2 Ray-trace to free voxels

IV-D Update the local voxel grid

After populating the measurement voxel grid with occupied voxels and ray-tracing to free voxels, the local voxel grid is updated with the free, occupied, and unknown-and-traced voxels, i.e. the previous values are replaced with the values of the free, occupied, and unknown-and-traced voxels of the measurement voxel grid (Alg. 3, lines 2-3).

The grids are stored as 1-dimensional arrays, with the 3 integer voxel coordinates mapped to a single linear index.

On the GPU, the update is done in parallel, resulting in a significant speed-up. The speed-up comes from being able to simultaneously access different memory addresses of the same array.

1: function UpdateGrid(i)
2:     if measurement_grid[i] ≠ unknown then
3:         local_grid[i] ← measurement_grid[i]
Algorithm 3 Update local grid with measurement grid

IV-E Shift the local voxel grid

Every time the robot moves by a voxel or more, the local map is shifted to be centered at the new robot position, and the new voxels resulting from that shift are initialized as unknown. This keeps the most relevant/closest obstacle information around the robot. The shift is done in voxel units, so we can simply copy voxels from the old local voxel grid (before shifting).

The shift is done after the local map is updated. This way it doesn’t add any unnecessary latency. This step is solely done on the CPU.

V Simulation results

(a) A comparison between the CPU and GPU implementation of step 2 (populating occupied voxels) of our method.
(b) A comparison between the CPU and GPU implementation of step 3 (ray-tracing to free voxels) of our method.
(c) A comparison between the CPU and GPU implementation of step 4 (merging the measurement grid with the local grid) of our method.
Figure 5: A comparison between the CPU and GPU implementation of steps 2-4. The CPU implementation is sequential and runs on a single core.

We run the simulation on the following hardware setup: for the CPU we use Intel Core i7-9750H up to 4.50 GHz, and for the GPU we use NVIDIA’s GeForce RTX 2060 up to 1.62 GHz.

We simulate the robot using mit-acl-gazebo (https://gitlab.com/mit-acl/lab/acl-gazebo). The sensor is an RGB-D camera that outputs a depth image with a resolution of . The maximum range is 6.5 meters, and . The chosen size of the voxel grid is and the voxel size . The robot size is which implies . This results in voxels, and rays to trace. These are the standard parameters that we use for the simulation time comparisons unless specified otherwise.

We compare the performance of every step run on the CPU with its GPU version. We analyse the efficiency of our GPU implementation in terms of warp state statistics and warp cycles per executed instruction. The occupancy metric of the GPU is not studied, as it depends on factors other than the efficiency of the algorithm, such as the number of blocks and threads resulting from the pointcloud size/number of rays to trace/number of voxels in a grid. We choose 128 threads per block (4 warps).

V-A Populate occupied voxels - simulation

We compare the computation time of the CPU and GPU implementations of the functional block "populate occupied voxels" as the number of obstacle points increases (Fig. 5(a)). The scale of the x and y axes is logarithmic. The GPU computation time starts quasi-constant, then increases linearly beyond an inflection point; this is due to the GPU not being fully occupied before reaching that point. The CPU computation time scales linearly with the number of points in the pointcloud, since the time complexity of processing one point is constant.

The GPU computation time slope in the linear segment is smaller than that of the CPU. This is expected, due to the SIMT architecture and the efficiency of our implementation.

The GPU implementation outperforms the CPU implementation for points and the difference grows bigger as the number of obstacle points increases. For point the difference in performance is .

GPU performance

The average active threads per warp (with predication) is , and the average not predicated off threads per warp is . The average warp cycles per executed instruction (which defines the latency between two consecutive instructions) is 11.85. This shows that our implementation for this step is efficient on the GPU.

(a) case 1: our method
(b) case 1: mit-acl-mapping
(c) case 2: our method
(d) case 2: mit-acl-mapping
Figure 6: We show the voxel grid generated by our method and mit-acl-mapping during a MAV exploration task using [tordesillas2020faster]. The occupied voxels are shown in orange, the unknown voxels in blue, and the free voxels are transparent. The two methods deliver close results.
Table II: CPU and GPU average computation time comparison (with the improvement factor) for the different functional blocks (populate occupied voxels, ray-trace to free voxels, update local grid with measurement grid, CUDA memory operations, and total) using the standard parameters (detailed in Sect. V) during an exploration task. The voxel grid setup is not included (negligible time), and neither is the local grid shifting (done only on the CPU after the local map is updated, so it does not affect latency).

V-B Ray-tracing to free voxels - simulation

We compare the computation time of the CPU and GPU implementations of the functional block "ray-tracing to free voxels" as the number of rays to trace increases (Fig. 5(b)). The scale of the x and y axes is logarithmic.

Both CPU and GPU computation times scale linearly. This is expected, as the computation time of a single ray-tracing function is limited by the maximum ray traversal distance and thus has a constant upper bound. The slope of the GPU computation time is also smaller than that of the CPU in this case. The GPU implementation outperforms the CPU implementation for rays and the difference grows bigger as the number of rays to trace increases. For rays the difference in performance is .

GPU performance

The average active threads per warp (with predication) is , and the average not predicated off threads per warp is . The reason is that we use the method described in [amanatides1987fast] for ray tracing, which includes conditional if-else statements that cause branching. Branching in general degrades GPU performance and leads to less active threads per warp and wasted cycles. The average warp cycles per executed instruction is .

Figure 7: A comparison between mit-acl-mapping [ryll2019efficient], the CPU and GPU implementation of our method. The red line represents the median. The lower and upper bounds of the box represent the 25th and 75th percentile respectively, and the lower and upper whiskers represent the minimum and maximum respectively.

V-C Update the local voxel grid - simulation

We compare the computation time of the CPU and GPU implementations of the functional block "update the local voxel grid" as the number of voxels in the grid increases (Fig. 5(c)). The scale of the x and y axes is logarithmic.

Both CPU and GPU computation times scale linearly. This is expected, as the per-voxel computation time (a read/write access) is constant. The slope of the GPU computation time is also smaller than that of the CPU in this case, because of the concurrent memory access operations that the GPU is capable of. The GPU implementation outperforms the CPU implementation for voxels and the difference grows bigger as the number of voxels increases. For voxels the difference in performance is .

GPU performance

The average active threads per warp (with predication) is , and the average not predicated off threads per warp is . It increases slightly as the number of voxels in a grid increases. The number of active threads per warp is below the maximum due to the if statement that verifies that a voxel of the measurement voxel grid is not unknown before merging it with the local voxel grid. The average warp cycles per executed instruction is .

V-D Comparison with mit-acl-mapping

The proposed method vastly outperforms mit-acl-mapping in terms of computation time (Fig. 7) while delivering similar results (Fig. 6). The reported computation times are for the setup explained at the beginning of Section V, while doing an exploration task using [tordesillas2020faster]. We show the distribution of the computation time (Fig. 7) as a boxplot. The CPU implementation of our method takes on average and is faster than mit-acl-mapping. The GPU implementation takes on average and is faster than the CPU implementation. Note that in addition to the kernels' execution time (steps 2, 3 and 4), the GPU time includes allocating (cudaMalloc) and freeing (cudaFree) memory on the device (GPU), setting this memory (cudaMemset), and transferring data from the host (CPU) to the device and from the device to the host (cudaMemcpy). On average, cudaMalloc takes , cudaMemcpy takes , cudaFree takes , cudaMemset takes and the kernels (launch + execution) take (Table II).

The results (Fig. 6) show that our method delivers results close to mit-acl-mapping but more conservative (fewer voxels are freed), which is expected since our rays are stopped when they hit an occupied voxel, whereas mit-acl-mapping uses [bresenham1965algorithm] to trace every pixel of the image (Fig. 4(a)). The method used by mit-acl-mapping cannot be implemented on a GPU as is, and the ray-tracing method it uses [bresenham1965algorithm] can miss voxels that the ray passes through, unlike [amanatides1987fast], which we use.

VI Conclusions and Future Works

We presented a novel method for the generation of voxel grids with occupied, free and unknown voxels. The method is efficient while sacrificing little accuracy. We compared our method to the state-of-the-art and implemented a GPU version of it which resulted in a considerable speed up. The CPU and GPU implementations outperform the state-of-the-art in computation time while delivering similar quality.

References