Semi-Global Matching on the GPU
Dense, robust and real-time computation of depth information from stereo-camera systems is a computationally demanding requirement for robotics, advanced driver assistance systems (ADAS) and autonomous vehicles. Semi-Global Matching (SGM) is a widely used algorithm that propagates consistency constraints along several paths across the image. This work presents a real-time system producing reliable disparity estimation results on the new embedded energy-efficient GPU devices. Our design runs on a Tegra X1 at 42 frames per second (fps) for an image size of 640x480, 128 disparity levels, and using 4 path directions for the SGM method.
Dense, robust and real-time computation of depth information from stereo-camera systems is a requirement in many industrial applications such as advanced driver assistance systems (ADAS), robotics navigation and autonomous vehicles. An efficient stereo algorithm has been a research topic for decades [1]. It has multiple applications, for example, [7] uses stereo information to filter candidate windows for pedestrian detection and provides better accuracy and performance.
Fig. 1 illustrates how to infer the depth of a given real-world point from its projection points on the left and right images. Assuming a simple translation between the cameras (otherwise, images must be rectified using multiple extrinsic and intrinsic camera parameters), the corresponding points must be in the same row of both images, along the epipolar lines. A similarity measure correlates matching pixels, and the disparity ($d$) is the distance between both points along that row.
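For rectified cameras, the step from disparity to depth is the standard triangulation relation $Z = f \cdot B / d$, with $f$ the focal length in pixels and $B$ the baseline. The sketch below only illustrates that relation; the function name and the rig parameters (700-pixel focal length, 0.5 m baseline) are hypothetical and do not come from the paper.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point observed with the given disparity in pixels.

    Assumes rectified cameras separated by a pure horizontal translation.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig, for illustration only: f = 700 px, B = 0.5 m.
for d in (1, 16, 127):
    print(d, depth_from_disparity(d, focal_px=700.0, baseline_m=0.5))
```

Depth is inversely proportional to disparity, so the 128 disparity levels used in this work resolve nearby objects much more finely than distant ones.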
Disparity estimation is a difficult task because of the high level of ambiguity that often appears in real situations, and a large variety of approaches have been proposed to address it [13]. Most of the high-accuracy stereo vision pipelines [17] include the semi-global matching (SGM) consistency-constraining algorithm [9]. The combination of SGM with different kinds of local similarity metrics is insensitive to various types of noise and interference (such as lighting changes), deals efficiently with large untextured areas, and is capable of retaining edges.
The high computational load and memory bandwidth requirements of SGM pose hard challenges for fast and low energy-consumption implementations. Dedicated hardware solutions (e.g. FPGA or ASIC) [3][11] achieve these goals, but they are very inflexible regarding changes in the algorithms. Implementations on desktop GPUs can meet real-time constraints [2], but their high power consumption and the need to attach a desktop computer make them less suitable for embedded systems.
Recently, with the appearance of embedded GPU-accelerated systems like the NVIDIA Jetson TX1 and the DrivePX platforms (incorporating, respectively, one and two Tegra X1 ARM processors), low-cost and low-consumption real-time stereo computation is becoming attainable. The objective of this work is to implement and evaluate a complete disparity estimation pipeline on this embedded GPU-accelerated device.
We present simple, but well-designed, baseline massively parallel schemes and data layouts for each of the algorithms required for disparity estimation, and then optimize the baseline code with specific strategies, like vectorization or CTA-to-warp conversion, to boost performance around 3 times. The optimized implementation runs on a single Tegra X1 at 42 frames per second (fps) for an image size of 640×480 pixels, 128 disparity levels, and using 4 path directions for the SGM method, providing high-quality real-time operation. While a high-end desktop GPU improves around 10 times the performance of the embedded GPU, the performance per watt ratio is 2.2 times worse. The source code is available at https://github.com/dhernandez0/sgm.
The rest of the paper is organized as follows. Section 2 presents the algorithms composing the disparity estimation pipeline, overviews the GPU architecture and programming model and discusses related work. In section 3 we describe each algorithm and then propose and discuss a parallel scheme and data layout. Finally, section 4 provides the obtained results and section 5 summarizes the work.
Fig. 2 shows the stages of the disparity computation pipeline: (1) the captured images are copied from the Host memory space to the GPU Device; (2) features are extracted from each image and used for similarity comparison to generate a local matching cost for each pixel and potential disparity; (3) a smoothing cost is aggregated to reduce errors (SGM); (4) disparity is computed and a 3×3 median filter is applied to remove outliers; and (5) the resulting disparity image is copied to the Host memory.
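To make the outlier-removal step of the pipeline concrete, here is a minimal sequential sketch of a 3×3 median filter over a disparity image. Leaving the border pixels unchanged is an assumption made for brevity; the paper does not state how its GPU implementation handles image borders, and the function name is hypothetical.

```python
def median3x3(img):
    """Return a copy of img (a list of rows) with a 3x3 median on interior pixels.

    Border pixels are copied unchanged (an assumption of this sketch).
    """
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = sorted(window)[4]  # 5th of the 9 sorted values
    return out


# A single wrong disparity (99) surrounded by consistent values (10).
disparity = [
    [10, 10, 10, 10],
    [10, 99, 10, 10],
    [10, 10, 10, 10],
]
filtered = median3x3(disparity)
print(filtered[1])
```

An isolated wrong disparity is replaced by the value of its neighbourhood, while regions that are already consistent are left untouched.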
Different similarity metrics or cost functions have been proposed in the literature. The least computationally demanding, which provide modest quality, are Sum of Absolute Differences, ZSAD and Rank Transform. According to [10], Hierarchical Mutual Information and the Census Transform (CT) features [16] provide similarly higher quality, with CT being substantially less time-consuming. Recently, costs based on neural networks have outperformed CT [17], but at the expense of a higher computational load.
A CT feature encodes the comparisons between the values of the pixels in a window around a central pixel. After empirically evaluating different variants we selected a Center-Symmetric Census Transform (CSCT) configuration with a 9×7 window, which provides a more compact representation with similar accuracy [14]. The similarity of two pixels is defined as the Hamming distance of their CSCT bit-vector features. Two properties provide robustness in outdoor environments with uncontrolled lighting and in the presence of calibration errors: the invariance to local intensity changes (neighboring pixels are compared to each other) and the tolerance to outliers (an incorrect value modifies a single bit).
In order to deal with non-unique or wrong correspondences due to low texture and ambiguity, consistency constraints can be included in the form of a global two-dimensional energy minimization problem. Semi-global matching (SGM) approximates the global solution by solving a one-dimensional minimization problem along several (typically 4 or 8) independent paths across the image. For each path direction, image point and disparity, SGM aggregates a cost that considers the cost of neighboring points and disparities. The number of paths affects both the quality and the performance of the results.
GPUs are massively parallel devices containing tens of throughput-oriented processing units called streaming multiprocessors (SMs). Memory and compute operations are executed as vector instructions and are highly pipelined in order to save energy and transistor budget. SMs can execute several vector instructions per cycle, selected from multiple independent execution flows: the higher the available parallelism, the better the pipeline utilization.
The CUDA programming model allows defining a massive number of potentially concurrent execution instances (called threads) of the same program code. A unique two-level identifier <CTA id, thread id> is used to specialize each thread for particular data and/or a particular function. A CTA (Cooperative Thread Array) comprises all the threads with the same CTA id, which run simultaneously and until completion in the same SM, and can share a fast but limited memory space. Warps are groups of threads with consecutive thread ids in the same CTA that are mapped by the compiler to vector instructions and, therefore, advance their execution in a lockstep synchronous way. The warps belonging to the same CTA can synchronize using an explicit barrier instruction. Each thread has its own private local memory space (commonly assigned to registers by the compiler), while a large space of global memory is public to all execution instances (mapped into a large-capacity but long-latency device memory, which is accelerated using a two-level hierarchy of cache memories).
The parallelization scheme of an algorithm and the data layout determine the available parallelism at the instruction and thread level (required for achieving full resource usage) and the memory access pattern. GPUs achieve efficient memory performance when the set of addresses generated by a warp refer to consecutive positions that can be coalesced into a single, wider memory transaction. Since the bandwidth of the device memory can be a performance bottleneck, an efficient CUDA code should promote data reuse on shared memory and registers.
A reference implementation of SGM on CPU [15] reached 5.43 frames per second (fps) with 640×480 image resolution and 128 disparity levels. They applied SGM with 8 path directions and an additional left-right consistency check and sub-pixel interpolation. A modified version with reduced disparity computation (rSGM) was able to reach 12 fps.
Early GPU implementations [5] and [12] present OpenGL/Cg SGM implementations with very similar performance results, peaking at 8 fps on 320×240 resolution images.
Versions designed for early CUDA systems proposed specific modifications of the SGM algorithm. Haller and Nedevschi [8] modified the original cost aggregation formula, removing the P1 penalty and using 4 path directions for cost aggregation. In this way, they reduced computation and memory usage, but also reduced accuracy. Their implementation reached 53 fps on an NVIDIA GTX 280 with images of 512×383.
The most recent implementation [2] reported very fast results: 27 fps on 1024×768 images using an NVIDIA Tesla C2050, with 128 disparity levels. By using Rank Transform [16] as the matching cost function, their proposal provides lower accuracy [10]. We point out some differences with respect to their parallel scheme in the following discussion.
As far as we know, this is the first evaluation of disparity estimation on an NVIDIA GPU-accelerated embedded system, as well as on the latest Maxwell architecture. We propose better parallelization schemes to take advantage of the hardware features available in current systems.
This section describes the algorithms used for disparity computation and discusses the alternative parallelization schemes and data layouts. We present the baseline pseudocode for the proposed massively parallel algorithms and explain additional optimizations.
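A 9×7-window Center-Symmetric Census Transform (CSCT) concatenates the comparisons of 31 pairs of pixels into a bit-vector feature. Equation 1 defines the CSCT, where $\otimes$ is bit-wise concatenation, $I(x,y)$ is the value of pixel $(x,y)$ in the input image, and $s(u,v)$ is 1 if $u \ge v$, or 0 otherwise. The matching cost $MC(x,y,d)$ between a pixel $(x,y)$ in the base image and each potentially corresponding pixel in the match image at disparity $d$ is defined by equation 2, where $\oplus$ is bit-wise exclusive-or and $\mathrm{bitcount}$ counts the number of bits set to 1.
$$CSCT_{9,7}(I,x,y) = \bigotimes_{(i,j) \in L} s\big(I(x-i,y-j),\, I(x+i,y+j)\big), \qquad L = \{(i,j) : -4 \le i \le 4,\ 1 \le j \le 3\} \cup \{(i,0) : 1 \le i \le 4\} \tag{1}$$
$$MC(x,y,d) = \mathrm{bitcount}\big(CSCT_{9,7}(I_{base},x,y) \oplus CSCT_{9,7}(I_{match},x-d,y)\big) \tag{2}$$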
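The two definitions can be illustrated with a small sequential sketch. The enumeration order of the 31 center-symmetric pairs and the `>=` comparison are assumptions of this sketch (any fixed order yields the same Hamming distances), and the function names are hypothetical rather than taken from the authors' code.

```python
def csct_9x7(img, x, y):
    """Center-symmetric census feature of pixel (x, y) as a 31-bit integer.

    img is a list of rows; (x, y) must lie at least 4 columns and 3 rows
    away from the image border.
    """
    # One half of the 9x7 window: the three full rows above the centre,
    # plus the four pixels to the left of the centre on the centre row.
    offsets = [(i, j) for j in (1, 2, 3) for i in range(-4, 5)]
    offsets += [(i, 0) for i in (1, 2, 3, 4)]
    bits = 0
    for i, j in offsets:
        a = img[y - j][x - i]   # pixel in the first half of the window
        b = img[y + j][x + i]   # its mirror image through the centre
        bits = (bits << 1) | (1 if a >= b else 0)
    return bits


def matching_cost(feat_base, feat_match):
    """Hamming distance between two CSCT features."""
    return bin(feat_base ^ feat_match).count("1")


# A horizontal intensity ramp and a uniformly brighter copy of it.
ramp = [[x for x in range(9)] for _ in range(7)]
brighter = [[v + 50 for v in row] for row in ramp]
print(matching_cost(csct_9x7(ramp, 4, 3), csct_9x7(brighter, 4, 3)))
```

Because only pairs of neighbouring pixels are compared, adding a constant offset to the image leaves the feature unchanged, which is the invariance to local intensity changes mentioned earlier.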
The data access patterns inherent in both equations exhibit different data reuse schemes, which prevent both algorithms from being fused. The 2D-tiled parallel scheme shown in Fig. 3 matches the 2D-stencil computation pattern of CSCT and maximizes data reuse: the attached table shows how a tiled scheme using shared memory reduces the total global data accesses with respect to a straightforward, naïve, embarrassingly parallel design, where each thread reads its input values directly from global memory.
The 1D-tiled parallel scheme for computing matching cost (MC) exploits data reuse on the x-dimension (see Fig. 4). As proposed in [2], we can represent matching cost using a single byte without losing accuracy, which reduces the memory bandwidth requirements 4 times in comparison to using 32-bit integers. The attached table shows that the read-cooperative scheme, compared to the naïve design, sacrifices parallelism (it divides the number of threads by $D$, the maximum disparity considered) in exchange for higher data reuse (around 8 times fewer global memory accesses). The low arithmetic intensity of the algorithm (2 main compute operations for every 9 bytes of memory accessed) calls for this kind of optimization.
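A quick back-of-the-envelope calculation shows why the single-byte representation matters. A 31-bit feature gives a Hamming distance of at most 31, which fits comfortably in a byte. The sketch below computes the size of the matching-cost volume for the configuration used in this work (640×480 pixels, 128 disparity levels); the helper name is hypothetical.

```python
WIDTH, HEIGHT, DISPARITIES = 640, 480, 128


def cost_volume_bytes(bytes_per_cost):
    """Size in bytes of one matching-cost volume (one cost per pixel and disparity)."""
    return WIDTH * HEIGHT * DISPARITIES * bytes_per_cost


one_byte = cost_volume_bytes(1)     # costs stored as unsigned bytes
four_bytes = cost_volume_bytes(4)   # costs stored as 32-bit integers
print(one_byte / 2**20, "MiB vs", four_bytes / 2**20, "MiB per frame")
```

Every stage that reads or writes the whole volume moves 4 times less data with byte-sized costs, which matters on a device where memory bandwidth is the likely bottleneck.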
Algorithms 1 and 2 show the pseudocode of both parallel algorithms, not including special code for corner cases handling image and CTA boundaries. In both cases, threads in the same CTA cooperate to read an input data tile into shared memory, then synchronize, and finally perform the assigned task reading the input data from shared memory. The first algorithm assumes a CTA size of threads and the second algorithm a CTA of threads. They are both scalable designs that use a small constant amount of shared memory per thread (1.5 and 12 Bytes, respectively).
There are two memory-efficient layout alternatives for algorithm 2. Each CTA generates a slice in the y-plane of the MC matrix, and threads can generate together the cost for (1) all the disparity levels for the same pixel or (2) all the pixels in the block for the same disparity level. We chose the first option and adapted the data layout so that the indexes of disparity levels vary fastest in the MC cube and global write instructions are coalesced. The second solution, used in [2], provides similar performance on this algorithm but compromises the available parallelism and the performance of the following SGM algorithm.
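The effect of the chosen layout on coalescing can be seen from the linear addresses it produces. In this sketch, `mc_index` places the disparity index fastest, as described above, while `mc_index_x_fastest` is a hypothetical alternative shown only for contrast; both names and the exact index arithmetic are assumptions, not the authors' code.

```python
def mc_index(x, y, d, width, n_disp):
    """Linear offset of MC(x, y, d) when the disparity index varies fastest."""
    return (y * width + x) * n_disp + d


def mc_index_x_fastest(x, y, d, width, n_disp):
    """Contrasting layout in which the x coordinate varies fastest."""
    return (y * n_disp + d) * width + x


WIDTH, N_DISP = 640, 128
# Four consecutive threads handle disparities 0..3 of the same pixel (5, 2).
same_pixel = [mc_index(5, 2, d, WIDTH, N_DISP) for d in range(4)]
strided = [mc_index_x_fastest(5, 2, d, WIDTH, N_DISP) for d in range(4)]
print(same_pixel)
print(strided)
```

With the disparity-fastest layout, threads that handle consecutive disparity levels of one pixel touch consecutive bytes, so a warp's writes can be merged into a single wide transaction. In the contrasting layout the same threads are a full image row apart.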
The SGM method solves a one-dimensional minimization problem along different paths $r=(r_x,r_y)$ using the recurrence defined by equation 3 and a dynamic programming algorithmic pattern. Matrix $L_r$ contains the smoothing aggregated costs for path $r$. The first term of equation 3 is the original matching cost, and the second term adds the minimum cost of the disparities corresponding to the previous pixel $(x-r_x,y-r_y)$, including penalties for small disparity changes ($P_1$) and for larger disparity discontinuities ($P_2$). $P_1$ is intended to detect slanted and curved surfaces, while $P_2$ smooths the results and makes abrupt changes difficult. The last term ensures that aggregated costs are bounded. For a detailed discussion refer to [9]. The different $L_r$ matrices must be added together to generate a final cost and then select the disparity corresponding to the minimum (winner-takes-all strategy), as shown by equation 4.
$$L_r(x,y,d) = MC(x,y,d) + \min\Big(L_r(x-r_x,y-r_y,d),\ L_r(x-r_x,y-r_y,d-1)+P_1,\ L_r(x-r_x,y-r_y,d+1)+P_1,\ \min_i L_r(x-r_x,y-r_y,i)+P_2\Big) - \min_k L_r(x-r_x,y-r_y,k) \tag{3}$$
$$D(x,y) = \operatorname*{arg\,min}_d \sum_r L_r(x,y,d) \tag{4}$$
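The recurrence of equation 3 and the selection of equation 4 can be sketched sequentially as follows, following the standard SGM formulation of [9]. The function names, the tiny cost table and the penalty values ($P_1 = 2$, $P_2 = 4$) are made up for illustration; this is a plain reference loop, not the parallel scheme proposed in this work.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one 1D path.

    costs[k][d] is the matching cost of the k-th pixel on the path at
    disparity d. Returns path costs of the same shape.
    """
    n_disp = len(costs[0])
    prev = list(costs[0])            # first pixel on the path: no predecessor
    out = [prev]
    for cost in costs[1:]:
        prev_min = min(prev)
        cur = []
        for d in range(n_disp):
            best = min(
                prev[d],                                               # same disparity
                prev[d - 1] + p1 if d > 0 else float("inf"),           # small change
                prev[d + 1] + p1 if d + 1 < n_disp else float("inf"),  # small change
                prev_min + p2,                                         # discontinuity
            )
            cur.append(cost[d] + best - prev_min)  # subtraction keeps costs bounded
        out.append(cur)
        prev = cur
    return out


def winner_takes_all(path_costs):
    """Sum the per-path costs of one pixel and return the disparity with minimum total."""
    n_disp = len(path_costs[0])
    total = [sum(path[d] for path in path_costs) for d in range(n_disp)]
    return min(range(n_disp), key=total.__getitem__)


# Three pixels on a scanline, three disparity levels. The middle pixel
# locally prefers disparity 1, while both neighbours clearly prefer 0.
costs = [[0, 5, 5], [4, 3, 5], [0, 5, 5]]
forward = aggregate_path(costs, p1=2, p2=4)
backward = aggregate_path(costs[::-1], p1=2, p2=4)[::-1]
print(winner_takes_all([forward[1], backward[1]]))
```

In this toy case the middle pixel's local minimum (disparity 1) is overruled by the support of its neighbours along both path directions, which is the smoothing behaviour the penalties are meant to produce.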
Equation 3 determines a recurrent dependence that prevents the parallel processing of pixels in the same path direction. Parallelism can be exploited, though, in the direction perpendicular to the path, in the disparity dimension, and for each of the computed path directions. Our proposal exploits all the available parallelism by creating a CTA for each slice in the aggregated cost matrix along each particular path direction.
Fig. 5 illustrates the case of the top-to-bottom path direction and algorithm 3 shows the pseudocode. Each slice is computed by a different CTA with one thread per disparity level, with each thread executing a recurrent loop (line 4) to generate cost values along the path. Computing the cost for the current pixel and disparity level requires the cost of the previous pixel on neighboring disparity levels: one value can be reused in a private thread register but the neighboring costs must be communicated among threads (lines 7, 8 and 12). Finally, all threads in the CTA must collaborate to compute the minimum cost for all disparity levels (line 11).
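The cooperative minimum over all disparity levels is typically computed as a tree reduction. The sketch below emulates that pattern sequentially: on the GPU, all lanes of one step would run in parallel and be separated by a barrier (or replaced by shuffle instructions). The function name is hypothetical and the power-of-two length is an assumption, which holds for 128 disparity levels.

```python
def cooperative_min(values):
    """Emulate a tree reduction over a power-of-two number of lanes.

    At each step, lane t keeps the minimum of its value and that of
    lane t + stride; the stride halves until lane 0 holds the result.
    """
    lanes = list(values)
    stride = len(lanes) // 2
    while stride > 0:
        for t in range(stride):          # these iterations are independent
            lanes[t] = min(lanes[t], lanes[t + stride])
        stride //= 2                     # a barrier would separate the steps
    return lanes[0]


print(cooperative_min([9, 4, 7, 3, 8, 6, 5, 2]))
```

For 128 disparity levels the reduction takes 7 steps instead of 127 sequential comparisons.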
The case for horizontal paths is very similar, with the slices computed in parallel. Diagonal path directions are a little more complex: independent CTAs process the diagonal slices moving in a vertical direction. When a CTA reaches a boundary, it continues on the other boundary. For example, a top-to-bottom and right-to-left diagonal slice starting at (x,y) = (100,0) will successively process pixels (99,1), (98,2) … (0,100), and then will reset the costs corresponding to the previous pixel and continue with pixels ($W$-1,101), ($W$-2,102) …, where $W$ is the image width.
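The wrap-around traversal of a diagonal slice can be written as a small generator. This is a sequential illustration of the pixel order only; the function name and the `reset` flag are hypothetical, and the 640×480 image size is taken from the configuration evaluated in this work.

```python
def diagonal_slice(x0, width, height):
    """Yield (x, y, reset) for a top-to-bottom, right-to-left diagonal slice.

    The slice starts at (x0, 0) and moves one row down and one column left
    per step. reset is True when the slice has just wrapped around to the
    right border, so the costs of the previous pixel must be discarded.
    """
    x = x0
    for y in range(height):
        yield x, y, (y > 0 and x == width - 1)
        x = x - 1 if x > 0 else width - 1   # wrap to the opposite border


pixels = list(diagonal_slice(100, width=640, height=480))
print(pixels[99:103])
```

Each slice visits exactly one pixel per row, so all slices perform the same amount of work and a full set of them covers the image without gaps.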
The cost aggregation and disparity computation defined by equation 4 have been fused in Algorithm 4 in order to reduce the amount of memory accesses (avoids writing and then reading the final cost matrix). A CTA-based parallel scheme is proposed so that each CTA produces the disparity of a single pixel (line 7): first, each CTA thread adds the costs corresponding to a given disparity level for all path directions (line 4), and then CTA threads cooperate to find the disparity level with minimum cost (line 5).
We have applied three types of optimizations to the baseline algorithms that provided a combined performance improvement of almost 3×. We have vectorized the inner loop of algorithm 3 (lines 4-12) to process a vector of 4 cost values (4 bytes) per instruction (requiring special byte-wise SIMD instructions for computing the minimum operation). We have also modified the parallel scheme so that a single warp performs the task previously assigned to a CTA, which we call CTA-to-warp conversion. It (1) avoids expensive synchronization operations, (2) allows using fast register-to-register communication (using special shuffle instructions) instead of shared-memory communications, and (3) reduces instruction count and increases instruction-level parallelism. A drawback of both strategies is a reduction of thread-level parallelism, as shown in [4]. This is not a severe problem in the embedded Tegra X1 device, with a maximum occupancy of 4 thousand threads.
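The byte-wise SIMD minimum can be emulated in a few lines to show what a single vectorized instruction computes. CUDA exposes this kind of operation through SIMD intrinsics such as `__vminu4`; whether the authors use that particular intrinsic is an assumption, and the Python helper names below are hypothetical.

```python
def pack4(costs):
    """Pack four byte-sized costs into one 32-bit word (first cost in the low byte)."""
    word = 0
    for lane, cost in enumerate(costs):
        word |= (cost & 0xFF) << (8 * lane)
    return word


def unpack4(word):
    """Inverse of pack4: split a 32-bit word into its four byte lanes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(4)]


def vmin_u4(a, b):
    """Per-byte unsigned minimum of two 32-bit words holding 4 costs each."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane_a = (a >> shift) & 0xFF
        lane_b = (b >> shift) & 0xFF
        result |= min(lane_a, lane_b) << shift
    return result


a = pack4([3, 200, 7, 0])
b = pack4([5, 100, 7, 255])
print(unpack4(vmin_u4(a, b)))
```

With costs packed this way, each thread handles four disparity levels per instruction, which is why vectorization trades thread-level parallelism for a lower instruction count.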
Finally, to reduce the amount of data accessed from memory, the computation of the aggregated cost for the last path direction (Alg. 3 Bottom-to-Top) is fused with the final cost summation and disparity computation (Alg. 4), providing a 1.35x performance speedup on the Tegra X1. Also, fusing the computation of the initial matching cost (Alg. 2) with the aggregate cost computation for the horizontal path directions (Alg. 3) improves performance by 1.13x.
We have measured execution time and disparity estimation accuracy for multiple images, 128 disparity levels, and 2, 4 and 8 path directions. Apart from executing on an NVIDIA Tegra X1, which integrates 8 ARM cores and 2 Maxwell SMs with a TDP of 10W, and for comparison purposes, we have also executed on a high-end NVIDIA Titan X, with 24 Maxwell SMs and a TDP of 250W. We ignore the time for CPU-GPU data transfers (less than 0.5% of the total elapsed time) since it can be overlapped with computation. Since performance scales proportionally to the number of image pixels, we will restrict our explanation to 640×480 images.
The legend in Fig. 6 indicates the disparity estimation accuracy, measured using the KITTI benchmark-suite [6], when using different SGM configurations, and not considering occluded pixels and treating more than 3 pixel differences as errors. Using 4 path directions (excluding diagonals) reduces accuracy very slightly, while using only the left-to-right and top-to-bottom directions reduces accuracy more noticeably.
The left and right charts in Fig. 6 show, respectively, the performance throughput (frames per second, or fps) and the performance per watt (fps/W) on both GPU systems and also for different SGM configurations. The high-end GPU always provides more than 10 times the performance of the embedded GPU (as expected from the difference in the number of SMs), but the latter offers around 2 times more performance per watt. It is remarkable that real-time rates (42 fps) with high accuracy are achieved by the Tegra X1 when using 4 path directions.
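The relation between speedup and energy efficiency can be checked with simple arithmetic. The sketch assumes that power draw equals the quoted TDP of each device (10 W and 250 W), which is a simplification, and the 10× speedup passed in the example is the lower bound quoted above, not a measured value. The function names are hypothetical.

```python
TEGRA_TDP_W, TITAN_TDP_W = 10.0, 250.0
TEGRA_FPS = 42.0   # 640x480 pixels, 128 disparity levels, 4 path directions


def fps_per_watt(fps, tdp_w):
    """Throughput per watt, assuming the device draws its full TDP."""
    return fps / tdp_w


def efficiency_advantage(speedup):
    """Ratio of Tegra X1 fps/W to Titan X fps/W if the Titan X is `speedup` times faster."""
    titan_fps = TEGRA_FPS * speedup
    return fps_per_watt(TEGRA_FPS, TEGRA_TDP_W) / fps_per_watt(titan_fps, TITAN_TDP_W)


print(fps_per_watt(TEGRA_FPS, TEGRA_TDP_W))
print(efficiency_advantage(10.0))
```

Under this assumption the advantage reduces to 25 divided by the speedup, so it stays around 2 for speedups slightly above 10, in line with the figures reported here.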
Finally, an example of the disparity computed by our proposed algorithm can be seen in Fig. 6(b).
The results obtained show that our implementation of depth computation for stereo-camera systems is able to reach real-time performance on a Tegra X1. This fact indicates that low-consumption embedded GPU systems, like the Tegra X1, are well capable of attaining real-time processing demands. Hence, their low-power envelope and remarkable performance make them good target platforms for real-time video processing, paving the way for more complex algorithms and applications.
We have proposed baseline parallel schemes and data layouts for the disparity estimation algorithms that follow general optimization rules based on a simple GPU performance model. They are designed to gracefully scale on the forthcoming GPU architectures, like NVIDIA Pascal. Then, we have optimized the baseline code and improved performance around 3 times with different specific strategies, like vectorization or CTA-to-warp conversion, that are also expected to be valid for forthcoming architectures.
We plan to prove the higher performance potential of the new embedded NVIDIA Pascal GPUs to enable real-time implementations with larger images and a higher number of disparity levels, and more complex algorithms that provide better estimation results. In this sense, we are going to include post-filtering steps such as Left-Right Consistency Check, subpixel calculation, and adaptive P2, which are well-known methods of increasing accuracy.
This research has been supported by the MICINN under contract number TIN2014-53234-C2-1-R, by the MEC under contract number TRA2014-57088-C2-1-R, by the Spanish DGT project SPIP2014-01352, and by the Generalitat de Catalunya projects 2014-SGR-1506 and 2014-SGR-1562. We thank NVIDIA for the donation of the systems used in this work.
Conference on Computer Vision and Pattern Recognition, 2012.