MTStereo 2.0: improved accuracy of stereo depth estimation with Max-trees

06/27/2020 · by Rafael Brandt, et al. · University of Groningen

Efficient yet accurate extraction of depth from stereo image pairs is required by systems with low power resources, such as robotics and embedded systems. State-of-the-art stereo matching methods based on convolutional neural networks require intensive computations on GPUs and are difficult to deploy on embedded systems. In this paper, we propose a stereo matching method, called MTStereo 2.0, for limited-resource systems that require efficient and accurate depth estimation. It is based on a Max-tree hierarchical representation of image pairs, which we use to identify matching regions along image scan-lines. The method includes a cost function that considers similarity of region contextual information based on the Max-trees and a disparity border preserving cost aggregation approach. MTStereo 2.0 improves on its predecessor MTStereo 1.0 as it a) deploys a more robust cost function, b) performs more thorough detection of incorrect matches, c) computes disparity maps with pixel-level rather than node-level precision. MTStereo provides accurate sparse and semi-dense depth estimation and does not require intensive GPU computations like methods based on CNNs. Thus it can run on embedded and robotics devices with low-power requirements. We tested the proposed approach on several benchmark data sets, namely KITTI 2015, Driving, FlyingThings3D, Middlebury 2014, Monkaa and the TrimBot2020 garden data sets, and achieved competitive accuracy and efficiency. The code is available at https://github.com/rbrandt1/MaxTreeS.


I Introduction

Estimation of scene depth from stereo image pairs is deployed as a building block in high-level computer vision applications, such as autonomous car driving [10, 27], obstacle avoidance by robots [24, 31], and simultaneous localization and mapping [9], among others. In two-view stereo matching, the three-dimensional structure of a scene is recovered by finding corresponding pixels in image pairs. Two pixels in the left and right image match when they capture the same scene point. The horizontal offset (i.e. disparity) d of the two pixels is used to compute the distance of the captured scene point as Z = bf/d, where b is the camera baseline and f the camera focal length.
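As a minimal illustration of this relation (a sketch with variable names of our own choosing, not code from the paper), metric depth follows directly from a disparity map once the baseline and focal length of the rectified rig are known:

```python
import numpy as np

def disparity_to_depth(disparity, baseline_m, focal_px):
    """Convert a disparity map (in pixels) to metric depth via Z = b*f/d.
    Pixels with zero or unknown disparity are mapped to infinity."""
    disparity = np.asarray(disparity, dtype=np.float32)
    return np.where(disparity > 0, baseline_m * focal_px / disparity, np.inf)

# Example: a 10 px disparity with a 0.10 m baseline and a 700 px focal length
# corresponds to a depth of 0.10 * 700 / 10 = 7 m.
```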

The similarity of two pixels is quantitatively computed by a matching cost function, e.g. based on the absolute image gradient or gray-level difference [30]. Substantial matching ambiguity is, however, caused by repetitive patterns and uniformly colored regions. Hence, costs are aggregated over neighboring pixels to strengthen the robustness of the matching evaluation. For instance, color similarity and proximity were used for disparity estimation in [41] and [44]. A scheme that takes into account the strength of image boundaries between pixels was proposed in [3]. Early methods performed an exhaustive matching search [1], which requires many computations. Later approaches reduced the disparity search range by computing a coarse disparity map first and then refining it iteratively [11]. Image pyramids were also used to reduce the disparity search range in [34, 19]: a coarse disparity map is estimated considering the full disparity range, and increasingly higher-resolution disparity maps are then constructed whereby the disparity search range is dictated by the previously computed (coarser) disparity map. To increase efficiency and reduce matching ambiguity, matching (hierarchically structured) image regions instead of individual pixels was proposed [6, 22, 35]. Such methods may require computationally expensive segmentation steps.

Recent approaches use convolutional neural networks (CNNs) to compute aggregated matching costs. One of the first CNN-based stereo matching methods deployed a siamese network architecture [42], which works with small patch inputs. Approaches have been suggested to increase the receptive field while maintaining details in estimated disparity maps. In [4], pairs of siamese networks, each receiving as input a pair of patches at a different scale, were used, and the matching cost was computed as the inner product between the responses of the siamese networks. Various approaches were further developed to improve the quality and accuracy of depth maps estimated by CNNs, namely recurrent architecture blocks [5], stacked modules with short-cut connections and separable convolutions [7], group-wise convolutions [12], 3D convolutional layers as in GC-Net [17], and layers focused on local- and whole-image features to compute cost dependencies [43].

Although CNN-based methods compute highly accurate disparity maps, they need power-consuming dedicated hardware or GPUs to compute the many convolutions. This limits their usability on embedded or power-constrained systems. This applies, for instance, to battery-powered robots or drones, for which depth estimation has to trade-off between accuracy and computational efficiency. Furthermore, for robot navigation and obstacle avoidance, the very high accuracy and density of estimation achieved by CNNs are not strictly necessary.

In this paper, we present a stereo matching method, named MTStereo 2.0, that balances efficiency with effectiveness, making it appropriate for devices with limited computational and energy resources. The MTStereo 1.0 algorithm, which we proposed in [2], exploits contrast information of objects in a hierarchical fashion to perform stereo matching efficiently and effectively. It constructs a hierarchical representation of image scan-lines using Max-Trees [28], and performs disparity estimation via a tree matching cost computation that takes into account contextual image structural information. The MTStereo 2.0 algorithm that we propose improves on the 1.0 version as it a) deploys a more robust cost function, b) performs more thorough incorrect match detection, and c) computes disparity maps with pixel-level rather than node-level precision.

We carried out an extensive experimental benchmark on several data sets, namely KITTI 2015, Driving, FlyingThings3D, Middlebury 2014, Monkaa and the TrimBot2020 garden data sets.

II MTStereo 2.0

We present MTStereo 2.0, which exploits contrast information of objects in a hierarchical fashion to efficiently and effectively perform stereo matching. In the following, we outline the elements of MTStereo 1.0 that are also present in MTStereo 2.0, and detail the novel elements that we introduced in the new version of the algorithm. We make an implementation of MTStereo 2.0 available at https://github.com/rbrandt1/MaxTreeS.

II-A The method

A pair of stereo images depicting a cube is shown in Fig. 1 (top row) together with their version processed by an edge-detection filter (middle row). Darker regions in the processed images (middle row) correspond to regions with higher contrast, while lighter regions correspond to areas with less contrast. We use this contrast information to construct Max-Trees of the image scan-lines. In Fig. 1, examples of Max-Trees constructed for the highlighted scan-lines in the middle row are depicted in the third row. Matching the finest structures directly yields precise but inefficient stereo matching, while matching coarse structures yields disparity maps that lack precision but can be obtained efficiently. To tackle this trade-off and efficiently obtain precise disparity maps, our method matches increasingly finer regions in an iterative manner, and only compares regions contained in earlier matched coarser ones. MTStereo 2.0 is composed of the following steps: pre-processing, Max-Tree construction, cost volume computation, cost aggregation, consistency check, disparity map computation, confidence check and map refinement.

Fig. 1: A pair of stereo images (top row) and their version resulting from edge detection (middle row; darker regions contain more contrast). The Max-Tree representations of the horizontal scan-lines highlighted in the middle row are illustrated in the bottom row: MTStereo 2.0 computes the disparity for coarse structures first (nodes represented by darker bars) and increasingly matches only finer structures (lighter bars) contained in earlier matched coarser ones.

Pre-processing

We process the input rectified image pair with a median filter for noise removal. Subsequently, we detect edges in the images with a horizontal and a vertical Sobel operator, and average their absolute response images pixel-wise. We invert the resulting image, and perform contrast stretching and color quantization. We show an example of a processed image in Fig. 2a. We compute a 1D Max-Tree for each scan-line of the pre-processed images. A parameter controls the number of colors used for color quantization and thus influences the size of the constructed trees. Shallower trees are less expensive to match, but represent the image structures less precisely.
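The sketch below illustrates this pre-processing chain under our own parameter names (median_ksize, quant_levels); it is an assumption-laden approximation, not the released implementation:

```python
import cv2
import numpy as np

def preprocess(image, median_ksize=5, quant_levels=16):
    """Sketch of the pre-processing stage: median filtering, averaged absolute
    Sobel responses, inversion, contrast stretching, and gray-level quantization.
    Expects an 8-bit BGR or grayscale input image."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    gray = cv2.medianBlur(gray, median_ksize)                 # noise removal
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)                    # horizontal Sobel
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)                    # vertical Sobel
    edges = (np.abs(gx) + np.abs(gy)) / 2.0                   # pixel-wise average
    inverted = edges.max() - edges                            # darker = more contrast
    stretched = cv2.normalize(inverted, None, 0, 255, cv2.NORM_MINMAX)
    step = 256.0 / quant_levels                               # fewer levels -> shallower trees
    return (np.floor(stretched / step) * step).astype(np.uint8)
```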

Fig. 2: Outputs at intermediate algorithm stages: (a) pre-processed image from the Middlebury training data set, (b) output of coarse-to-fine matching, (c) map after outlier removal, (d) reliable node extrapolation, (e) output of guided pixel matching, (f) output map after outlier removal.

Max-Tree construction

The coarse-to-fine matching is facilitated by a hierarchical representation of both images in a rectified stereo pair: we compute 1D Max-Trees on the scan-lines of the images, which store regions with less contrast as nested within regions with more contrast [35]. The Max-Tree, proposed in [28], stores the hierarchy of connected components resulting from different thresholds and can be constructed efficiently. In the binary image obtained by thresholding a 1D gray-scale image (i.e. a scan-line), a 1D connected component is a set of 1-valued pixels with no 0-valued pixel between any two of them. We construct 1D Max-Trees using the algorithm in [39].
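As a rough illustration of the data structure (not the algorithm of [39]), the following sketch builds a 1D Max-Tree of a pre-processed scan-line with a simple stack-based pass; the node fields and names are our own:

```python
def max_tree_1d(values):
    """Build a 1D Max-Tree of a scan-line. Each node is a maximal run of pixels
    with value >= node['level']; children are nested runs at higher levels."""
    nodes = []   # finished and open nodes; the root ends up with parent == None
    stack = []   # indices of currently open nodes, levels strictly increasing

    for i, v in enumerate(values):
        start = i
        # close open components that are strictly brighter than the current pixel
        while stack and nodes[stack[-1]]['level'] > v:
            closed = stack.pop()
            nodes[closed]['end'] = i - 1
            start = nodes[closed]['start']       # the enclosing run also covers these pixels
            if stack and nodes[stack[-1]]['level'] >= v:
                nodes[closed]['parent'] = stack[-1]
            else:                                # parent will be the node opened next
                nodes[closed]['parent'] = len(nodes)
        if not stack or nodes[stack[-1]]['level'] < v:
            nodes.append({'level': v, 'start': start, 'end': None, 'parent': None})
            stack.append(len(nodes) - 1)
        # if v equals the open level, the current run simply extends

    while stack:                                 # close whatever is still open
        closed = stack.pop()
        nodes[closed]['end'] = len(values) - 1
        nodes[closed]['parent'] = stack[-1] if stack else None
    return nodes

# max_tree_1d([1, 3, 2, 3, 1]) yields a root covering the whole line, with a
# level-2 child over columns 1..3 and two level-3 leaves at columns 1 and 3.
```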

Cost volume

We construct a cost volume by computing the matching cost of each pixel in the left image with those in the right image at all possible disparity levels. The matching cost of two pixels is the weighted average of the absolute differences between their gray-level, horizontal Sobel, and vertical Sobel values. We process each slice of the cost volume with a Gaussian blur operator, which smooths the estimation in textured areas and facilitates the subsequent steps of the algorithm. A parameter controls the size of the Gaussian blur kernel. MTStereo 1.0 does not make use of a smoothed cost volume and does not consider gray-level difference in the cost computation.
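A hedged sketch of such a cost volume follows; the weights, kernel size, and function name are illustrative assumptions rather than the values used in our implementation:

```python
import cv2
import numpy as np

def cost_volume(left, right, max_disp, w_gray=1.0, w_sobel=1.0, blur_ksize=5):
    """Cost of matching left pixel (y, x) with right pixel (y, x - d): weighted
    average of absolute differences of gray level and Sobel responses, followed
    by Gaussian smoothing of each disparity slice. Expects single-channel images."""
    gl, gr = left.astype(np.float32), right.astype(np.float32)
    feats = lambda g: (g, cv2.Sobel(g, cv2.CV_32F, 1, 0), cv2.Sobel(g, cv2.CV_32F, 0, 1))
    fl, fr = feats(gl), feats(gr)
    weights = (w_gray, w_sobel, w_sobel)
    h, w = gl.shape
    volume = np.full((max_disp, h, w), np.inf, dtype=np.float32)
    for d in range(max_disp):
        cost = np.zeros((h, w - d), dtype=np.float32)
        for wgt, a, b in zip(weights, fl, fr):
            cost += wgt * np.abs(a[:, d:] - b[:, :w - d])   # left x matches right x - d
        cost /= sum(weights)
        volume[d, :, d:] = cv2.GaussianBlur(cost, (blur_ksize, blur_ksize), 0)
    return volume
```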

Matching cost aggregation

The constructed cost volume and tree structures are then used for matching of regions in image pairs. The nodes corresponding to coarse image structures are matched first; thereafter, increasingly finer structures are matched. Only nodes whose coarseness level belongs to a predefined set are matched (see [2] for details). Only the finest nodes, and their descendants, whose width is greater than a minimum and less than a maximum width parameter are matched.

The matching cost of node pairs consists of a context cost and an intensity cost. The context cost is the average relative difference between the areas of corresponding ancestors of a node pair. Given a node pair, a pair of their ancestors can be matched if the number of nodes between each node and its ancestor is equal in both Max-Trees. We define the intensity cost as the average of the cost volume matching costs at aligned pixels belonging to the two nodes. Given a node pair, we compute the matching cost at the location of the left node's left (right) endpoint and the disparity between the left (right) endpoints of both nodes, as well as at linearly interpolated disparity and pixel values in between the two endpoints. This definition of intensity cost is more robust than that of MTStereo 1.0. A parameter controls the relative weight of the intensity cost and the context cost.
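The sketch below illustrates one way to combine the two terms; the node representation, the ancestor pairing by equal depth, and the weight alpha are our own assumptions, building on the max_tree_1d and cost_volume sketches above:

```python
import numpy as np

def node_matching_cost(volume, row, nl, nr, nodes_l, nodes_r, alpha=0.5):
    """Weighted combination of an intensity cost (cost-volume samples between the
    matched endpoints) and a context cost (relative width difference of ancestors)."""
    width = lambda n: n['end'] - n['start'] + 1

    def ancestors(node, nodes):
        out = []
        while node['parent'] is not None:
            node = nodes[node['parent']]
            out.append(node)
        return out

    # context cost: average relative width difference of ancestors at equal depth
    pairs = list(zip(ancestors(nl, nodes_l), ancestors(nr, nodes_r)))
    context = np.mean([abs(width(a) - width(b)) / max(width(a), width(b))
                       for a, b in pairs]) if pairs else 0.0

    # intensity cost: sample the cost volume at pixels/disparities linearly
    # interpolated between the left and right endpoints of the node pair
    n_samples = max(width(nl), 2)
    xs = np.linspace(nl['start'], nl['end'], n_samples)
    ds = np.linspace(nl['start'] - nr['start'], nl['end'] - nr['end'], n_samples)
    samples = [volume[int(round(d)), row, int(round(x))]
               for x, d in zip(xs, ds) if 0 <= int(round(d)) < volume.shape[0]]
    intensity = np.mean(samples) if samples else np.inf

    return alpha * intensity + (1.0 - alpha) * context
```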

Consider a pair of nodes, one in the left image and one in the right image, both on the same scan-line. Matching cost is aggregated over their node neighborhood. We define the neighborhood of a node pair as the pairs of nodes with the same coarseness level that have similar x-coordinates and lie on adjacent scan-lines in the original image. Two nodes have the same coarseness level when the distance (i.e. the difference in tree level) between each node and its deepest descendant leaf is equal. Recursively, a node pair belongs to the neighborhood of the considered pair if its left node crosses the x-coordinate of the center of the considered left node, its right node crosses the x-coordinate of the center of the considered right node, and both of its nodes lie on the scan-line directly above or below that of the considered pair. At most a fixed number of node pairs above and below a node pair are included in its neighborhood. In the coarse-to-fine matching procedure, we compute a disparity search range for each node in each iteration. Given a node pair that has likely been correctly matched in a previous iteration, only descendants of this node pair are matched in subsequent iterations. Nodes are considered likely correctly matched when they pass a left-right consistency check [38] and a confidence check (the peak-ratio used in [40]). The confidence check, not used in MTStereo 1.0, yields more accurate disparity maps as it filters out ambiguous (and thus likely incorrect) matches.
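The two filters can be sketched as follows; the tolerance, the peak-ratio threshold, and the argument layout are illustrative assumptions:

```python
import numpy as np

def passes_checks(candidate_costs, best_disp, disp_right, x_left,
                  lr_tolerance=1, peak_ratio=1.1):
    """Left-right consistency check plus peak-ratio confidence check.
    `candidate_costs` are the costs of all tested disparities (lower is better);
    `disp_right` is the disparity estimated from the right image on the same scan-line."""
    # left-right consistency: matching back from the right image must land
    # (approximately) on the same disparity
    x_right = int(round(x_left - best_disp))
    if not (0 <= x_right < len(disp_right)):
        return False
    consistent = abs(disp_right[x_right] - best_disp) <= lr_tolerance

    # peak-ratio confidence: the best cost must clearly beat the second best,
    # which filters out ambiguous matches (e.g. on repetitive patterns)
    c = np.sort(np.asarray(candidate_costs, dtype=np.float32))
    confident = len(c) > 1 and c[1] >= peak_ratio * c[0]

    return consistent and confident
```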

  • Middlebury 2014 [13]: Indoor scenes. While average accuracy results were weighted by the official weights, average density results were not weighted. The full-size version was used in our experiments; some of the results taken from the benchmark website were generated using a down-scaled version of the data set.
  • Kitti 2015 [23]: Street scenes captured from the viewpoint of a driving car.
  • Trimbot2020 Synthetic Garden [36]: Outdoor garden scenes rendered from 3D synthetic models of gardens in the context of the TrimBot2020 project [33]. Test set used by [26].
  • Trimbot2020 Real Garden [29]: Outdoor garden scenes recorded in the test garden of the TrimBot2020 project. Test set used by [26].
  • Driving [20]: Realistic street scene images captured from the viewpoint of a driving car. Cleanpass, fast, 35mm_focallength subset.
  • Monkaa [20]: Un-naturalistic images of furry objects in outdoor scenes. Cleanpass subset.
  • Flying Things 3D [20]: Textured objects moving along random 3D paths. Cleanpass, test set. Stereo pairs excluded by [21] were excluded in our experiments as well.
TABLE I: Details of the data sets used for the experiments.

Disparity map

The coarse-to-fine matching procedure results in a list of matched node pairs. For each of the finest nodes, the disparity at the left and right endpoints of matched nodes is computed and linearly interpolated. A resulting disparity map is illustrated in Fig. 2b. We process the disparity map to remove outliers: a disparity value is removed when, in its local neighborhood, the number of pixels whose disparity difference exceeds their offset to the considered pixel surpasses the number of pixels whose disparity difference is less than or equal to that offset. A disparity map after noise removal is illustrated in Fig. 2c.
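One possible reading of this filter, with the offset interpreted as the spatial distance between the two pixels (an assumption on our part), is sketched below:

```python
import numpy as np

def remove_outliers(disp, radius=5):
    """Remove a disparity value when, within a local window, neighbours whose
    disparity difference exceeds their spatial offset outnumber those whose
    difference is within that offset. Invalid pixels are NaN."""
    h, w = disp.shape
    out = disp.copy()
    for y, x in zip(*np.nonzero(np.isfinite(disp))):
        inliers = outliers = 0
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                yy, xx = y + dy, x + dx
                if (dy, dx) == (0, 0) or not (0 <= yy < h and 0 <= xx < w):
                    continue
                if not np.isfinite(disp[yy, xx]):
                    continue
                if abs(disp[yy, xx] - disp[y, x]) > np.hypot(dy, dx):
                    outliers += 1
                else:
                    inliers += 1
        if outliers > inliers:
            out[y, x] = np.nan
    return out
```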

We improve the density and accuracy of a disparity map by computing median disparity values of nodes that are neighbors across scan-lines. Hence, nodes previously without a disparity assignment obtain a disparity value and nodes with an outlier disparity assignment are corrected. The disparity values at the endpoints of the finest nodes are stored in the Max-Trees. Thereafter, for each of the finest nodes, the medians of the left- and right-endpoint disparities among the neighbor nodes above and below it that have a disparity assigned to their endpoints are stored in the node. When a semi-dense disparity map is generated, the disparity between the left and right endpoints of matched nodes is linearly interpolated, whereas when a sparse disparity map is generated, disparity is only assigned to the endpoints of the node pair. A disparity map after reliable node extrapolation is illustrated in Fig. 2d. Differently from MTStereo 1.0, we then perform pixel matching to recover surface shape. The disparity search range for pixels is set such that only disparities within a given percentage above or below the previously computed disparity value are considered. The matching cost of pixel pairs is derived from the constructed cost volume.

Confidence check and map refinement

We introduce a confidence check on the estimated pixel disparities. If a match does not pass the confidence check, it is not taken into account for the disparity map. A match passes the confidence check when the relative difference between the matching cost of the best match and that of the second-best match is more than a percentage controlled by a parameter. Areas that contain more texture are more likely to pass the confidence check. A disparity map after guided pixel matching is illustrated in Fig. 2e. We obtain the final disparity map by removing the outliers. A disparity map after noise removal is illustrated in Fig. 2f.

III Experiments

We carried out an extensive evaluation of the MTStereo 2.0 performance on several benchmark data sets, of which we report details in Table I, and compared its results with those of other methods. We ran our algorithm on an Intel® Core™ i7-2600K CPU @ 3.40 GHz.

Columns: Middlebury Train, Middlebury Test, Kitti2015 Train, Kitti2015 Test, Real Garden, Synth Garden, Driving, Monkaa, Flying Things
MTS2.0 (sparse) 2.35(3%) 9.36(3%) 1.35(4%) 7.83(4%) 3.08(5%) 5.22(6%) 5.36(3%) 3.83(4%) 1.7(6%)
MTS2.0 (semi-dense) 6.51(24%) (25%) 3.72(20%) 7.04(19%) 7.38(15%) 6.54(25%) 2.61(30%)
MTS1.0 (sparse) 5.47(2%) 15.5(2%) 1.58(2%) 8.92(3%) 2.48(2%) 7.32(2%) 8.8(1%) 7.22(3%) 3.03(4%)
MTS1.0 (semi-dense) 17.47(57%) 4.47(44%) 3.78(18%) 12.8(14%) 16.2(38%) 15.41(40%) 6.93(58%)
SGBM1 [14] 7.83(67%) 16.3(63%) 1.45(84%) 2.61(89%) 4.77(92%) 16.06(70%) 11.37(81%) 5.14(86%)
SGBM2 [14] 8.92(83%) 15.9(77%) 1.27(82%) 5.86(90%) 2.16(90%) 4.67(90%) 15.72(64%) 10.5(78%) 4.58(85%)
ELAS_ROB [11] 10.5 13.4 1.49(99%) 9.67 2.06 7.02 11.71 17.75 7.46
FPGA Stereo [15] 2.94[26] 11.41[26]
DispNet [21] 0.68[21] 4.34 1.35[26] 6.28[26] 15.62[21] 5.78[21] 1.68[21]
EdgeStereo [32] 2.00 3.72 2.07(?%)[32] 2.08 0.74(?%)[32]
iResNet [18] 1.21(?%)[32] 2.44(?%)[32] 0.95(?%)[32]
TABLE II: Results expressed in avgerr, except for Kitti2015 Test, which is expressed in D-all-est. Average prediction density is reported in brackets. Entries without a density specification are 100% dense; entries with a ?% density specification have unknown density. Some results were (possibly) computed on a different set of images than we used for evaluation, some were taken from the respective benchmark website, and some were (possibly) computed using a different metric computation approach; such cases are marked accordingly in the table.
MotionStereo [37]: 1.72 (46%)
Sparse (MTStereo 2.0): 2.35 (3%)
SNCC [8]: 3.25 (62%)
LS-ELAS [16]: 4.35 (61%)
ELAS [11]: 4.94 (73%)
SED [25]: 5.38 (1%)
Semi-Dense (MTStereo 2.0): 6.51 (24%)
SGBM1 [14]: 7.83 (67%)
SGBM2 [14]: 8.92 (83%)
TABLE III: Results on the Middlebury training benchmark expressed in avgerr, with prediction density in brackets. Sparse and Semi-Dense denote MTStereo 2.0. All results were taken from the respective benchmark website, except that of Semi-Dense. Results computed on a different set of images than we used for evaluation are marked accordingly.
MotionStereo [37]: 3.30 (38%)
SNCC [8]: 3.96 (55%)
LS-ELAS [16]: 9.10 (49%)
Sparse (MTStereo 2.0): 9.36 (3%)
ELAS [11]: 10.6 (66%)
SED [25]: 12.3 (2%)
SGBM1 [14]: 16.3 (63%)
SGBM2 [14]: 15.9 (77%)
TABLE IV: Results on the Middlebury test benchmark expressed in avgerr, with prediction density in brackets. Sparse denotes MTStereo 2.0. All results were taken from the respective benchmark website. Results computed on a different set of images than we used for evaluation are marked accordingly.

III-A Evaluation

We computed standard metrics for the concerned benchmarks, allowing direct comparison with existing methods (a small computation sketch follows the list):

  • avgerr: the average absolute disparity error (in pixels) over all pixels for which a disparity value exists in both the ground truth and the estimated disparity map.

  • D-all-est: the percentage of disparity outliers (pixels whose disparity error exceeds 3 px and 5% of the true disparity) over all pixels for which a disparity value exists in both the ground truth and the estimated disparity map.

  • Density: the percentage of pixels with a disparity prediction with respect to the total number of pixels in the reference image. Values were rounded to the nearest decimal.

  • time/MP: the execution time in seconds normalized by the number of megapixels in the reference image.
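A minimal sketch of how these metrics can be computed (the function name and the use of NaN to mark missing values are our own assumptions):

```python
import numpy as np

def evaluate(pred, gt):
    """avgerr, D-all-est and density of a predicted disparity map `pred`
    against ground truth `gt`; missing values in either map are NaN."""
    both = np.isfinite(pred) & np.isfinite(gt)
    err = np.abs(pred[both] - gt[both])
    avgerr = float(err.mean())
    # outlier: error exceeds 3 px and 5% of the true disparity
    d_all_est = 100.0 * float(np.mean((err > 3.0) & (err > 0.05 * gt[both])))
    density = 100.0 * float(np.isfinite(pred).mean())
    return avgerr, d_all_est, density
```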

We defined a single set of parameters that provides robust performance across the different benchmark data sets. The number of color quantization levels was set to 16 (8) when sparse (semi-dense) disparity maps were generated. The weight of the context cost relative to that of the gradient cost was set to 0.8. The minimum width of nodes to be matched was set to 0 and the maximum width to a fraction of the input image width. The set of matched node coarseness levels was fixed. The maximum neighbourhood size was set to 10. The size of the Gaussian kernel used to smooth the cost volume and the minimum confidence percentage used during coarse-to-fine matching were likewise fixed. The parameters of the guided pixel refinement were fixed as well, with one of them taking different values in the sparse and semi-dense settings.

III-B Results and comparison

In Table II, we report the results achieved by our method on the considered data sets, as well as those of other existing methods that do not formulate stereo matching as a learning problem (upper part) and those that deploy neural networks to tackle the disparity estimation problem (lower part). The results of MTStereo 1.0 were generated using the parameters of [2]. We used the same parameter values in all experiments, except for those involving the Trimbot data sets, for which one parameter was set to a different value. In Table III and Table IV, we report the error achieved by our method on the Middlebury training and test data, respectively, as well as that of other existing methods that have a low average time/MP and do not run on a GPU. The lower the values, the better the performance. Both the sparse and semi-dense versions of our method achieved accuracy better than or comparable to that of many existing methods. The accuracy results that we achieved, especially in the case of the sparse estimation, are comparable to those achieved by approaches that deploy architectures based on convolutional networks and are trained with large amounts of labeled data.

The results of MTStereo 2.0 sparse and semi-dense are generally more accurate than those of their MTStereo 1.0 counterparts. Note that the density of the disparity maps produced by the methods is influenced by the parameters, which can be used to trade off accuracy against density. The cases in which MTStereo 2.0 produces both more accurate and denser disparity maps than MTStereo 1.0 (e.g. for the sparse variants when evaluated on the Middlebury, Kitti2015, Synth Garden, and Driving data sets) suggest that MTStereo 2.0 is a more robust method than MTStereo 1.0.

The average disparity error (avgerr) of disparity maps produced by MTStereo 2.0 sparse is often lower than that of disparity maps produced by the other listed methods. The results on the Middlebury data sets show, for example, that the avgerr of sparse disparity maps produced by MTStereo 2.0 is lower than that of all other listed methods except MotionStereo. Also, the avgerr of MTStereo 2.0 sparse on the Kitti2015 training data set is lower than that of all other methods except DispNet, iResNet, and SGBM2. Furthermore, the avgerr of disparity maps produced by MTStereo 2.0 sparse is lower than that of all other non-learning-based methods when evaluated on Driving, Monkaa, and Flying Things 3D. MTStereo 2.0 sparse produced more accurate disparity maps than some of the listed CNN-based methods when evaluated on the Kitti2015 training, Synthetic Garden, Driving, and Monkaa data sets. Note that the performance of CNN-based methods depends on the training set used, and that the densities of the disparity maps produced by said CNN-based methods are higher. The avgerr of disparity maps produced by MTStereo 2.0 semi-dense is often lower than that of disparity maps produced by other listed methods. The avgerr achieved by MTStereo 2.0 is frequently close to that of MTStereo 1.0 sparse, while it produces denser disparity maps.

The application of stereo matching in robotics introduces challenges: robots frequently have a tight power budget, real-time performance is required, and stereo cameras can be of low quality. The Real Garden and Synthetic Garden data sets contain stereo pairs as observed by the TrimBot2020 gardening robot [33]. The performance of our method on the Real Garden and Synthetic Garden data sets, which contain low-resolution stereo pairs, is comparable to that of the existing non-learning-based methods. Together with the implementation of MTStereo 2.0, we released a ROS-node version optimized for scenes containing plants, and examples of point clouds it produces, as the algorithm was deployed on the TrimBot2020 robot (https://www.youtube.com/watch?v=ZX3rPrq6x5g). The accuracy of disparity maps produced by learning-based methods is affected by differences between the data set used for training and that used for evaluation. The results of DispNet on Kitti2015 were obtained with a model trained on the FlyingThings3D data set and subsequently fine-tuned on the Kitti2015 data set. When DispNet is not fine-tuned on Kitti2015, its error (avgerr) increases to 1.59 [21]. The different characteristics of the data sets seem to limit performance.


Fig. 3: Example images from the Middlebury, Trimbot2020 Synthetic Garden, Monkaa, and Flying Things 3D data sets, with corresponding ground truth depth images and our semi-dense and sparse reconstructions. Morphological dilation was applied to the sparse outputs for visualization purposes.


Fig. 4: Difference between the accuracy of MTStereo 1.0 and 2.0 for the semi-dense and sparse variants. Example image taken from the Kitti2015 training data set. Morphological dilation was applied to all outputs for visualization purposes.

In Fig. 3, we show example images from the Middlebury, Trimbot2020 Synthetic Garden, Monkaa, and Flying Things 3D data sets, with corresponding ground truth depth images and our semi-dense and sparse depth reconstructions. One can notice that our method is able to robustly extract depth information also in image regions with little texture. For example, the depth of the gray cubes in the Flying Things example image is estimated reasonably well although the cube surfaces are rendered with plain colors. Regions with very little or no texture usually have an inherent disparity ambiguity, and our method allows for controlling such cases. When the corresponding parameter is set to a low value, an assumption is made that regions with little texture are flat: the unambiguous disparity values at the left and right side edges of regions with little or no texture are then linearly interpolated within the region. When this assumption is not correct, such as between the tree branches in the MTStereo semi-dense disparity map of the Synth Garden example, the parameter can be increased such that no disparity values are assigned in the ambiguous region. Edge blurring artifacts are largely absent, as can be seen at the edges of the pipes in the disparity maps of the Middlebury example (Fig. 3, first row) or those of the tree in the estimated maps of the Synth Garden example (Fig. 3, second row). The disparity maps produced by our method, although sparse, are accurate and contain very few outliers. However, the estimated maps are dense enough to support applications such as robot navigation or visual servoing.

In Fig. 4, we show the difference between the accuracy of MTStereo 1.0 and 2.0 (lighter pixels have greater error). MTStereo 2.0 extracts depth information more robustly than MTStereo 1.0: the disparity maps it produces generally contain fewer estimates with large errors than those produced by MTStereo 1.0.

The time in seconds needed to process an image pair, normalized by its size in megapixels (time/MP), for sparse (semi-dense) estimation by MTStereo 1.0 was 0.54 (0.52) on Middlebury training, 0.39 (0.36) on Kitti2015 training, 0.37 (0.32) on Real Garden, 0.25 (0.22) on Synth Garden, 0.32 (0.29) on Driving, 0.28 (0.24) on Monkaa, and 0.34 (0.30) on Flying Things. For MTStereo 2.0, the corresponding times were 7.43 (7.23) on Middlebury training, 4.45 (4.05) on Kitti2015 training, 1.76 (1.93) on Real Garden, 1.72 (1.88) on Synth Garden, 4.02 (3.21) on Driving, 2.67 (3.21) on Monkaa, and 3.88 (3.95) on Flying Things.

The processing times of the methods included in this paper, as listed on the Middlebury and Kitti benchmark websites, are generally lower than those needed by MTStereo 2.0. The efficiency of our code can be increased through code-level optimization, e.g. of our cost volume implementation. On Kitti2015 training, the average relative time spent in the different steps of MTStereo 2.0 for sparse estimation was 24% computing the cost volume, 20% coarse node determination, 45% coarse-to-fine matching (of which 22% cost volume lookup), and 7% noise filtering, with the remaining processing time divided over the other steps. For semi-dense estimation it was 27% computing the cost volume, 12% coarse node determination, 40% coarse-to-fine matching (of which 31% cost volume lookup), and 17% noise filtering, with the remaining processing time divided over the other steps. A significant portion of the processing time is hence consumed by cost-volume-related tasks.

IV Conclusions

We proposed a stereo matching method, called MTStereo 2.0, for systems with limited computational resources that require efficient and accurate depth estimation. It improves on its predecessor MTStereo 1.0 as it: a) deploys a more robust cost function, b) performs more thorough detection of incorrect matches, c) computes disparity maps with pixel-level rather than node-level precision.

MTStereo 2.0 produces disparity maps which are generally more accurate than those produced by MTStereo 1.0, and does not require intensive GPU computations like methods based on CNNs. It can thus run on embedded and robotics devices with low-power requirements. The higher accuracy achieved by MTStereo 2.0 w.r.t. its predecessor is attributable to the thorough re-design that we made.

Our method achieves competitive results on several benchmark data sets: Middlebury 2014, KITTI 2015, Driving, FlyingThings3D, Monkaa and the TrimBot2020 garden data sets. The density of the MTStereo 2.0 disparity maps is sufficient for many robot navigation and visual servoing applications. This is demonstrated by the implementation that we release, together with its ROS version, at https://github.com/rbrandt1/MaxTreeS.

Acknowledgments

This work received support from the EU Horizon 2020 program, under the project TrimBot2020 (grant No. 688007).

References

  • [1] R. D. Arnold (1983) Automated stereo perception. Technical report, Stanford University, Dept. of Computer Science. Cited by: §I.
  • [2] R. Brandt, N. Strisciuglio, N. Petkov, and M. H.F. Wilkinson (2020) Efficient binocular stereo correspondence matching with 1-d max-trees. Pattern Recognition Letters. External Links: ISSN 0167-8655, Document, Link Cited by: §I, §II-A, §III-B.
  • [3] D. Chen, M. Ardabilian, X. Wang, and L. Chen (2013) An improved non-local cost aggregation method for stereo matching based on color and boundary cue. In IEEE ICME, pp. 1–6. Cited by: §I.
  • [4] Z. Chen, X. Sun, L. Wang, Y. Yu, and C. Huang (2015) A deep visual correspondence embedding model for stereo matching costs. In IEEE ICCV, pp. 972–980. Cited by: §I.
  • [5] X. Cheng, P. Wang, and R. Yang (2018) Learning depth with convolutional spatial propagation network. arXiv preprint arXiv:1810.02695. Cited by: §I.
  • [6] L. Cohen, L. Vinet, P. T. Sander, and A. Gagalowicz (1989) Hierarchical region based stereo matching. In IEEE CVPR, pp. 416–421. Cited by: §I.
  • [7] X. Du, M. El-Khamy, and J. Lee (2019) AMNet: deep atrous multiscale stereo disparity estimation networks. External Links: arXiv:1904.09099 Cited by: §I.
  • [8] N. Einecke and J. Eggert (2010) A two-stage correlation method for stereoscopic depth estimation. In DICTA, pp. 227–234. Cited by: TABLE III, TABLE IV.
  • [9] J. Engel, J. Stückler, and D. Cremers (2015) Large-scale direct slam with stereo cameras. In IEEE/RSJ IROS, pp. 1935–1942. Cited by: §I.
  • [10] A. Geiger, M. Lauer, C. Wojek, C. Stiller, and R. Urtasun (2013) 3d traffic scene understanding from movable platforms. IEEE Trans. Pattern Anal. Mach. Intell. 36 (5), pp. 1012–1025. Cited by: §I.
  • [11] A. Geiger, M. Roser, and R. Urtasun (2010) Efficient large-scale stereo matching. In Asian conference on computer vision, pp. 25–38. Cited by: §I, TABLE II, TABLE III, TABLE IV.
  • [12] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li (2019) Group-wise correlation stereo network. In CVPR, Cited by: §I.
  • [13] H. Hirschmuller and D. Scharstein (2007) Evaluation of cost functions for stereo matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: TABLE I.
  • [14] H. Hirschmuller (2008-02) Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 30 (2), pp. 328–341. External Links: Document Cited by: TABLE II, TABLE III, TABLE IV.
  • [15] D. Honegger, T. Sattler, and M. Pollefeys (2017) Embedded real-time multi-baseline stereo. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5245–5250. Cited by: TABLE II.
  • [16] R. A. Jellal, M. Lange, B. Wassermann, A. Schilling, and A. Zell (2017) LS-elas: line segment based efficient large scale stereo matching. In IEEE ICRA, pp. 146–152. Cited by: TABLE III, TABLE IV.
  • [17] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §I.
  • [18] Z. Liang, Y. Feng, Y. Guo, H. Liu, L. Qiao, W. Chen, L. Zhou, and J. Zhang (2017) Learning deep correspondence through prior and posterior feature constancy. arXiv preprint arXiv:1712.01039. Cited by: TABLE II.
  • [19] X. Luo, X. Bai, S. Li, H. Lu, and S. Kamata (2015) Fast non-local stereo matching based on hierarchical disparity prediction. arXiv preprint arXiv:1509.08197. Cited by: §I.
  • [20] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Note: arXiv:1512.02134 External Links: Link Cited by: TABLE I.
  • [21] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: TABLE I, §III-B, TABLE II.
  • [22] G. Medioni and R. Nevatia (1985) Segment-based stereo matching. Computer Vision, Graphics, and Image Processing 31 (1), pp. 2–18. Cited by: §I.
  • [23] M. Menze, C. Heipke, and A. Geiger (2015) Joint 3d estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA), Cited by: TABLE I.
  • [24] H. Oleynikova, D. Honegger, and M. Pollefeys (2015) Reactive avoidance using embedded stereo vision for mav flight. In IEEE ICRA, pp. 50–56. Cited by: §I.
  • [25] D. Peña and A. Sutherland (2017) Disparity estimation by simultaneous edge drawing. In ACCV 2016 Workshops, pp. 124–135. Cited by: TABLE III, TABLE IV.
  • [26] C. Pu, R. Song, R. Tylecek, N. Li, and R. Fisher (2019-02) SDF-man: semi-supervised disparity fusion with multi-scale adversarial networks. Remote Sensing 11 (5), pp. 487. External Links: ISSN 2072-4292, Link, Document Cited by: TABLE I, TABLE II.
  • [27] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. M. Lopez (2015) Vision-based offline-online perception paradigm for autonomous driving. In IEEE WCACV, pp. 231–238. Cited by: §I.
  • [28] P. Salembier, A. Oliveras, and L. Garrido (1998) Antiextensive connected operators for image and sequence processing. IEEE Transactions on Image Processing 7 (4), pp. 555–570. Cited by: §I, §II-A.
  • [29] T. Sattler, R. Tylecek, T. Brox, M. Pollefeys, and R. B. Fisher (2017) 3d reconstruction meets semantics–reconstruction challenge 2017. In ICCV Workshop, Venice, Italy, Tech. Rep, Cited by: TABLE I.
  • [30] D. Scharstein and R. Szeliski (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47 (1-3), pp. 7–42. Cited by: §I.
  • [31] K. Schmid, T. Tomic, F. Ruess, H. Hirschmüller, and M. Suppa (2013) Stereo vision based indoor/outdoor navigation for flying robots. In IEEE/RSJ IROS, pp. 3955–3962. Cited by: §I.
  • [32] X. Song, X. Zhao, L. Fang, and H. Hu (2019) EdgeStereo: an effective multi-task learning network for stereo matching and edge detection. arXiv preprint arXiv:1903.01700. Cited by: TABLE II.
  • [33] N. Strisciuglio, R. Tylecek, N. Petkov, P. Bieber, J. Hemming, E. van Henten, T. Sattler, M. Pollefeys, T. Gevers, T. Brox, and R. B. Fisher (2018) TrimBot2020: an outdoor robot for automatic gardening. In 50th International Symposium on Robotics, Cited by: TABLE I, §III-B.
  • [34] C. Sun (1997) A fast stereo matching method. In DICTA, pp. 95–100. Cited by: §I.
  • [35] S. Todorovic and N. Ahuja (2008) Region-based hierarchical image matching. International Journal of Computer Vision 78 (1), pp. 47–66. Cited by: §I, §II-A.
  • [36] R. Tylecek, T. Sattler, H. Le, T. Brox, M. Pollefeys, R. B. Fisher, and T. Gevers (2019) The second workshop on 3d reconstruction meets semantics: challenge results discussion. In ECCV 2018 Workshops, pp. 631–644. External Links: ISBN 978-3-030-11015-4 Cited by: TABLE I.
  • [37] J. Valentin, A. Kowdle, J. T. Barron, N. Wadhwa, M. Dzitsiuk, M. Schoenberg, V. Verma, A. Csaszar, E. Turner, I. Dryanovski, et al. (2018) Depth from motion for smartphone ar. In SIGGRAPH Asia, pp. 193. Cited by: TABLE III, TABLE IV.
  • [38] J. Weng, N. Ahuja, T. S. Huang, et al. (1988) Two-view matching.. In ICCV, Vol. 88, pp. 64–73. Cited by: §II-A.
  • [39] M. H. Wilkinson (2011) A fast component-tree algorithm for high dynamic-range images and second generation connectivity. In IEEE ICIP, pp. 1021–1024. Cited by: §II-A.
  • [40] Q. Yang, P. Ji, D. Li, S. Yao, and M. Zhang (2014) Fast stereo matching using adaptive guided filtering. Image and Vision Computing 32 (3), pp. 202–211. Cited by: §II-A.
  • [41] K. Yoon and I. S. Kweon (2006) Adaptive support-weight approach for correspondence search. IEEE Trans. Pattern Anal. Mach. Intell (4), pp. 650–656. Cited by: §I.
  • [42] J. Zbontar, Y. LeCun, et al. (2016) Stereo matching by training a convolutional neural network to compare image patches.. J MACH LEARN RES 17 (1-32), pp. 2. Cited by: §I.
  • [43] F. Zhang, V. Prisacariu, R. Yang, and P. Torr (2019) GA-net: guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [44] K. Zhang, J. Lu, and G. Lafruit (2009) Cross-based local stereo matching using orthogonal integral images. IEEE transactions on circuits and systems for video technology 19 (7), pp. 1073–1079. Cited by: §I.