Stereo vision is an effective technique for depth estimation with broad applicability in autonomous urban and highway driving. While various deep learning-based approaches have been developed for stereo, the input data from a binocular setup with a fixed baseline are limited. To address this limitation, we present an end-to-end network for processing the data from a trinocular setup, which is a combination of a narrow and a wide stereo pair. In this design, two pairs of binocular data with a common reference image are processed with shared network weights and a mid-level fusion. We also propose a Guided Addition method for merging the 4D data of the two baselines. Additionally, an iterative sequential self-supervised and supervised learning scheme on real and synthetic datasets is presented, making the training of the trinocular system practical without requiring ground-truth data for the real dataset. Experimental results demonstrate that the trinocular disparity network surpasses the scenario where individual pairs are fed into a similar architecture. Code and dataset: https://github.com/cogsys-tuebingen/tristereonet.
While different solutions exist for estimating depth, stereo matching fits the widest range of use cases. Passive recovery of depth maps via stereo has received attention in the computer vision community for the past three decades. Estimating depth from images can benefit many real-world applications, of which autonomous driving and robot navigation are the most prominent. Active technologies for depth estimation, like Laser Imaging Detection and Ranging (LiDAR) sensors, measure depth based on the travel time of a light beam emitted by the device. However, in addition to its costly setup, LiDAR yields a sparse depth map and is vulnerable to weather conditions.
Stereo depth estimation works by finding the disparities between matching points in the rectified images, from which depth follows directly by triangulation. Typically, two images are used, i.e., a binocular setting. There is an abundance of traditional and deep learning-based strategies for estimating the disparity from a pair of images.
Before deep learning, various hand-engineered features, like Sum of Absolute Differences or Census Transform [zabih1994non], were used for finding matching points and computing the cost volume, followed by a regularization module, such as the renowned Semi Global Matching (SGM) method [hirschmuller2005accurate]. Deep learning-based methods, on the other hand, train a network either for some steps of stereo vision [vzbontar2016stereo, seki2017sgm] or for the whole pipeline as an end-to-end technique [kendall2017end, chang2018pyramid, guo2019group, zhang2019ga]. The latter group has substantially improved the accuracy of stereo vision. The most promising strategies in this regard are the models in which 3D convolutions are applied on top of a 4D cost volume [kendall2017end, chang2018pyramid, guo2019group]. Nevertheless, the applicability of these approaches in real-world scenarios is still an issue due to their bias towards the content of the training images, including the varied object distances to the camera.
Using multiple images for depth estimation is another alternative, which can yield more accurate results by providing more visual cues of the scene. Moreover, using more than one fixed baseline is important for many applications, like autonomous driving. A narrow baseline is essential for depth at close range, as in low- or moderate-speed urban driving with nearby vehicles and pedestrians, while a wider baseline is vital for accurate depth at long range, as in high-speed highway driving. Note that, for a fixed baseline, depth resolution degrades quadratically with depth [gallup2008variable].
In this work, we design a three-view stereo setup (Fig. 1) and propose a deep network (Fig. 2) for estimating the disparity from three inputs. Our work considers two baselines of this trinocular setup with a shared reference image to obtain accurate disparity maps. To the best of our knowledge, the proposed model is the first that processes the multi-baseline trinocular setting in an end-to-end deep learning-based manner. The previous multi-baseline stereo works [honegger2017embedded, kallwies2018effective, kallwies2020triple] rely on traditional stereo approaches.
The main contributions of this work are summarized as follows: i) We design a horizontally-aligned multi-baseline trinocular stereo setup for more accurate depth estimation. Accordingly, an end-to-end deep learning-based model is proposed for processing the 3-tuple input. ii) We propose a new layer for merging the disparity-related 4D data for a mid-level fusion. We also investigate other levels and methods of fusion. iii) An iterative sequential self-supervised and supervised learning scheme is proposed to make the design applicable to new real-world scenarios where no disparity annotations are available. iv) We build a synthetic dataset for the trinocular design together with the ground-truth information, which is publicly available.
Classical Stereo Vision. Classical algorithms for stereo vision can mainly be divided into three modules: matching cost computation, cost regularization, and disparity optimization. For cost regularization, local approaches calculate disparities based on the neighboring pixels [chen2001fast, muhlmann2002calculating], with the drawback of sensitivity to occlusions and uniform texture. Global methods exploit non-local constraints on the disparity changes for higher accuracy, at the price of higher computational complexity. Semi-global approaches combine both views: they predict disparities locally for small regions while also accounting for the overall content of the images [hirschmuller2005accurate]. Later classical work mostly focused on improving the accuracy and speed of semi-global estimation [gehrig2007improving, michael2013real].
Deep Learning-based Stereo. With the rise of deep learning, stereo matching continued to be reformed by these modern techniques. Following the general paradigm for stereo reconstruction, deep models can be divided into two categories: the methods that formulate one or some of the steps with a deep learning framework [vzbontar2016stereo, batsos2018cbmv, seki2017sgm], and the approaches that transfer the full process in an end-to-end scheme [mayer2016large, kendall2017end, chang2018pyramid, guo2019group, zhang2019ga, shamsafar2021mobilestereonet]. Following the recent research, our model is also an end-to-end one, processing a 3-tuple sample.
Multi-view Stereo. Depth reconstruction can be conducted in a multi-view setting as well. The advantage of this strategy is the improved robustness to occlusion or surface texture. Since multi-view stereo is usually designed to deal with a large number of viewpoints, these algorithms are developed differently compared to the classical two-view stereo vision techniques [furukawa2015multi]. As an example, a related dataset is DTU [aanaes2016large], consisting of 49 or 64 captured images per scene. More importantly, in multi-view methods, the images are captured at unconstrained camera poses, which limits the possibility of fusing the information from different pairs of images. In our multi-view design, we have an array of axis-aligned cameras, in which the matching pixels lie on the same horizontal line. As a result, the fusion of the stereo pairs is achievable.
Multi-baseline Stereo. This concept was initially investigated back in 1993 [okutomi1993multiple] to benefit from narrow and wide baselines. In [okutomi1993multiple], the fusion was applied after computing the Sum of Squared Differences (SSD) of images. More recently, the authors of [kallwies2018effective, kallwies2020triple] extended the standard two-view stereo to a three-view setup by adding a camera on top of the left one. This L-shape configuration creates a vertical and a horizontal baseline. In these works, classical stereo methods, Census Transform and SGM, were utilized with disparity-level and cost volume-level fusion. An FPGA-based multi-baseline stereo system with four cameras was developed in [honegger2017embedded] via Census Transform and with no regularization like SGM. The authors showed that their setup is better at recovering fine structures than a binocular setup with SGM. Building on this line of work, we develop a multi-baseline stereo method that is fully deep learning-based.
Two-view stereo is a low-cost strategy with dense disparity prediction and higher depth range and resolution than LiDAR-based and monocular depth estimation techniques. However, depending on the operating environment, it suffers from limitations coming from the fixed baseline. In standard stereo, depth resolution drops quadratically with depth [gallup2008variable], making the narrow baseline suitable for closer objects and the wide baseline more fitting for far range. Our motivation for using three cameras for stereo is to leverage both the narrow and the wide baselines for accurate depth estimation. Additionally, in such a setup, the visual data of closer objects that are missed in the field of view of the wider baseline can be recovered by the narrow one. This formulation is particularly necessary for driving scenarios where diversified near- and far-range objects appear.
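To make the baseline trade-off concrete, the following minimal sketch evaluates the first-order depth error that follows from the triangulation relation Z = f·B/d, namely ΔZ ≈ Z²/(f·B)·Δd. The focal length, the two baselines, and the one-pixel disparity error are illustrative assumptions and not the parameters of our setup.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulated depth Z = f * B / d for a rectified stereo pair."""
    return focal_px * baseline_m / disparity_px


def depth_resolution(focal_px, baseline_m, depth_m, disparity_err_px=1.0):
    """First-order depth error dZ ~ Z^2 / (f * B) * dd at depth Z."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_err_px


if __name__ == "__main__":
    focal_px = 1000.0  # assumed focal length in pixels (hypothetical)
    for baseline_m in (0.2, 0.4):  # assumed narrow and wide baselines
        for depth_m in (10.0, 40.0):
            err = depth_resolution(focal_px, baseline_m, depth_m)
            print(f"B = {baseline_m} m, Z = {depth_m} m: dZ ~ {err:.2f} m")
```

Under these assumed values, quadrupling the depth multiplies the error by sixteen, while doubling the baseline halves it.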
The schematic of our multi-baseline trinocular setup with horizontally-aligned cameras is illustrated in Fig. 1. As identical cameras are placed in parallel with known and constrained displacements, not only can stereo matching be performed between different camera pairs, but these data can also be fused for accurate and robust prediction. In our formulation, we consider two baselines, left-middle (LM) and left-right (LR), with the left image as reference. Note that it is possible to consider other stereo pairs or other reference images. Still, as we evaluate our method in driving scenarios, we fuse the information from the driver's viewpoint (for right-hand traffic).
Theoretically, the disparities (displacements) between the matching points in the left-middle and left-right pairs depend on the baselines. That is, given $B_{LR} = r\,B_{LM}$, then $d_{LR} = r\,d_{LM}$, with $B$ denoting the baseline and $d$ as the notation for disparity. This can be proved via triangulation for a fixed object.
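The triangulation argument can be written out in one line. The symbols $f$ (focal length in pixels), $Z$ (depth of the fixed object), and $B_{LM}$, $B_{LR}$ (the two baselines) are notation assumed here for illustration:

```latex
d_{LM} = \frac{f\,B_{LM}}{Z}, \qquad
d_{LR} = \frac{f\,B_{LR}}{Z}
\quad\Longrightarrow\quad
\frac{d_{LR}}{d_{LM}} = \frac{B_{LR}}{B_{LM}} .
```

Since $f$ and $Z$ are shared by both pairs for identical, rectified cameras, they cancel, and the disparity ratio equals the baseline ratio at every depth.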
In this system, both the LM and LR pairs contribute to disparity estimation. More importantly, their fusion adds constraints to the problem, since the two disparities must satisfy $d_{LR}/d_{LM} = B_{LR}/B_{LM}$, i.e., their ratio must equal the ratio of the baselines. Hence, this setup benefits from the wide baseline for more accurate estimation of distant objects and from the narrow baseline for depth estimation of closer objects. Note that close objects may be missed in the joint field of view of the cameras with a wide baseline.
The TriStereoNet architecture is depicted in Fig. 2. We use GwcNet [guo2019group] as our model backbone, which was originally proposed for standard binocular stereo. The network is coarsely divided into four main modules: feature extraction, cost volume construction, pre-hourglass (four 3D convolutions), and a 3D convolutional hourglass as encoder-decoder. We limit the encoder-decoder to a single hourglass architecture for efficiency.
In our formulation for three views, the ResNet-like feature extraction [chang2018pyramid, guo2019group] is shared for the three input images (). For an input image of size , the size of each feature data () is . After computing the three feature data for the 3-tuple sample, two cost volumes are computed by group-wise correlation [guo2019group] (with number of groups as 40) for the left-middle pair () and the left-right pair (), outputting two 4D cost data of size (). Note that is the maximum disparity range, which is assumed to be 192 for the original input image size.
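As a rough illustration of group-wise correlation [guo2019group], the sketch below builds a cost volume for a single scanline in plain Python. The list-based layout (`feat[c][x]`), the function name, and the toy sizes are assumptions made for readability; the network itself operates on 4D tensors with 40 groups.

```python
def groupwise_correlation(feat_ref, feat_tgt, num_groups, max_disp):
    """Group-wise correlation along one scanline.

    feat_ref, feat_tgt: lists over channels, each a list over x positions.
    Returns cost[g][d][x]: the mean, over the channels of group g, of
    feat_ref[c][x] * feat_tgt[c][x - d]; zero where x - d is out of range.
    """
    num_channels = len(feat_ref)
    width = len(feat_ref[0])
    if num_channels % num_groups != 0:
        raise ValueError("channels must be divisible by the number of groups")
    per_group = num_channels // num_groups

    cost = [[[0.0] * width for _ in range(max_disp)] for _ in range(num_groups)]
    for g in range(num_groups):
        channels = range(g * per_group, (g + 1) * per_group)
        for d in range(max_disp):
            for x in range(d, width):
                dot = sum(feat_ref[c][x] * feat_tgt[c][x - d] for c in channels)
                cost[g][d][x] = dot / per_group
    return cost
```

In the trinocular case, the same routine would be called twice with the shared left features as `feat_ref`, once with the middle and once with the right features as `feat_tgt`.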
Aligning the Cost Volumes. One notable point is that the baselines of the left-middle and left-right pairs are different. As a result, the disparity range of the two cost volumes should be aligned for fusion. Accordingly, we need to consider to align the disparity values with . Here, stand for spatial dimension and for feature dimension of the 4D cost data. In our setup, the ratio between the baselines is . To compute at non-integer values of the disparity dimension, we need to interpolate the values across the corresponding dimension. To this end, we utilize natural cubic splines with a function as eq. 1 for each sub-interval (). For more details on how the related coefficients can be computed, we refer the reader to [mckinley1998cubic].
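The alignment step can be sketched for a single cost curve, i.e., the values along the disparity dimension at one pixel and one feature channel. The sketch fits a natural cubic spline on integer disparity knots and resamples the narrow-baseline curve at `d / ratio`. The resampling direction (narrow onto the wide disparity axis), the function names, and the clamping at the borders are assumptions of this example.

```python
def natural_spline_second_derivs(y):
    """Second derivatives at the knots x = 0..n-1 of the natural cubic spline."""
    n = len(y)
    if n < 3:
        raise ValueError("need at least three samples")
    m = n - 2  # number of interior unknowns; the end values are zero
    rhs = [6.0 * (y[i - 1] - 2.0 * y[i] + y[i + 1]) for i in range(1, n - 1)]
    # Thomas algorithm for the tridiagonal system with diagonal 4, off-diagonals 1.
    c = [0.0] * m
    d = [0.0] * m
    c[0] = 1.0 / 4.0
    d[0] = rhs[0] / 4.0
    for k in range(1, m):
        denom = 4.0 - c[k - 1]
        c[k] = 1.0 / denom
        d[k] = (rhs[k] - d[k - 1]) / denom
    inner = [0.0] * m
    inner[-1] = d[-1]
    for k in range(m - 2, -1, -1):
        inner[k] = d[k] - c[k] * inner[k + 1]
    return [0.0] + inner + [0.0]


def spline_eval(y, second_derivs, x):
    """Evaluate the spline at a (possibly non-integer) position x, clamped to range."""
    n = len(y)
    x = min(max(x, 0.0), n - 1.0)
    i = min(int(x), n - 2)
    t = x - i
    m0, m1 = second_derivs[i], second_derivs[i + 1]
    return (m0 * (1.0 - t) ** 3 / 6.0 + m1 * t ** 3 / 6.0
            + (y[i] - m0 / 6.0) * (1.0 - t) + (y[i + 1] - m1 / 6.0) * t)


def align_narrow_to_wide(cost_narrow, ratio):
    """Resample a narrow-baseline cost curve so that out[d] = cost_narrow(d / ratio)."""
    second_derivs = natural_spline_second_derivs(cost_narrow)
    return [spline_eval(cost_narrow, second_derivs, d / ratio)
            for d in range(len(cost_narrow))]
```

For a full 4D cost volume, the same resampling would be applied independently at every spatial location and feature channel.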
Fusion of Wide- & Narrow-Baseline Data. An important assumption for this fusion is that all three images are rectified. At this stage, the main questions are where and how to apply the fusion of the narrow- and wide-baseline data. For this, we propose a Guided Addition (GA) module for merging two streams of the 4D disparity data after pre-hourglass. Note that in this architecture, there are mainly three levels for fusion (Fig. 3): i) after cost volume computation, ii) after pre-hourglass, and iii) after hourglass. If data after cost volume computation () are processed with a few more convolutions, with pre-hourglass module, their aggregation obtains higher accuracy. Fusion after hourglass also outperforms direct fusion of the cost volumes; however, the complexity increases as the network operates two 4D data (instead of one) in the hourglass with all heavy 3D convolutions.
Fig. 4 shows our proposed Guided Addition module for fusing the 4D data. After applying depth-wise 3D convolution across the feature dimension and 3D batch normalization, this layer merges the data by addition. The data size is retained by using a kernel size of 3, a stride of 1, and the same number of channels (features).
We investigated other fusion methods at the cost volume level: addition, average, concatenation, maximization, and top feature selection. The last method takes the largest elements from the two data streams across their feature dimension. We also examined average and top feature selection for pre-hourglass and hourglass fusion. However, with regard to accuracy and efficiency, we adopted Guided Addition after pre-hourglass for fusion (cf. Table 3). We also observed that fusion by addition and average at the cost level, and top feature selection at the pre-hourglass and hourglass levels, do not outperform the single wide baseline (LR), indicating the impact of the appropriate fusion method.
By building a trinocular synthetic dataset with ground-truth information, we can apply supervised learning via a loss function between the estimated and the ground-truth disparity maps. To this end, we employ Huber loss as eq.3, with and as the ground-truth and estimated disparity maps, and as the number of image pixels. is the threshold for the scaled L1 and L2 loss. We will show that the value of affects the performance. In previous works on disparity estimation [chang2018pyramid, guo2019group, zhang2019ga, duggal2019deeppruner, shen2021cfnet], smooth L1 loss is used for this goal.
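For reference, the sketch below implements the Huber loss in its standard textbook form, quadratic below the threshold and linear above it, next to the smooth L1 loss. The function names and the flat-list input are assumptions of this example, and the normalization may differ from the exact form of the loss used in our implementation.

```python
def huber_loss(pred, target, delta):
    """Mean Huber loss: 0.5 * e^2 if |e| <= delta, else delta * (|e| - 0.5 * delta)."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


def smooth_l1_loss(pred, target):
    """Mean smooth L1 loss: 0.5 * e^2 if |e| < 1, else |e| - 0.5."""
    total = 0.0
    for p, t in zip(pred, target):
        err = abs(p - t)
        total += 0.5 * err * err if err < 1.0 else err - 0.5
    return total / len(pred)
```

With a threshold of 1 the two functions coincide, while a smaller threshold makes the loss switch to the linear regime at smaller disparity errors.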
To demonstrate the performance of TriStereoNet on real-world data, and since providing ground-truth labels for a real dataset is costly, we utilize unsupervised learning via a self-supervision loss. In particular, we consider a photometric loss and a disparity smoothness loss. To be more precise, we reconstruct the reference image by warping the target one using the estimated disparity. We then use a combination of SSIM [wang2004image] and L1 loss as in [godard2017unsupervised] (eq. 4) to measure the image discrepancy between the reconstructed and the original image.
is the reconstructed image. We set and use a SSIM with a block filter. The reconstructed image is computed according to the disparity map and the baseline. We consider two photometric losses corresponding to the reconstructed left image as eq. 5 and will show that outperforms . Note that for warping the middle image to reconstruct the left image, we need to use because of .
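The reconstruction step can be sketched on a single scanline. The example warps a target row towards the left view with linear interpolation and measures an L1 photometric error; the SSIM term is omitted for brevity. It assumes that the estimated disparity is expressed on the wide (left-right) baseline, so that warping the middle image uses the disparity scaled by the inverse baseline ratio. All names are hypothetical.

```python
def warp_to_left(target_row, disparity_row, scale=1.0):
    """Reconstruct the left row by sampling the target row at x - scale * d(x)."""
    width = len(target_row)
    out = []
    for x in range(width):
        src = x - disparity_row[x] * scale
        src = min(max(src, 0.0), width - 1.0)  # clamp to the image border
        x0 = int(src)
        x1 = min(x0 + 1, width - 1)
        a = src - x0
        out.append((1.0 - a) * target_row[x0] + a * target_row[x1])
    return out


def l1_photometric(reconstructed, reference, start=0):
    """Mean absolute difference from index `start` on (skips unobserved columns)."""
    diffs = [abs(r - o) for r, o in zip(reconstructed[start:], reference[start:])]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    left = [float(x * x) for x in range(10)]
    # Toy scene at constant depth: disparity 2 px on the wide pair, 1 px on the narrow pair.
    right = [left[min(x + 2, 9)] for x in range(10)]
    middle = [left[min(x + 1, 9)] for x in range(10)]
    disparity = [2.0] * 10
    print(l1_photometric(warp_to_left(right, disparity), left, start=2))
    print(l1_photometric(warp_to_left(middle, disparity, scale=0.5), left, start=2))
```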
For disparity smoothness loss, we adopt the edge-aware smoothness loss in [heise2013pm] (eq. 6) to encourage the disparity to be locally smooth. This is an L1 loss on the disparity gradients weighted by image gradients.
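A one-dimensional sketch of such an edge-aware term is given below. The exponential weighting exp(-|∂I|) is a common choice and is an assumption here; the exact weighting in eq. 6 may differ.

```python
import math


def edge_aware_smoothness(disparity_row, image_row):
    """Mean of |d(x+1) - d(x)| * exp(-|I(x+1) - I(x)|) along one scanline."""
    terms = []
    for x in range(len(disparity_row) - 1):
        grad_d = abs(disparity_row[x + 1] - disparity_row[x])
        grad_i = abs(image_row[x + 1] - image_row[x])
        terms.append(grad_d * math.exp(-grad_i))
    return sum(terms) / len(terms)
```

A disparity jump that coincides with an image edge is penalized less than the same jump in a textureless region, which is the intended behavior at object boundaries.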
Final Loss. The final loss is as follows:
where are loss coefficients. Empirically, the coefficients are set as . We formulate this loss as a hybrid supervised and self-supervised training. That is, when training on the synthetic dataset, the photometric loss is ignored, , and when the real dataset is used, .
Iterative Sequential Learning Scheme. Here, we propose a training approach, which is highly efficient with no need for the ground-truth disparity map of the real dataset. First, we initialize the training by self-supervised learning on the real dataset with and loss functions. Then, moving on with the synthetic dataset, we employ and losses for supervised learning. This way, we iteratively optimize the network with self-supervised and supervised learning on the real and synthetic datasets, respectively.
The advantage of this technique is three-fold: Firstly, we can efficiently train on the real-world data where no ground-truth is available. Secondly, the supervision provided via the synthetic dataset assists the network in learning fine details as considering only image reconstruction loss () presents a coarse matching. Finally, self-supervision with no ground-truth helps the network learn the underlying principles of the trinocular setup, hindering the model from overfitting to the ground-truth labels.
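The alternation described above can be summarized in a short control-flow sketch. The two training callbacks, the model object, and the history log are placeholders assumed for illustration; they stand in for full self-supervised and supervised training runs.

```python
def iterative_sequential_training(model, real_data, synthetic_data, num_iterations,
                                  train_self_supervised, train_supervised):
    """Alternate self-supervised training on real data and supervised training on synthetic data."""
    history = []
    for iteration in range(num_iterations):
        # Stage 1: real dataset, no ground truth (photometric and smoothness losses).
        model = train_self_supervised(model, real_data)
        history.append((iteration, "self-supervised/real"))
        # Stage 2: synthetic dataset with ground-truth disparity (supervised loss).
        model = train_supervised(model, synthetic_data)
        history.append((iteration, "supervised/synthetic"))
    return model, history
```

Each iteration starts from the weights produced by the previous stage, so the supervised stage always refines a model that has already seen the real data.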
Synthetic Dataset. We generated a synthetic dataset using CARLA [dosovitskiy2017carla], consisting of RGB images of three cameras on an axis such that the middle camera is centered in the left-right baseline (38.6 ). This dataset spans various features provided in CARLA, weather condition, day time, traffic, number of people, and location. Out of the built 25 configurations, the dataset includes 9649/2413 training/test samples. These images with a resolution of are rectified, and stereo matching can be performed using any pair of viewpoints. Based on the proposed model and assuming the left image as the shared reference, we only use the left-middle and left-right pairs. Figure 5 shows a 3-tuple sample of this dataset embedded with horizontal lines.
Real Dataset. For the real dataset, we consider the trinocular set of images collected by [HER11, SCH11]. This dataset is a collection of tricamera stereo sequences with eight sets as Harbour bridge, Barriers, Dusk, Queen street, People, Midday, Night and Wiper. We ignore the Dusk and Night sets for their too bright/dark illumination. The images are in 10-bit gray-scale with a resolution of . Similar to our CARLA dataset, each 3-tuple sample satisfies the standard epipolar geometry (Fig. 5). From all the samples, we have split the dataset into 1920/480 training/test samples. This dataset does not include ground-truth disparity maps.
Evaluation Metrics. We evaluate the performance of TriStereoNet in terms of: i) EPE or average End-Point-Error: the mean of absolute error among the valid pixels, ii) D1: the percentage of pixels whose estimation error is or of the ground-truth disparity, iii) px-1: the portion of pixels for which absolute error is , iv) MRE or Mean Relative Error [van2006real]: the mean of absolute error among the valid pixels divided by the ground-truth disparity, v) px-re-1: we define this metric as the percentage of pixels for which MRE is . This metric is actually the average BMPRE defined in [cabezas2012bmpre], which integrates the benefits of px-1 and MRE. Both MRE and px-re-1 consider depth (and not disparity) error.
In recent works on stereo matching, only the EPE and D1 measures are evaluated. We believe that considering the MRE and px-re-1 metrics is equally important, as they better represent the actual estimation error in terms of depth (and not disparity), which is the ultimate goal of stereo matching. Note that EPE, D1, and px-1 do not yield higher error values for larger triangulation errors [van2006real].
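The absolute and relative measures can be sketched as follows for flat lists of disparities. The function names, the validity rule (ground truth greater than zero), and the thresholds passed in the usage lines are assumptions of this example rather than the exact definitions used in our evaluation.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels with a valid (positive) ground-truth disparity."""
    return [(p, g) for p, g in zip(pred, gt) if g > 0]


def epe(pred, gt):
    """End-point error: mean absolute disparity error over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) for p, g in pairs) / len(pairs)


def mre(pred, gt):
    """Mean relative error: absolute error divided by the ground-truth disparity."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def outlier_rate(pred, gt, threshold, relative=False):
    """Percentage of valid pixels whose (absolute or relative) error exceeds threshold."""
    pairs = _valid_pairs(pred, gt)
    errors = [abs(p - g) / g if relative else abs(p - g) for p, g in pairs]
    return 100.0 * sum(1 for e in errors if e > threshold) / len(errors)


if __name__ == "__main__":
    pred = [10.0, 22.0, 29.0, 41.0]
    gt = [10.0, 20.0, 30.0, 40.0]
    print(epe(pred, gt), mre(pred, gt))
    print(outlier_rate(pred, gt, 1.0))                  # absolute threshold in pixels
    print(outlier_rate(pred, gt, 0.05, relative=True))  # relative threshold (hypothetical)
```

Because depth is inversely proportional to disparity, a fixed absolute disparity error corresponds to a larger relative error, and hence a larger depth error, at small disparities; the relative measures capture this, whereas the absolute ones do not.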
Implementation Details. For training, random crops of are used for both of the datasets. Testing is evaluated on crops of
of the synthetic images. With a batch size of 8, we trained our model on four Nvidia GeForce GTX 1080Ti GPUs. For the synthetic dataset, the learning rate starts from 0.001 and is downscaled by a factor of 2 after epochs (in 30 epochs) with the Adam optimizer [kingma2014adam]. As for the real dataset, we train the model for 60 epochs with a learning rate downscaled by 2 after epochs . We also set the threshold of the Huber loss to 0.25.
Quantitative and Qualitative Results. The evaluation on the synthetic dataset is presented in Table 1. We applied four iterations for iterative sequential learning. Namely, each iteration starts by learning on the real dataset (via self-supervision) and then on the synthetic dataset (via supervised training).
From the first iteration to the 4th round, performance improves by approximately 12%, 15%, 13%, 16%, 20% reduction in EPE, D1, px-1, MRE, and px-re-1 measures, respectively. This gain decreases after the 4th iteration. In Fig. 6, we can see the evaluation based on EPE for different iterations of the sequential training. Note that although the learning converges to a reasonable degree in each case, restarting the learning by self-supervision on the real dataset enhances the supervised training on the synthetic dataset. Also, note how this learning mechanism surpasses the vanilla training, from scratch with no self-supervision.
Figure 7 depicts some qualitative results together with the error maps on the synthetic dataset. It is clear that the mechanism of iterative sequential training enhances the estimations. Results on the real dataset, together with their 3D reprojections, are presented in Fig. 8. Note that this dataset does not have ground-truth information. Besides, these images are challenging, as they differ from the synthetic dataset apart from both being driving scenarios.
We evaluate how TriStereoNet performs on a binocular dataset, where the viewpoint of the middle camera is missing. For this, we assume the right image also serves as the middle one, and finetune the network on a 159/40 training/validation split of the KITTI 2015 dataset [menze2015object]. The evaluation results are tabulated in Table 2, including the error rates and the computational complexity in terms of number of operations (MAC) and parameters. We can see that TriStereoNet is capable of estimating the disparity in a binocular setup as well, with comparable accuracy and yet with less complexity than binocular models. Notably, it surpasses GA-Net-11 with 40% fewer GigaMACs and is competitive with GA-Net-deep with 66%/36% fewer operations/parameters. We believe the reason lies in multi-view and multi-baseline pre-training and also self-supervision, which help the model to better learn the principles of stereo matching with more constraints coming from the setup. Figure 9 shows some qualitative results on the validation set.
Fusion Level and Fusion Method. We first analyze the effect of the fusion level (location of the fusion) and also the method of fusion for combining the narrow- and wide-baseline data. Table 3 shows the related results when models are only trained on the synthetic dataset. In all of these experiments, the models are trained with 30 epochs, and the best checkpoint is selected based on the least EPE on the test set.
In cost-level fusion, the proposed Guided Addition layer outperforms the other fusion techniques (except in EPE). We also analyzed average and simple addition at this level but discarded them, as they are no better than the single wide baseline. The gain in performance increases when fusion is applied after the pre-hourglass and hourglass. The comparison with average fusion shows that not only the level of fusion matters, but the fusion method affects the performance as well.
The method shows more accurate results in terms of MRE and px-re-1 errors. We believe that this is because the hourglass module further processes two data streams. This regularization aggregates the disparity information regarding pixels’ locality, avoiding the many irrelevant spurious values in the final estimation. Nevertheless, we selected for merging the data as it comes with less computational complexity than in . Note that requires the processing of two data flow (instead of one) in a heavy encoder-decoder with 3D convolutions.
Trinocular vs. Binocular. To demonstrate the performance boost provided by the trinocular setup in comparison to the binocular case, we have used a similar network architecture for a single pair of images, either the left-middle or the left-right. Results are reported in Table 3. TriStereoNet, which fuses the two baseline streams, outperforms the single pairs in terms of all metrics.
Learning Scheme. Table. 4 presents the impact of self-supervised initialization with the real dataset before training on the synthetic dataset in a supervised manner. It also compares the two methods by which we can reconstruct the reference image. It is clear that self-supervised initialization improves the results, particularly when is used for self-supervision. Figure 10 shows the qualitative comparison of these approaches. We also evaluated the case with summation of both the photometric losses of and , but it was not better than alone.
Disparity Loss. Last but not least, we evaluated the effect of the Huber loss for comparing disparity maps. Table 5 shows that Huber loss outperforms smooth L1 loss for both binocular and trinocular settings. Note that Huber loss is equivalent to smooth L1 loss when its threshold equals 1. The results also confirm our choice of 0.25 for the threshold.
This paper proposed a deep end-to-end network for disparity estimation in a multi-baseline trinocular setup. The pipeline processes two pairs of images from an axis-aligned three-camera configuration with narrow and wide baselines. We introduced a new layer for effectively merging the information of the two baselines. A synthetic dataset was generated to support the proposed iterative sequential learning on real and synthetic datasets. With this learning mechanism, we can train on a real dataset with no ground-truth information. Experiments show that the proposed method outperforms disparity estimation from each individual image pair. This multi-baseline deep model is promising for building safe and reliable autonomous driving applications, where the image content is diverse and changes in terms of the distance to the camera.