1 Introduction
Depth from stereo vision has been heavily studied in the computer vision field for the last few decades. Depth estimation has various applications in autonomous driving, dense reconstruction, and 3D object and human tracking. Virtual Reality and Augmented Reality systems require depth estimates to build dense spatial maps of the environment for interaction and scene understanding. For proper rendering and interaction between virtual and real objects in an augmented 3D world, the depth is expected to be both dense and correct around object boundaries. Depth sensors such as structured light and time-of-flight sensors are often used to build such spatial maps of indoor environments. These sensors often use illumination sources that require more power and space than the expected budget of an envisioned AR system. Since these sensors use infrared vision, they do not work well in bright sunlight or in the presence of other infrared sources.
On the other hand, depth-from-stereo systems have the strong advantage of working both indoors and in sunlit environments. Since these systems use passive image data, they do not interfere with each other or with the materials in the environment. Moreover, the resolution of passive stereo systems is typically greater than that of the sparse patterns used in structured light depth sensors, so these methods are capable of producing depth with accurate object boundaries and corners. Due to recent advancements in camera and mobile technology, image sensors have dramatically reduced in size and have significantly improved in resolution and image quality. All these qualities make a passive stereo system a better fit as the depth estimator for an AR or VR system. However, stereo systems have their own disadvantages, such as ambiguous predictions on textureless surfaces or surfaces with repeating or confusing textures. In order to deal with these homogeneous regions, traditional methods make use of handcrafted functions and optimize the parameters globally on the entire image. Recent methods use machine learning to derive the functions and their parameters from the training data. As these functions tend to be highly nonlinear, they tend to yield reasonable approximations even on homogeneous and reflective surfaces.
Our key contributions are as follows:
• Novel Disparity Refinement Network: The main motivation of our work is to predict geometrically consistent disparity maps for stereo input that can be directly used by a TSDF-based fusion system like KinectFusion [15] for simultaneous tracking and mapping. Surface normals are an important factor in fusion weight computation in KinectFusion-like systems, and we observed that state-of-the-art stereo systems such as PSMNet produce disparity maps that are not geometrically consistent, which negatively affects TSDF fusion. To address this issue, we propose a novel refinement network which takes geometric error, photometric error and unrefined disparity as input and produces refined disparity (via residual learning) and the occlusion map.
• 3D Dilated Convolutions in Cost Filtering: State-of-the-art stereo systems such as PSMNet [2] and GCNet [7] that use a 3D cost filtering approach spend most of their computational resources in the filtering module. We observe that using 3D dilated convolutions in all three dimensions (i.e., width, height, and disparity channels) in the structure shown in Fig. 4 gave us better results with less compute (refer to Table 1).
• Other Contributions: We observe that Vortex Pooling provides better results than the spatial pyramid pooling used in PSMNet (refer to the ablation study in Table 2). We found the exclusion masks used to filter non-confident regions of ground truth when fine-tuning our model, as discussed in Sec. 4.4, to be very useful in obtaining sharp edges and fine details in disparity predictions.
We achieve 1.3-2.1 cm RMSE on 3D reconstructions of three scenes that we prepared using the structured light system proposed in [25].
2 Related Work
Depth from stereo has been widely explored in the literature; we refer interested readers to the surveys and methods described in [20]. Broadly speaking, stereo matching can be categorized into computation of cost metrics, cost aggregation, global or semi-global optimization [4], and refinement or filtering processes. Traditionally, global cost filtering approaches used discrete labeling methods such as Graph Cuts [11] or belief propagation techniques described in [10] and [1]. Total Variation denoising [19] has been used in cost filtering by the methods described in [29], [16] and [14].
The state of the art in disparity estimation uses CNNs. MC-CNN [30] introduced a Siamese network to compare two image patches. The matching scores were used along with the semi-global matching process [4] to predict consistent disparity estimates. DispNet [13] demonstrates an end-to-end disparity estimation neural network with a correlation layer (dot product of features) for stereo volume construction. Liang et al. [12] improved DispNet by introducing a novel iterative filtering process. GCNet [7] introduces a method to filter the 4D cost using a 3D cost filtering approach and the soft argmax process to regress depth. PSMNet [2] improved GCNet by enriching features with better global context using a spatial pyramid pooling process. They also show effective use of stacked residual networks in the cost filtering process.
Xie et al. [26] introduce vortex pooling, which is an improvement of the atrous spatial pooling approach used in DeepLab [3]. Atrous pooling uses convolutions with various dilation steps to increase the receptive field of a CNN filter. The vortex pooling technique uses average pooling in grids of varying dimensions before the dilated convolutions to utilize information from the pixels which were not used in bigger dilation steps. The size of the average pooling grids grows with the dilation size. We use the feature extraction described in vortex pooling and improve the cost filtering approach described by PSMNet.
Our proposed refinement network takes geometric error, photometric error and unrefined disparity as input and produces refined disparity (via residual learning) and the occlusion map. The refinement procedures proposed in CRL [17], iResNet [12], StereoNet [8] and FlowNet2 [5] only use photometric error (either in the image or feature domain) as part of the input to the refinement networks. To the best of our knowledge we are the first to explore the importance of geometric error and occlusion training for disparity refinement.
3 Algorithm
In this section we describe our architecture that predicts disparity for the input stereo pair. Instead of using a generic encoder-decoder CNN we break our algorithm into feature extraction, cost volume filtering and refinement procedures.
3.1 Feature Extraction
The feature extraction starts with a small shared-weight Siamese network which takes the images as input and encodes them into a set of features. As these features will be used for stereo matching, we want them to carry both local and global contextual information. To encode local spatial information in our feature maps we start by downsampling the input using convolutions with a stride of 2. Instead of a single large convolution we use three 3x3 filters, where the first convolution has a stride of 2. We bring the resolution to a fourth by having two such blocks. In order to encode more contextual information we apply Vortex Pooling [26] to the learned local feature maps (Fig. 3). Each of our convolutions is followed by batch normalization and ReLU activation, except the last 3x3 convolution on the spatial pooling output. In order to keep the feature information compact we keep the feature dimension size at 32 throughout the feature extraction process.
3.2 Cost Volume Filtering
We use the features extracted in the previous step to produce a stereo cost volume. While several approaches in the literature ([7],[13]) use concatenation or dot products of the stereo features to obtain the cost volume, we found simple arithmetic difference to be just as effective.
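As a minimal sketch of the difference-based cost volume described above, the snippet below builds the volume for a single scanline in plain Python. The function name `difference_cost_volume`, the zero padding for out-of-view pixels and the toy data are illustrative assumptions, not the implementation used in the paper.

```python
def difference_cost_volume(left_feats, right_feats, max_disp):
    """Build a cost volume for one scanline by subtracting shifted features.

    left_feats, right_feats: lists of per-pixel feature vectors (equal length).
    Returns volume[d][x] = left_feats[x] - right_feats[x - d], or a zero
    vector where x - d falls outside the image (an assumed padding choice).
    """
    width = len(left_feats)
    channels = len(left_feats[0])
    volume = []
    for d in range(max_disp):
        row = []
        for x in range(width):
            if x - d >= 0:
                diff = [l - r for l, r in zip(left_feats[x], right_feats[x - d])]
            else:
                diff = [0.0] * channels
            row.append(diff)
        volume.append(row)
    return volume


# Toy scanline: the right view is the left view shifted by 2 pixels.
left = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
right = left[2:] + [[0.0], [0.0]]
volume = difference_cost_volume(left, right, max_disp=4)
```

In this toy case the difference vanishes at disparity 2 for every pixel that is visible in both views, which is the minimum the subsequent filtering is meant to make unambiguous.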
While a simple argmin on the cost should in principle lead to the correct local minimum solution, it has been shown several times in the literature [16], [4], [20] that it is common for the solution to have several local minima. Surfaces with homogeneous or repeating texture are particularly prone to this problem. By posing the cost filtering as a deep learning process with multiple convolutions and nonlinear activations we attempt to resolve these ambiguities and find the correct local minimum.
We start by processing our cost volume with a convolution along the width, height and depth dimensions. We then reduce the resolution of the cost by a convolution with stride of 2 followed by convolutions with dilation 1, 2, 4 in parallel. A convolution on the concatenation of the dilated convolution filters is used to combine the information fetched from varying receptive fields.
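To make the effect of the parallel dilation rates concrete, the following sketch computes the extent covered by a dilated kernel along one axis using the standard relation extent = dilation * (kernel_size - 1) + 1. The helper name is hypothetical and the kernel size of 3 is taken from the architecture table.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent (in voxels along one axis) covered by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# A 3-tap kernel at the dilation rates used in the parallel branches.
sizes = {rate: effective_kernel_size(3, rate) for rate in (1, 2, 4)}
```

All three branches use the same number of weights, so the wider context at rates 2 and 4 comes at no extra parameter cost; concatenating the branches lets the following convolution mix information from the three extents.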
Residual learning has been shown to be very effective in the disparity refinement process, so we propose a cascade of such blocks to iteratively improve the quality of our disparity prediction. We depict the entire cost filtering process as Dilated Residual Cost Filtering in Fig. 4. In this figure notice how our network is designed to produce three disparity maps, one after each block of the cascade.
Our network architecture that supports refinement predicts disparities for both the left and right view as separate channels in each disparity prediction. Note that we construct the cost for both left and right views and concatenate them before filtering; this ensures that the cost filtering method is provided with cost information for both views. Please refer to Table 8 for exact architecture details.
3.3 Disparity Regression
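In order to have a differentiable argmax we use the soft argmax proposed by GCNet [7], applied to the negated costs. For each pixel $i$ the regressed disparity estimate $d_i$ is defined as a softmax-weighted sum over the disparity levels:
$$d_i = \sum_{d=0}^{N} d \cdot \frac{\exp(-C_i(d))}{\sum_{d'=0}^{N} \exp(-C_i(d'))} \tag{1}$$
where $C_i(d)$ is the cost at pixel $i$ for disparity $d$ and $N$ is the maximum disparity. The loss $L_k$ for each of the proposed disparity maps $d^k$ (as shown in Fig. 4) in our dilated residual cost filtering architecture relies on the Huber loss $\rho$ and is defined as:
$$L_k = \frac{1}{M} \sum_{i=1}^{M} \rho\left(d_i^k, \hat{d}_i\right) \tag{2}$$
where $d_i^k$ and $\hat{d}_i$ are the estimated and ground truth disparity at pixel $i$, respectively, and $M$ is the total number of pixels. The total data loss $L_d$ is defined as:
$$L_d = \sum_{k=1}^{3} w_k L_k \tag{3}$$
where $w_k$ is the weight for each disparity map $d^k$.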
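The regression and data loss of this section can be sketched for a single pixel as follows. This is an illustrative sketch under stated assumptions: costs are negated before the softmax (as in GCNet), disparity levels start at 0, and the Huber threshold `delta=1.0` is an assumed value that the text does not specify. All function names are hypothetical.

```python
import math


def soft_argmin(costs):
    """Softmax-weighted disparity for one pixel; costs[d] is the cost at level d."""
    m = min(costs)  # shift for numerical stability; does not change the result
    weights = [math.exp(-(c - m)) for c in costs]
    total = sum(weights)
    return sum(d * w for d, w in enumerate(weights)) / total


def huber(pred, target, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(pred - target)
    return 0.5 * err * err if err <= delta else delta * (err - 0.5 * delta)


def data_loss(preds_per_stage, target, stage_weights):
    """Weighted sum over stages of the mean per-pixel Huber loss."""
    total = 0.0
    for preds, w in zip(preds_per_stage, stage_weights):
        stage = sum(huber(p, t) for p, t in zip(preds, target)) / len(target)
        total += w * stage
    return total


disparity = soft_argmin([5.0, 5.0, 0.0, 5.0, 5.0])
loss = data_loss([[1.5], [3.0], [1.0]], [1.0], [0.2, 0.4, 0.6])
```

Because the result is a weighted average rather than a hard index, it is differentiable and can take sub-pixel values, which is what allows the loss to be back-propagated through the cost filtering.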
3.4 Disparity Refinement
In order to make the disparity estimation robust to occlusions and consistent across views we further optimize the estimate. For brevity we label the third disparity prediction $d^3$ ($k = 3$) described in Sec. 3.2 for the left view as $D_l$ and for the right view as $D_r$. In our refinement network we warp the right image $I_r$ to the left view via the warp $W$ and evaluate the image reconstruction error map $E_p$ for the left image $I_l$ as:
$$E_p = \left| I_l - W(I_r, D_l) \right| \tag{4}$$
By warping $D_r$ to the left view and using the left disparity $D_l$ we can evaluate the geometric consistency error map $E_g$ as:
$$E_g = \left| D_l - W(D_r, D_l) \right| \tag{5}$$
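The two error maps can be illustrated on a single scanline. The sketch below assumes that warping to the left view means sampling the right-view signal at x minus the left disparity, with linear interpolation, and that pixels whose source falls outside the image are marked as undefined; these choices and the helper names are assumptions for illustration rather than details confirmed by the text.

```python
def warp_to_left(right_row, left_disp):
    """Sample a right-view scanline at x - left_disp[x] (linear interpolation).

    Returns None for pixels whose source position lies outside the image.
    """
    width = len(right_row)
    out = []
    for x in range(width):
        src = x - left_disp[x]
        if src < 0 or src > width - 1:
            out.append(None)
            continue
        x0 = int(src)
        x1 = min(x0 + 1, width - 1)
        t = src - x0
        out.append((1 - t) * right_row[x0] + t * right_row[x1])
    return out


def abs_error(left_row, warped_row):
    """Per-pixel absolute difference, undefined where the warp is undefined."""
    return [None if w is None else abs(l - w) for l, w in zip(left_row, warped_row)]


# Toy scene with a constant disparity of 2 pixels.
left_img = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
right_img = [30.0, 40.0, 50.0, 60.0, 0.0, 0.0]
left_disp = [2.0] * 6
right_disp = [2.0] * 6

photometric_error = abs_error(left_img, warp_to_left(right_img, left_disp))
geometric_error = abs_error(left_disp, warp_to_left(right_disp, left_disp))
```

With a correct disparity both errors vanish wherever the pixel is visible in both views; large values therefore indicate either a wrong disparity or an occlusion, which is why the maps are informative inputs for refinement.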
While we could just reduce these error terms directly into a loss function, we observed significant improvement by using the photometric and geometric consistency error maps as input to the refinement network, as these error terms are only meaningful for non-occluded pixels (the only pixels for which the consistency errors can be reduced).
Our refinement network takes as input the left image $I_l$, the left disparity map $D_l$, the image reconstruction error map $E_p$ and the geometric error map $E_g$. We first filter the left image together with the reconstruction error, and the left disparity together with the geometric error map, independently, using one convolution layer followed by batch normalization. Both results are then concatenated and followed by atrous convolutions [18] to sample from a larger context without increasing the network size. We used dilation rates of 1, 2, 4, 8, 1, and 1, respectively. Finally, a single convolution without ReLU or batch normalization outputs an occlusion map $O$ and a disparity residual map $R$. Our final refined disparity map is labeled as $D_{ref}$. We illustrate our refinement network in Fig. 5 and provide exact architecture details in Table 7.
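We compute the cross entropy loss on the occlusion map $O$ as
$$L_o = H(O, \hat{O}) \tag{6}$$
where $H$ denotes the cross entropy and $\hat{O}$ is the ground truth occlusion map.
The refinement loss is defined as
$$L_r = \frac{1}{M} \sum_{i=1}^{M} \rho\left(d_i^{ref}, \hat{d}_i\right) \tag{7}$$
where $d_i^{ref}$ is the value for pixel $i$ in our refined disparity map, $\hat{d}_i$ is the ground truth disparity, $\rho$ is the Huber loss and $M$ is the total number of pixels.
Our total loss function is defined as
$$L = L_d + \lambda_1 L_r + \lambda_2 L_o \tag{8}$$
where $L_d$ is the total data loss of Eq. 3 and $\lambda_1$ and $\lambda_2$ are scalar weights.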
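A minimal sketch of how the occlusion loss and the total loss might be combined is given below. The binary cross entropy averaged over pixels, the clamping constant `eps`, and the assignment of the weight 1.2 to the refinement term and 0.3 to the occlusion term are assumptions made for illustration; the function names are hypothetical.

```python
import math


def occlusion_cross_entropy(pred_probs, gt_labels, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, o in zip(pred_probs, gt_labels):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total -= o * math.log(p) + (1 - o) * math.log(1 - p)
    return total / len(gt_labels)


def total_loss(data_loss, refine_loss, occlusion_loss, lam_refine=1.2, lam_occ=0.3):
    """Data loss plus weighted refinement and occlusion terms (assumed ordering)."""
    return data_loss + lam_refine * refine_loss + lam_occ * occlusion_loss


occ = occlusion_cross_entropy([0.5, 0.5], [1, 0])
combined = total_loss(1.0, 1.0, 1.0)
```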
3.5 Training
We implemented our neural network code in PyTorch. We tried to keep the training of our neural network similar to the one described in PSMNet [2] for ease of comparison. We used the Adam optimizer [9] with $\beta_1$ = 0.9 and $\beta_2$ = 0.999 and normalized the image data before passing it to the network. In order to optimize the training procedure we cropped the images to 512x256 resolution. For training we used a mini-batch size of 8 on 2 Nvidia Titan Xp GPUs. We used the weights $w_1$ = 0.2, $w_2$ = 0.4, $w_3$ = 0.6, $\lambda_1$ = 1.2 and $\lambda_2$ = 0.3 in our proposed loss functions Eq. 3 and Eq. 8.
4 Experiments
We tested our architecture on rectified stereo datasets such as SceneFlow, KITTI 2012, KITTI 2015 and ETH3D. We also demonstrate the utility of our system in building 3D reconstruction of indoor scenes. See the supplementary section for additional visual comparisons.
4.1 SceneFlow Dataset
SceneFlow [13] is a synthetic dataset with over 35,000 stereo pairs for training and around 4,000 stereo pairs for evaluation. We use both left and right ground truth disparities for training our network. We compute the ground truth occlusion map by defining as occluded any pixel with a disparity inconsistency larger than 1 px. This dataset is challenging due to the presence of occlusions, thin structures and large disparities.
In Fig. 6 we visually compare our results with PSMNet [2]. Our system infers better structural details in the disparity image and also produces consistent depth maps with significantly fewer errors in homogeneous regions. We further visualize the effect of our refinement network in Fig. 13.
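The occlusion labelling rule can be sketched on one scanline as a left-right consistency check. In this sketch the matching right-view pixel is found by rounding x minus the left disparity, and pixels that project outside the right image are also treated as occluded; both are assumptions for illustration, as is the function name.

```python
def occlusion_mask(left_disp, right_disp, threshold=1.0):
    """Mark a left-view pixel occluded if left and right disparities disagree.

    The right-view disparity is looked up at the pixel the left disparity
    points to; a difference above `threshold` pixels counts as inconsistent.
    """
    width = len(left_disp)
    mask = []
    for x in range(width):
        xr = int(round(x - left_disp[x]))
        if xr < 0 or xr >= width:
            mask.append(True)  # assumed: out-of-view pixels count as occluded
        else:
            mask.append(abs(left_disp[x] - right_disp[xr]) > threshold)
    return mask


left_disp = [2.0, 2.0, 2.0, 2.0, 5.0, 5.0]
right_disp = [2.0] * 6
mask = occlusion_mask(left_disp, right_disp)
```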
Table 1 shows a quantitative analysis of our architecture with and without refinement network. StereoDRNet achieves significantly lower end point error while reducing computation time. Our proposed cost filtering approach achieves better accuracy with significantly less compute, demonstrating the effectiveness of the proposed dilated residual cost filtering approach.
Ablation study: In Table 2 we show a complete EPE breakdown for different parts of our network on the SceneFlow dataset. Both vortex pooling and the refinement procedure add marginal performance gains. Co-training the occlusion map with the residual disparity improves the mean end point error of the final disparity from 0.93 px to 0.86 px. Passing only the photometric error into the refinement network actually degrades the performance.
Table 1: Quantitative comparison on the SceneFlow dataset.
Method  EPE (px)  Total FLOPS  3DConv FLOPS  FPS
CRL [17]  1.32  -  -  2.1
GCNet [7]  2.51  8789 GMac  8749 GMac  1.1
PSMNet [2]  1.09  2594 GMac  2362 GMac  2.3
Ours  0.98  1410 GMac  1119 GMac  4.3
Ours-Ref  0.86  1711 GMac  1356 GMac  3.6
Network Architecture  SceneFlow  KITTI2015  
Pooling  Cost Filtering  Refinement  EPE  Val Error(%)  
Pyramid  ✓  1.17  2.28  
Vortex  ✓  1.13  2.14  
Vortex  ✓  ✓  0.99  1.88  
Vortex  ✓  ✓  ✓  0.98  1.74  
Pyramid  ✓  ✓  ✓  1.00  1.81  
Vortex  ✓  ✓  ✓  ✓  1.03    
Vortex  ✓  ✓  ✓  ✓  0.95    
Vortex  ✓  ✓  ✓  ✓  ✓  0.93    
Vortex  ✓  ✓  ✓  ✓  ✓  ✓  0.86   
Pyramid  ✓  ✓  ✓  ✓  ✓  ✓  0.96   
Table 3: Results on the KITTI 2012 benchmark.
Method  2px Noc  2px All  3px Noc  3px All  Avg Error Noc  Avg Error All  Time (s)
GCNet [7]  2.71  3.46  1.77  2.30  0.6  0.7  0.90
EdgeStereo [22]  2.79  3.43  1.73  2.18  0.5  0.6  0.48
PDSNet [24]  3.82  4.65  1.92  2.53  0.9  1.0  0.50
SegStereo [27]  2.66  3.19  1.68  2.03  0.5  0.6  0.60
PSMNet [2]  2.44  3.01  1.49  1.89  0.5  0.6  0.41
Ours  2.29  2.87  1.42  1.83  0.5  0.5  0.23
Table 4: Results on the KITTI 2015 benchmark.
Method  All D1-bg (%)  All D1-fg (%)  All D1-all (%)  Noc D1-bg (%)  Noc D1-fg (%)  Noc D1-all (%)  Time (s)
DN-CSS [6]  2.39  5.71  2.94  2.23  4.96  2.68  0.07
GCNet [7]  2.21  6.16  2.87  2.02  5.58  2.61  0.90
CRL [17]  2.48  3.59  2.67  2.32  3.12  2.45  0.47
EdgeStereo [22]  2.27  4.18  2.59  2.12  3.85  2.40  0.27
PDSNet [24]  2.29  4.05  2.58  2.09  3.69  2.36  0.50
PSMNet [2]  1.86  4.62  2.32  1.71  4.31  2.14  0.41
SegStereo [27]  1.88  4.07  2.25  1.76  3.70  2.08  0.60
Ours  1.72  4.95  2.26  1.57  4.58  2.06  0.23
Table 5: Results on the ETH3D benchmark.
Method  All 1px  All 2px  All 4px  All RMSE  Noc 1px  Noc 2px  Noc 4px  Noc RMSE
PSMNet [2]  5.41  1.31  0.54  0.75  5.02  1.09  0.41  0.66
iResNet [12]  4.04  1.20  0.34  0.59  3.68  1.00  0.25  0.51
DN-CSS [6]  3.00  0.96  0.34  0.56  2.69  0.77  0.26  0.48
Ours  4.84  0.96  0.30  0.55  4.46  0.83  0.24  0.50
4.2 KITTI Datasets
We evaluated our method on both the KITTI 2015 and KITTI 2012 datasets. These data sets contain stereo pairs with semi-dense depth images acquired using a LIDAR sensor that can be used for training. The KITTI 2012 dataset contains 194 training and 193 test stereo image pairs from static outdoor scenes. The KITTI 2015 dataset contains 200 training and 200 test stereo image pairs from both static and dynamic outdoor scenes.
Training and ablation study: Since the KITTI data sets contain only a limited amount of training data, we fine-tuned our model pretrained on the SceneFlow dataset. We used 80% of the stereo pairs for training and 20% for evaluation. We report the ablation study of our proposed method on the KITTI 2015 dataset in Table 2. Note how our proposed dilated residual architecture and the use of Vortex pooling for feature extraction consistently improve the results. We did not achieve significant gains from refinement on the KITTI datasets as these datasets only contain labeled depth for sparse pixels. Our refinement procedure improves disparity predictions using view consistency checks, and the sparsity of the ground truth data affected the training procedure. We demonstrate that data sets with denser training data enabled the training and fine-tuning of our refinement model.
Results: We evaluated our dilated residual network without filtering on both these datasets and achieved state-of-the-art results on KITTI 2012 (Table 3) and results comparable with the best published method on KITTI 2015 (Table 4). On the KITTI 2015 dataset the three columns “D1-bg”, “D1-fg” and “D1-all” mean that the pixels in the background, foreground, and all areas, respectively, were considered in the estimation of errors. We perform consistently well in “D1-bg”, i.e. background areas; we achieve results comparable with the state-of-the-art method on all pixels and better results in non-occluded regions. On the KITTI 2012 dataset “Noc” means non-occluded regions and “All” means all regions. Notice that we perform comparably to SegStereo [27] on KITTI 2015 but considerably better on the KITTI 2012 dataset.
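For readers unfamiliar with the metrics, the sketch below computes the end point error and an outlier percentage on sparse ground truth. It assumes the KITTI 2015 convention that a pixel is an outlier when its error exceeds both 3 px and 5% of the ground truth disparity; setting `rel_thresh=0.0` gives a pure pixel-threshold metric in the style of KITTI 2012. Missing ground truth is represented by `None`, and the function names are hypothetical.

```python
def end_point_error(pred, gt):
    """Mean absolute disparity error over pixels that have ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g is not None]
    return sum(abs(p - g) for p, g in valid) / len(valid)


def outlier_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of labelled pixels whose error exceeds both thresholds."""
    valid = [(p, g) for p, g in zip(pred, gt) if g is not None]
    bad = sum(1 for p, g in valid
              if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g)
    return 100.0 * bad / len(valid)


pred = [10.0, 20.0, 104.0, 50.0]
gt = [10.5, 24.0, 100.0, None]  # last pixel has no LIDAR measurement
epe = end_point_error(pred, gt)
d1 = outlier_rate(pred, gt)
three_px = outlier_rate(pred, gt, rel_thresh=0.0)
```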
4.3 ETH3D Dataset
We again used our network pretrained on the SceneFlow dataset and fine-tuned it on the training set provided with the ETH3D dataset. The ETH3D dataset contains challenging scenes of both outdoor and indoor environments. According to Table 5 we perform best on almost half of the evaluation metrics; our major competitor in this evaluation was DN-CSS [6], although we observe that this method did not perform well on the KITTI 2015 data set (Table 4). Note that, as this data set contains dense training disparity maps of both stereo views, we were able to train and evaluate our refinement network on it.
4.4 Indoor Scene Reconstruction
We use the scanning rig from recent work [25] to prepare a ground truth dataset for supervised learning of depth, and added one more RGB camera to the rig to obtain a stereo image pair. We kept the baseline of the stereo pair at about 10 cm. We trained our StereoDRNet network on SceneFlow as described in Section 4.1 and then fine-tuned the pretrained network on 250 stereo pairs collected in the indoor area by our scanning rig. We observed that the network quickly adapted to our stereo rig with a minimal amount of fine-tuning.
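To relate disparity accuracy to metric depth for a rig like this one, the sketch below applies the standard rectified-stereo relation Z = f * B / d and its first-order sensitivity. The 10 cm baseline comes from the text; the focal length of 500 px and the example disparity are hypothetical values chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point with the given disparity in a rectified pair."""
    return focal_px * baseline_m / disparity_px


def depth_change_per_pixel(depth_m, focal_px, baseline_m):
    """First-order depth change caused by a one-pixel disparity error."""
    return depth_m * depth_m / (focal_px * baseline_m)


depth = depth_from_disparity(25.0, focal_px=500.0, baseline_m=0.10)
sensitivity = depth_change_per_pixel(depth, focal_px=500.0, baseline_m=0.10)
```

Under these assumed values a point at 2 m moves by about 8 cm per pixel of disparity error, so centimetre-level reconstruction accuracy requires disparity errors well below one pixel at room-scale distances.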
For preparing ground truth depth we found rendered depth from complete scene reconstruction to be a better estimate than the live sensor depth which usually suffers from occlusions and depth uncertainties. Truncated signed distance function (TSDF) was used to fuse live depth maps into a scene as described in [15].
Infrared structured light depth sensors are known to be unresponsive to dark and highly reflective surfaces. Moreover, the quality of TSDF fusion is limited by the voxel resolution. Hence we expect the reconstructions to be overly smooth in some areas such as table corners or sharp edges of plant leaves. In order to avoid contaminating our training data with false depth estimates, we use a simple photometric error threshold to mask out the pixels from training where the color of the projected textured model disagrees with the real images. We show one such example in Fig. 7, where glass, mirrors and the sharp corners of the table are excluded from training. Although the system from Whelan et al. [25] can obtain ground truth planes of mirrors and glass, we avoid depth supervision on them in this work as it is beyond the scope of a stereo matching procedure to obtain depth on reflectors.
We show visualizations of the depth predictions from the stereo pair in Fig. 8. Notice that our prediction is able to recover sharp corners of the table, thin reflective legs of the chair and several thin structures in the kitchen dataset as a result of the filtering process used in training. It is interesting to see that we recover the top part of the glass correctly but not the bottom part, which suffers from reflections. The stereo matching model simply treats reflectors as windows in the presence of reflections.
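The exclusion mask can be sketched as a per-pixel threshold on the difference between the rendered model color and the observed image, followed by a loss that skips the masked pixels. The threshold of 0.1 on intensities in [0, 1], the use of a mean absolute error and the function names are assumptions for illustration.

```python
def exclusion_mask(rendered, observed, threshold):
    """True where the rendered model colour disagrees with the real image."""
    return [abs(r - o) > threshold for r, o in zip(rendered, observed)]


def masked_mean_abs_error(pred, target, excluded):
    """Mean absolute disparity error over pixels that are not excluded."""
    kept = [abs(p - t) for p, t, e in zip(pred, target, excluded) if not e]
    return sum(kept) / len(kept) if kept else 0.0


rendered = [0.20, 0.50, 0.90, 0.40]   # intensities rendered from the textured model
observed = [0.22, 0.48, 0.30, 0.41]   # intensities in the captured image
mask = exclusion_mask(rendered, observed, threshold=0.1)
loss = masked_mean_abs_error([10.0, 12.0, 30.0, 8.0], [10.5, 12.0, 5.0, 8.5], mask)
```

In this toy example the third pixel, where the model and the image disagree strongly, is dropped, so its unreliable ground truth does not contribute to the training signal.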
Results and evaluations: We show visualizations of the full 3D reconstruction of a living room in an apartment, prepared by TSDF fusion of the predicted depth maps from our system, in Fig. 9. For the evaluation study we prepared three small data sets that we refer to as “Sofa and cushions” (Fig. 1), “Plants and couch” and “Kitchen and bike” (Fig. 10). We report the point-to-plane root mean squared error (RMSE) of the reconstructed 3D meshes obtained by fusing depth maps from PSMNet [2] and from our refined network. We obtain an RMSE of 1.3 cm on the simpler “Sofa and cushions” dataset. Note that our method captured high frequency structural details on the cushions which were not captured by PSMNet or the structured light sensor. “Plants and couch” represents a more difficult scene as it contained a directed light source casting shadows. For this dataset StereoDRNet obtained 2.1 cm RMSE whereas PSMNet obtained 2.5 cm RMSE. Notice that our reconstruction is not only cleaner but also produces minimal errors in the shadowed areas (shadows cast by the book shelf and the left plant). The “Kitchen and bike” dataset is cluttered and contains reflective objects, making it the hardest dataset. While our system still achieved 2.1 cm RMSE, the performance of PSMNet degraded to 2.8 cm RMSE. Notice that our reconstruction contains the faucet (highlighted by the yellow box), in contrast to the structured light sensor and PSMNet reconstructions. For all evaluations we used exactly the same training dataset for fine-tuning our StereoDRNet and PSMNet.
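The point-to-plane RMSE reported here can be sketched as follows. The sketch assumes that each reconstructed point has already been paired with its closest reference point and that the reference normals have unit length; how correspondences are found is outside its scope, and the function name is hypothetical.

```python
import math


def point_to_plane_rmse(points, ref_points, ref_normals):
    """RMSE of distances from points to the tangent planes of a reference surface.

    points[i] is compared against the plane through ref_points[i] with unit
    normal ref_normals[i]; correspondences are assumed to be given.
    """
    sq = 0.0
    for p, q, n in zip(points, ref_points, ref_normals):
        dist = sum((pi - qi) * ni for pi, qi, ni in zip(p, q, n))
        sq += dist * dist
    return math.sqrt(sq / len(points))


# Two points lying 1 cm above and 2 cm below a horizontal reference plane.
points = [(0.0, 0.0, 0.01), (1.0, 0.0, -0.02)]
ref_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ref_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
rmse_m = point_to_plane_rmse(points, ref_points, ref_normals)
```

Measuring along the reference normal ignores sliding within the surface, so the metric penalizes only deviations from the reference geometry.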
5 Conclusion
Depth estimation from passive stereo images is a challenging task. Systems from related work suffer in regions with homogeneous texture or surfaces with shadows and specular reflections. Our proposed network architecture uses global spatial pooling and dilated residual cost filtering techniques to approximate the underlying geometry even in the above-mentioned challenging scenarios. Furthermore, our refinement network produces geometrically consistent disparity maps with the help of occlusion and view consistency cues. The use of perfect synthetic data and careful filtering of real training data enabled us to recover thin structures and sharp object boundaries. Finally, we demonstrate that our passive stereo system, when used for building 3D scene reconstructions in challenging indoor scenes, approaches the quality of state-of-the-art structured light systems [25].
Supplementary
A Overview
In this supplementary material, we provide additional details of the training and evaluation procedure of our indoor scene reconstruction experiments. We also describe our proposed network architecture in depth and show the effect of the proposed refinement procedure on the reconstruction quality. We share the results of the ablation study on the dilated convolutions used in our cost filtering approach and visually compare the disparity predictions from our system with state-of-the-art methods on the KITTI and ETH3D benchmarks.
B 3D Reconstruction Experiments
For all 3D reconstruction experiments and evaluations we used a set of about 200 stereo views shown in Fig. 11 to fine-tune the SceneFlow [13]-pretrained networks.
We show the textured 3D reconstructions of our indoor scene dataset in Fig. 12. Note that we used KinectFusion [15] to fuse the depth maps into 3D spatial maps. We did not use any structure-from-motion (SfM) or external localization method for estimating camera trajectories. Hence, the camera views visualized in Fig. 12 are the output of the ICP (iterative closest point) procedure used by the KinectFusion [15] system. We used manual adjustment followed by ICP to align the 3D reconstructions wherever necessary for our evaluations.
C Network Details
We provide the network architecture of StereoDRNet in Table 8. We borrowed ideas on extracting robust local image features from PSMNet [2]. As described in the paper, we use Vortex Pooling [26] for extracting global scene context. In our experiments we found dilation rates 3, 5 and 15 and average pooling grids of size 3x3, 5x5 and 15x15 to improve disparity predictions more than the configuration proposed in the original work for semantic segmentation.
Table 6: Ablation of the 3D dilation rates used in cost filtering, evaluated on SceneFlow.
rate = 1  rate = 2  rate = 4  rate = 8  EPE (px)
✓  -  -  -  1.13
✓  ✓  -  -  1.03
✓  ✓  ✓  -  0.98
✓  ✓  ✓  ✓  1.01
In order to show the effectiveness of the proposed dilated convolutions in cost filtering, we conduct an ablation study in Table 6 on the SceneFlow [13] dataset. We observed that increasing dilation rates improved the quality of predictions. Dilation rates above 4 did not provide any significant gains.
Table 7: Refinement network architecture.
Index  Layer Description  Output
1  Warp of right image to left view, minus left image  H x W x 3
2  concat 1, left image  H x W x 6
3  Warp of right disparity to left view, minus left disparity  H x W x 1
4  concat 3, left disparity  H x W x 2
5  3x3 conv on 2, 16 features  H x W x 16
6  3x3 conv on 4, 16 features  H x W x 16
7  concat 5, 6  H x W x 32
8-13  (3x3 conv, residual block) x 6, dil rate 1, 2, 4, 8, 1, 1  H x W x 32
14  3x3 conv, 2 features as 14(a) and 14(b)  H x W x 2
15  Refined disparity: 14(a) + left disparity  H x W
16  O: sigmoid on 14(b)  H x W
Rows 15 and 16 represent refined disparity and occlusion probability respectively.
Index  Layer Description  Output 

1  Input Image  H x W x 3 
Local feature extraction  
2  3x3 conv, 32 features, stride 2  H/2 x W/2 x 32 
3-4  (3x3 conv, 32 features) x 2  H/2 x W/2 x 32 
5-7  (3x3 conv, 32 features, res block) x 3  H/2 x W/2 x 32 
8  3x3 conv, 32 features, stride 2  H/4 x W/4 x 32 
9-22  (3x3 conv, 64 features, res block) x 15  H/4 x W/4 x 64 
23-28  (3x3 conv, 128 features, res block) x 6  H/4 x W/4 x 128 
Spatial Pooling  
29  Global Avg Pool on 28, bilinear interp  H/4 x W/4 x 128 
30  Avg Pool 3x3 on 28, conv 3x3, dil rate 3  H/4 x W/4 x 128 
31  Avg Pool 5x5 on 28, conv 3x3, dil rate 5  H/4 x W/4 x 128 
32  Avg Pool 15x15 on 28, conv 3x3, dil rate 15  H/4 x W/4 x 128 
33  Concat 22, 28, 29, 30, 31 and 32  H/4 x W/4 x 704 
34  3x3 conv, 128 features  H/4 x W/4 x 128 
35  1 x 1 conv, 32 features without BN and ReLU  H/4 x W/4 x 32 
Cost Volume  
36  Subtract left 35 from right 35 with D/4 shifts, vice versa  D/4 x H/4 x W/4 x 64 
Cost Filtering  
37-38  (3x3x3 conv, 32 features) x 2  D/4 x H/4 x W/4 x 32 
39  3x3x3 conv, 32 features, stride 2  D/8 x H/8 x W/8 x 32 
40  3x3x3 conv, 32 features  D/8 x H/8 x W/8 x 32 
41  3x3x3 conv on 39, 32 features  D/8 x H/8 x W/8 x 32 
42  3x3x3 conv on 39, 32 features, dil rate 2  D/8 x H/8 x W/8 x 32 
43  3x3x3 conv on 39, 32 features, dil rate 4  D/8 x H/8 x W/8 x 32 
44  3x3x3 conv on concat(41,42,43), 32 features  D/8 x H/8 x W/8 x 32 
45  3x3x3 deconv, 32 features, stride 2  D/4 x H/4 x W/4 x 32 
46  Pred1: 3x3x3 conv on 45 + 38  D/4 x H/4 x W/4 x 2 
47  3x3x3 conv on 45, 32 features, stride 2  D/8 x H/8 x W/8 x 32 
48  3x3x3 conv + 40, 32 features  D/8 x H/8 x W/8 x 32 
49  3x3x3 conv on 48, 32 features  D/8 x H/8 x W/8 x 32 
50  3x3x3 conv on 48, 32 features, dil rate 2  D/8 x H/8 x W/8 x 32 
51  3x3x3 conv on 48, 32 features, dil rate 4  D/8 x H/8 x W/8 x 32 
52  3x3x3 conv on concat(49,50,51), 32 features  D/8 x H/8 x W/8 x 32 
53  3x3x3 deconv, 32 features, stride 2  D/4 x H/4 x W/4 x 32 
54  Pred2: 3x3x3 conv on 53 + 38  D/4 x H/4 x W/4 x 2 
55  3x3x3 conv on 53, 32 features, stride 2  D/8 x H/8 x W/8 x 32 
56  3x3x3 conv + 48, 32 features  D/8 x H/8 x W/8 x 32 
57  3x3x3 conv on 56, 32 features  D/8 x H/8 x W/8 x 32 
58  3x3x3 conv on 56, 32 features, dil rate 2  D/8 x H/8 x W/8 x 32 
59  3x3x3 conv on 56, 32 features, dil rate 4  D/8 x H/8 x W/8 x 32 
60  3x3x3 conv on concat(57,58,59), 32 features  D/8 x H/8 x W/8 x 32 
61  3x3x3 deconv, 32 features, stride 2  D/4 x H/4 x W/4 x 32 
62  Pred3: 3x3x3 conv on 61 + 38  D/4 x H/4 x W/4 x 2 
Disparity Regression  
63  Bilinear interp of Pred1, Pred2, Pred3  D x H x W x 2 
64  Soft argmax of 63 to get the three disparity predictions  H x W x 2 
The proposed refinement network described in Table 7 is inspired by the refinement procedures proposed in CRL [17], iResNet [12], StereoNet [8], and ActiveStereoNet [31]. We adopted the basic refinement architecture described in StereoNet [8], with dilated residual blocks [28] to increase the receptive field of filtering without compromising resolution. This technique was also adopted in recent work on optical flow prediction, PWC-Net [23]. We observed additional gains when using the photometric error and geometric error maps as inputs and co-training occlusion maps. To the best of our knowledge, such enhancements to the refinement procedure have not been proposed before.
D Effect of Refinement
Our refinement procedure not only improves the overall disparity error but also makes the prediction geometrically consistent. We calculate surface normal maps from disparity/depth maps using the approach described in KinectFusion [15]. We use a surface normal error metric to measure consistency in the disparity predictions (first-order derivative). Figures 13 and 14 visualize how our refinement procedure improves the overall structure of objects. In some cases, such as the first comparison in Fig. 13, we observe little improvement in disparity prediction but a large improvement in surface normals. Figure 14 shows real scene disparity and derived surface normal predictions and indicates that our refinement procedure works well on real-world data in the presence of shadows and dark lighting conditions. Dense 3D reconstruction methods such as KinectFusion [15] use surface normals to calculate fusion parameters and confidence weights, hence it is important to predict geometrically consistent disparity or normal maps for high quality 3D reconstruction.
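A minimal sketch of deriving surface normals from a depth map is shown below. It back-projects each pixel with a pinhole model and takes the cross product of the differences to the right and lower neighbours, in the spirit of the neighbouring-vertex scheme used by KinectFusion; the intrinsics, the angular error helper and all function names are assumptions for illustration.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Turn a depth map (list of rows, metres) into a map of 3D points."""
    return [[((u - cx) * z / fx, (v - cy) * z / fy, z)
             for u, z in enumerate(row)]
            for v, row in enumerate(depth)]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normal_map(points):
    """Unit normals from neighbouring-vertex differences; last row/column skipped."""
    normals = []
    for v in range(len(points) - 1):
        row = []
        for u in range(len(points[0]) - 1):
            p = points[v][u]
            du = tuple(a - b for a, b in zip(points[v][u + 1], p))
            dv = tuple(a - b for a, b in zip(points[v + 1][u], p))
            n = cross(du, dv)
            length = math.sqrt(sum(c * c for c in n))
            row.append(tuple(c / length for c in n))
        normals.append(row)
    return normals


def angular_error_deg(n1, n2):
    """Angle in degrees between two unit normals."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n1, n2))))
    return math.degrees(math.acos(dot))


# A fronto-parallel plane at 2 m: every normal should point along the optical axis.
flat = [[2.0] * 4 for _ in range(4)]
normals = normal_map(backproject(flat, fx=500.0, fy=500.0, cx=2.0, cy=2.0))
```

Because the normals depend on differences between neighbouring depths, small amounts of disparity noise that barely change the end point error can still produce large angular errors, which is why normals are a sensitive consistency measure.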
References
[1] Michael Bleyer, Christoph Rhemann, and Carsten Rother. PatchMatch stereo - stereo matching with slanted support windows. In BMVC, volume 11, pages 1–11, 2011.
[2] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
[3] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
 [4] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, 2008.
 [5] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE conference on computer vision and pattern recognition (CVPR), volume 2, page 6, 2017.
 [6] Eddy Ilg, Tonmoy Saikia, Margret Keuper, and Thomas Brox. Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In European Conference on Computer Vision (ECCV), 2018.
 [7] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, Peter Henry, Ryan Kennedy, Abraham Bachrach, and Adam Bry. End-to-end learning of geometry and context for deep stereo regression. CoRR, abs/1703.04309, 2017.
 [8] Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. arXiv preprint arXiv:1807.08865, 2018.
 [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [10] Andreas Klaus, Mario Sormann, and Konrad Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pages 15–18. IEEE, 2006.
 [11] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions using graph cuts. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 508–515. IEEE, 2001.
 [12] Zhengfa Liang, Yiliu Feng, Yulan Guo, Hengzhu Liu, Wei Chen, Linbo Qiao, Li Zhou, and Jianfeng Zhang. Learning for disparity estimation through feature constancy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2811–2820, 2018.
 [13] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [14] Richard Newcombe. Dense visual SLAM. PhD thesis, Imperial College London, UK, 2012.
 [15] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127–136. IEEE, 2011.
 [16] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2320–2327. IEEE, 2011.
 [17] Jiahao Pang, Wenxiu Sun, Jimmy SJ Ren, Chengxi Yang, and Qiong Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In ICCV Workshops, volume 7, 2017.
 [18] George Papandreou, Iasonas Kokkinos, and Pierre-André Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 390–399, 2015.
 [19] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4):259–268, 1992.
 [20] Daniel Scharstein and Richard Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3):7–42, 2002.
 [21] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [22] Xiao Song, Xu Zhao, Hanwen Hu, and Liangji Fang. Edgestereo: A context integrated residual pyramid network for stereo matching. arXiv preprint arXiv:1803.05196, 2018.
 [23] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
 [24] Stepan Tulyakov, Anton Ivanov, and Francois Fleuret. Practical deep stereo (PDS): Toward applications-friendly deep stereo matching. arXiv preprint arXiv:1806.01677, 2018.
 [25] Thomas Whelan, Michael Goesele, Steven J Lovegrove, Julian Straub, Simon Green, Richard Szeliski, Steven Butterfield, Shobhit Verma, and Richard Newcombe. Reconstructing scenes with mirror and glass surfaces. ACM Transactions on Graphics (TOG), 37(4):102, 2018.
 [26] Chen-Wei Xie, Hong-Yu Zhou, and Jianxin Wu. Vortex pooling: Improving context representation in semantic segmentation. arXiv preprint arXiv:1804.06242, 2018.
 [27] Guorun Yang, Hengshuang Zhao, Jianping Shi, Zhidong Deng, and Jiaya Jia. Segstereo: Exploiting semantic information for disparity estimation. arXiv preprint arXiv:1807.11699, 2018.
 [28] Fisher Yu, Vladlen Koltun, and Thomas A Funkhouser. Dilated residual networks. In CVPR, volume 2, page 3, 2017.
 [29] Christopher Zach, Thomas Pock, and Horst Bischof. A globally optimal algorithm for robust TV-L1 range image integration. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
 [30] Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(132):2, 2016.
 [31] Yinda Zhang, Sameh Khamis, Christoph Rhemann, Julien Valentin, Adarsh Kowdle, Vladimir Tankovich, Michael Schoenberg, Shahram Izadi, Thomas Funkhouser, and Sean Fanello. ActiveStereoNet: End-to-end self-supervised learning for active stereo systems. arXiv preprint arXiv:1807.06009, 2018.