UDFNet: Unsupervised Disparity Fusion with Adversarial Networks

04/22/2019 ∙ by Can Pu, et al. ∙ 8

Existing disparity fusion methods based on deep learning achieve state-of-the-art performance, but they require ground truth disparity data to train. As far as I know, this is the first time an unsupervised disparity fusion not using ground truth disparity data has been proposed. In this paper, a mathematical model for disparity fusion is proposed to guide an adversarial network to train effectively without ground truth disparity data. The initial disparity maps are inputted from the left view along with auxiliary information (gradient, left & right intensity image) into the refiner and the refiner is trained to output the refined disparity map registered on the left view. The refined left disparity map and left intensity image are used to reconstruct a fake right intensity image. Finally, the fake and real right intensity images (from the right stereo vision camera) are fed into the discriminator. In the model, the refiner is trained to output a refined disparity value close to the weighted sum of the disparity inputs for global initialisation. Then, three refinement principles are adopted to refine the results further. (1) The reconstructed intensity error between the fake and real right intensity image is minimised. (2) The similarities between the fake and real right image in different receptive fields are maximised. (3) The refined disparity map is smoothed based on the corresponding intensity image. The adversarial networks' architectures are effective for the fusion task. The fusion time using the proposed network is small. The network can achieve 90 fps using Nvidia Geforce GTX 1080Ti on the Kitti2015 dataset when the input resolution is 1242 * 375 (Width * Height) without downsampling and cropping. The accuracy of this work is equal to (or better than) the state-of-the-art supervised methods.



There are no comments yet.


page 3

page 9

page 11

Code Repositories


UDFNet: Unsupervised Disparity Fusion with Adversarial Networks

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

With the popularity of the disciplines related to 3D vision (eg: robotics, augmented and mixed reality, autonomous driving), how to get more accurate depth information using cheap devices in a 3D environment is important. Currently, there are many methods to obtain depth information, such as active illumination devices (eg: structured light cameras, Time of Flight (ToF) sensors), passive methods (monocular vision [1], stereo vision [2, 3, 4, 5]) and so on. However, none of these methods is perfect in all scenes. For example: ToF is sensitive to sunshine outdoors and reflectivity of the materials; Vision-based methods are sensitive to scene content (repetitive or textureless regions); Lidar-based devices are expensive and data is sparse and lacks color information. Thus, depth fusion from multiple sources is urgently needed, where different data sources can compensate for the weaknesses of each other.

In recent years, different kinds of depth fusion methods have emerged in different sub-tasks, such as stereo-ToF fusion ([6, 7, 8]), stereo-stereo fusion ([9]), Lidar-stereo fusion ([10, 11]) and general depth fusion ([12]). Additionally, deep-learning based methods perform much better than the rest. However, all of these algorithms are supervised and require much ground truth depth data to train. They suffer from two big problems: (1) Ground truth depth data is hard to get and expensive. (2) It is hard to generalize well with a limited amount of labeled data. As far as I know, the proposed algorithm in this paper is the first to develop a fully unsupervised depth fusion method, which solves the problems above and can fuse the depth inputs of different quality effectively without using ground truth depth data.

Unsupervised disparity fusion is hard because the algorithm requires to be able to produce an extremely accurate disparity map without any ground truth disparity data. The existing unsupervised strategy based on the left and right intensity consistency cannot guarantee a highly accurate disparity map. For example, Monodepth [1]

treated the left-right intensity consistency error as a global metric in their cost function to obtain the disparity value but slight intensity changes in the images will influence the global estimation strongly. However, the left-right intensity consistency is just one of our local refinement metrics, which increases the global robustness and accuracy in turn. Previous work, such as our Sdf-MAN 

[12], achieved top disparity fusion performance but these algorithms need the ground truth disparity data to train. By combining the global disparity initialization with local disparity refinement, we show that our network can be trained without any ground truth depth data.

In this paper, a fully unsupervised disparity fusion framework is proposed without the requirement of ground truth depth data. The initial disparity maps from the left view along with the auxiliary information (gradient, left & right intensity image) are input into the refiner (a network similar to the generator in GAN [13] but without noise input) and the refiner is made to output the refined disparity map registered to the left view. The refined left disparity map and left intensity image are used to reconstruct the fake right intensity image. Finally, the fake and real right intensity images (from the right stereo vision camera) are fed into the discriminator (See Fig. 1). In the model, the refiner is trained to output a refined disparity value close to the weighted sum of the disparity inputs for global initialization (Equation 1). Then, three refinement principles are adopted to refine the depth further. (1) The reconstructed intensity error between the fake and real right intensity image is minimized (Equation 2). (2) The similarities between the fake and real right image in different receptive fields are maximized (Equation 3). (3) The refined disparity map is smoothed based on the corresponding intensity image space (Equation 4).

A novel and efficient network structure has been designed as well (See Figure 2 for refiner, Figure 3 for discriminator). In the refiner, from the input layer to the bottleneck, the dense blocks and transition layers [14]

are used to increase local non-linearity to obtain more local detailed information. Long skip connections from previous layers to later layers are added to preserve the lost detail information after the bottleneck. The discriminator outputs the probability of the input image (real right image or reconstructed right image) being from the real distribution at different receptive field sizes (or different scales). Also, the dense blocks and transition layers have been adapted to increase the ability of the discriminator.

Section 2 presents the pipeline of the proposed work, the mathematical model used in the design of the objective function for the network and the architecture for the refiner and discriminator. Section 3 presents the experimental results.


  1. An efficient unsupervised strategy by combining global disparity initialization and local refinement

  2. An indirect method using an adversarial network to force the disparity Markov Random Field of the refined disparity map to be close to the real

  3. An unsupervised end-to-end uncertainty-based pipeline to fuse any disparity input

2 Methodology

First a pipeline using a refiner network (similar to the generator in GAN [13] without noise input) is proposed to realize disparity fusion and then a mathematical model is built to design the cost function for the fully unsupervised network. Finally, the refiner and discriminator architectures are introduced.

2.1 Fusion Pipeline

Similar to [12], a refiner network has been used to map coarse disparity inputs to a real disparity distribution deterministically using multi-modal information (disparity, intensity, intensity gradient). Unlike  [9, 12, 8]

, a fully unsupervised method is used to train the neural network without ground truth. Inspired by

[1], the disparity output from the left view is treated as a hidden layer and used to reconstruct the intensity image of the right view. The whole process is shown in Figure 1.

(a) Refiner
(b) Reconstruct right intensity image
(c) Negative examples
(d) Positive examples
Figure 1: (a): The inputs to the refiner (R) are the initial disparity maps (‘disparity map 1’, ‘disparity map 2’) in the left view and auxiliary information (left intensity image, right intensity image, right gradient image). The gradient of the left view is calculated from the left intensity image directly. The refiner produces a refined disparity map in the left view. (b): The refined left disparity map and left intensity image are used to reconstruct the right intensity image. (c)(d): The input to discriminator (D) is the combination of the left intensity image, right gradient image, the initial disparity inputs and the reconstructed/real right image. The discriminator tries to discriminate whether the input is fake (reconstructed right image) in (c) or true (real right image) in (d). The images in each block come from the training process on a synthetic garden dataset.

2.2 Objective Function

Using the reconstructed intensity error as a metric usually cannot guarantee a highly precise disparity output (eg: [1]), especially in environments with similar color and repetitive patterns (eg: garden). Additionally, the fusion accuracy may decrease compared with highly accurate disparity inputs if only the reconstructed intensity error is used as the decision metric. Thus, a more complex mathematical model is proposed as the cost function for disparity fusion.

The main ideas for the mathematical model:

The target is disparity fusion, whose accuracy should depend on the input disparity accuracy. The initial disparity maps should be used to provide the global initial value for the refined disparity map first. That is, the output of the refiner is encouraged to be similar to the input disparities (Equation 1).

The initialization based on Equation 1 can provide a coarse disparity map. The refinement will be realized by three local decision strategies discussed below together. The fake right intensity image is reconstructed from the left intensity image and disparity map. Thus, the accuracy of the refined disparity map can be assessed indirectly by comparing the reconstructed right image and real right image. The intensity error is designed based on the gradient in Equation 2 and the distance between the Markov Random Field of the refined disparity map and real disparity distribution is described in Equation 3

indirectly. A disparity smoothness term is also designed to reduce the outliers and strong noise in Equation 

4 using the gradient.

More specifically, the cost function has been designed as the following:

(1) Different from classical stereo vision algorithms without initial disparity inputs, disparity fusion is aimed at refinement of initial disparity inputs. Thus, a constraint is added that the output should be close to the weighted sum of the initial disparity inputs at each pixel. The output accuracy will increase with the accuracy enhancement of the initial disparity inputs.


In Equation 1, is the disparity value of a pixel of the initial disparity input (In Fig. 1, it is ‘disparity map 1’ or ‘disparity map 2’) corresponding to in the refined disparity output (In Fig. 1, it is ‘refined disparity map’). represents the distribution of the samples from the refiner. is the confidence of the pixel from the initial disparity input. If no prior knowledge is available, for all pixels. is the number of initial disparity inputs.

(2) To encourage disparity estimates at edges to be more accurate, gradient information is integrated as a weight into the distance to make the disparity edges clearer.


In Equation 2, is the real right intensity image from the right camera (In Fig. 1, it is ‘right intensity image’.) and is the reconstructed right intensity image from the refiner (In Fig. 1, it is ‘reconstructed right intensity image’). is the gradient of the gray image in the right view (In Fig. 1, it is ‘right gradient image’ calculated from Sobel operator). weights the gradient value. is distance. represents the distribution of the samples reconstructed from the left intensity image and corresponding refined disparity map. represents the distribution of the samples from the right camera in the stereo vision setting. The goal is to encourage disparity estimates at edges (larger gradients) to be more accurate with less reconstructed intensity error.

(3) The right intensity image is reconstructed from the left intensity image using the refined disparity output. Unlike [12], the reconstructed right intensity image and real right intensity image are input into the discriminator in this paper, which also gives indirect feedback about whether the refined disparity distribution is close to the ground truth. By making the discriminator output the probabilities at different receptive fields or scales (In Fig. 3, please refer to D1, D2,..,D5), the refiner will be forced to make the disparity distribution in the refined disparity map close to the real.

To alleviate training difficulties and avoid GAN mode collapse, the Improved WGAN loss function  

[15] is adopted. In Equation 3, is the probability at the scale that the input image patch to the discriminator is from the real distribution at the scale. is the penalty coefficient. is the random sample and is its corresponding distribution. (For more details, please read [15]).


(4) After the strategies above, the neural network may produce outliers (black holes) in the texture-less areas. In order to suppress the outliers and noise in the refined disparity map, a gradient-based smoothness term is used to propagate more accurate disparity values to the areas with similar color by the assumption that the disparity in the neighborhood should be similar if the intensity is similar. However, this term tends to produce blurry edge in the refined disparity map.


In Equation 4, is the disparity of a pixel in the refined disparity map from the refiner. is the disparity value of a pixel in the neighborhood of pixel . is the gradient in the left intensity image (the refined disparity map is produced on the left view) from pixel to pixel . It is calculated from the left intensity image considering the diagonal, left and right directions. and are responsible for how close the disparities are if the intensities in the neighborhood are similar.

(5) Finally, our object function for the fully unsupervised approach is:


In Equation 5, is the number of the scales (In Fig. 3, M=5, please refer to D1, D2, .., D5). , , , are the weights for the different loss terms.

2.3 Network architectures

A fully convolutional neural network 

[16] is adopted and also the partial architectures from [17, 14, 12] are adapted here for the refiner and discriminator. The refiner and discriminator use dense blocks [14] to increase local non-linearity. Transition layers [14]

change the size of the feature maps to reduce the time and space complexity. In each dense block and transition layer, modules of the form ReLu-BatchNorm-convolution are used. Two modules in the refiner and four modules in the discriminator in each dense block are used, where the filter size is 3

3 and stride is 1. In each transition layer, only one module is used, where the filter size is 4

4 and the stride is 2 (except that in Tran.3 of the discriminator the stride is 1).

Figure 2 shows the main architecture of the refiner. c1 initial disparity inputs (the experiments below use for 2 disparity maps) and c2 pieces of information (the experiments below use for the left intensity image, the right intensity image and a gradient of the right intensity image) are concatenated as input into the refiner. The batch size is b and resolution is 32m*32n. is the number of the feature map channels after the first convolution. The refined disparity map is treated as a hidden layer in the network and used to reconstruct the intensity image in the right view. To reduce the computational complexity and increase the extraction ability of local details, each dense block contains only 2 internal layers. Additionally, the skip connections [18] from the previous layers to the latter layers preserve the local details in order not to lose information after the network bottleneck. During training, a dropout strategy has been added into the layers in the refiner after the bottleneck to avoid overfitting and the dropout part is cancelled during testing to produce a deterministic result.

Figure 2:

This figure shows some important hyperparameters and the refiner architecture configuration. Please refer to Table 

1 for the specific values in each experiment.

Figure 3

is for the discriminator. The discriminator will only be used during training and abandoned during testing. Thus, the architecture of the discriminator will only influence the computational complexity during training. The initial disparity maps, information and real or reconstructed right images are concatenated and fed into the discriminator. Each dense block contains 4 internal layers. The sigmoid function (function tf.sigmoid in Tensorflow platform) outputs the probability map (

) that the input is real or fake at different scales to force the Markov Random Field of the refined disparity map to get closer to the real distribution at different receptive field sizes.

Figure 3: This figure shows some important hyperparameters and the discriminator architecture configuration. Please refer to Table 1 for the specific values in each experiment.

3 Experiments

The network is implemented using TensorFlow[19] and trained & tested using an Intel Core i7-7820HK processor (quad-core, 8MB cache, up to 4.4GHZ) and Nvidia Geforce GTX 1080Ti. First, an ablation study with initial disparity inputs ( [3, 4]) is conducted using a synthetic garden dataset to analyze the influence of each factor in the energy function. Secondly, a real test on the Kitti2015 dataset [20] is done with two initial inputs ( [2, 5]). All the results show the proposed algorithm’s superiority compared with the state-of-art or classical stereo vision algorithms ( [2, 4, 3, 5]), and the state-of-the-art stereo-stereo fusion algorithms ( [12, 9]).

In the following experiments, the inputs to the neural network were first padded to 32M * 32N (M, N are integers) using 0 and normalized to [-1, 1]. After that, the input was flipped vertically with a 50% chance to double the number of training samples. Weights of all the neurons were initialized from a Gaussian distribution (standard deviation 0.02, mean 0). Each model is trained for 100 epochs on a synthetic garden dataset

111It is not available to the public now. and 500 epochs on Kitti2015, with a batch size 4 using Adam [21] with a momentum of 0.5. The learning rate is changed from 0.005 to 0.0001 gradually. The method in [13] is used to optimize the refiner and discriminator by alternating between one step on the discriminator and then one step on the refiner. The parameters , , , in Equation 5 were set to make those four terms contribute differently to the energy function in the training process. If the difference of two initial disparity values on the same pixel is small (<0.3 pixels), a large value (0.99) is assigned to their confidence weight in Equation 1. If not, they were set to a medium value (0.5). Besides the confidence estimation above, some other special empirical confidence estimation for some disparity inputs were adopted in the following experiments (For more details, see the corresponding experiments). Additionally, I didn’t do any post-processing on the occlusion area and the other areas. The distance between the estimated value and ground truth is used as the error. The unit is pixel. For more details about the network settings and computational complexity, please see Table 1. To highlight the real test, the network can do the disparity fusion (up to 384*1248 pixels) directly at 90 HZ without any cropping or downsampling.

Ablation Study with Synthetic Garden Dataset
Para. Train time Test time b 32m 32n , lg ld
Value 0.22 (h/epoch) 0.010 (s/frame) 4 384 480 2, 3, 2 32 32 10 0.1 0.001 1 3 650 5
Real Test with Kitti2015 Dataset
Para. Train time Test time b 32m 32n , lg ld
Value 0.03 (h/epoch) 0.011 (s/frame) 4 384 1248 2, 3, 2 10 10 1 20 to 1 0.0001 to 1K 1 3 1 to 3000 5
Table 1: Computation Time and Initial Parameter Setting
Experiment Input1 [3] Input2 [4] Baseline
Error (pixel) 4.67 8.52 3.53 3.94 228.31 3.70 3.55 5.86 3.67 3.45
Tuning experience overall accuracy overall accuracy contrast influence smooth influence initialized disparity disparity MRF accuracy at edge smooth effect smooth effect
Table 2: Ablation Study on Each Cue (Unit: Pixel)

3.1 Ablation Study

(a) Scene
(b) Ground Truth
(c) FPGA Stereo [4]
(d) Dispnet [3]
(e) Fused Disparity
Figure 4: Two initial disparity inputs (c)(d) were fused to get a refined disparity map (e) using our baseline method on the synthetic garden dataset. (b) is the ground truth and (a) is the corresponding scene. Many, but not all, pixels from the fused result are closer to the ground truth than the original inputs. Tip: Zooming in the electronic version gives a better view.

The synthetic garden dataset contains 4600 training samples and 421 test samples under outdoor environments. Each sample has one pair of rectified stereo images and dense ground truth with resolution 480*640 (height * width) pixels. The reason using a synthetic dataset is that the real dataset (eg: Kitti2015) does not have dense ground truth, which will influence the evaluation of the network. Dispnet [3] and FPGA-stereo [4] were used as inputs. The authors of  [3, 4] helped us get the best performance on the dataset as the input to the network. Besides the confidence estimation strategy above, another special rule to Dispnet was added because the disparity map from Dispnet is very inaccurate in remote areas. That is, the pixels whose disparity is less than 4 (pixels) have a small confidence (0.1) and, otherwise, have a big confidence (0.9) because it is more accurate for close scenes. The default settings for the baseline network is shown in Table 1. In the ablation study, one of the following factors (, , , , , , ) in our energy function is changed to see the influence of each cue. The performance results are listed in Table 2. One example is Figure 4. As said in the introduction, the disparity constraint term (Equation 1) encourages the network to produce disparities close to the initial disparity inputs. It is the most important factor (228.31 pixels errors) because the other factors do not provide an accurate global initialization but mainly refine the disparity value in the local area using pixel-pixel intensity (,), smoothness (, , ) and Markov Random Field () . When tuning the network, the experience also followed the theory in the methodology section (Table 2).

To explore the influence of the input accuracy and the number of initial inputs, one or two initial disparity maps with different noise levels were input into the networks. The disparity inputs were got by adding different levels of Gaussian noise to the normalized ground truth (ground truth divided by 480). The confidence for each pixel in the experiments is proportional to the noise value under the assumption that accurate confidence estimation in the future can be obtained. The fusion accuracy (Table 3) increases with the number and accuracy of the disparity inputs. When there is only one input with , the proposed method still can refine the input disparity (0.77 <1 pixel ) because there are accurate confidence estimates. The proposed method sometimes fails to refine only one input disparity map when there are no confidence estimates (all confidences are 100%) and the mean error (<1 pixel) is too small. Otherwise, it works robustly and effectively.

Fuse one input Fuse two inputs
Noise std. input1 fused input1 input2 fused
0.77 0.69 0.77 0.77 0.64
1.53 1.13 1.53 1.53 0.71
3.06 1.58 3.06 3.06 0.80
6.12 3.05 6.12 6.12 1.33
Table 3: Average error (pixel) on Synthetic Garden

3.2 Real Data

3.2.1 Stereo-stereo Fusion

The network was tested on the real Kitti2015 dataset, which used a Velodyne HDL-64E Lidar scanner to get the sparse ground truth and a 1242*375 resolution stereo camera to get stereo image pairs. The training dataset contains 400 unlabeled and labeled samples in all. There are another 400 samples in the test dataset. 50 samples from ‘000000_10.png’ to ‘000049_10.png’ in the Kitti2015 training dataset were used as our test dataset. 50 samples from ‘000050_50.png’ to ‘000099_10.png’ in the Kitti2015 training dataset were used as our validation dataset. The rest 700 samples were used as our training set. By flipping the training samples vertically, it doubled the number of training samples. The state-of-art stereo vision algorithm PSMNet [2] was used as one of our inputs. Their released pre-trained model222PSMNet [2]: https://github.com/JiaRenChang/PSMNet on the Kitti2015 dataset was used to get the disparity maps. A traditional stereo vision algorithm SGM [5] was used as the second input to the network. Because the sparsity of SGM is not so important, its parameters were set to produce more reliable disparity maps. Thus, big confidence values (0.8) were assigned to its valid pixels and 0 to its invalid pixels’ confidences. More specifically, the implementation (‘disparity’ function) from Matlab2016b was used. The relevant parameters are: ’DisparityRange’ [0, 160], ‘BlockSize’ 5, ‘ContrastThreshold’ 0.99, ‘UniquenessThreshold’ 70, ‘DistanceThreshold’ 2. The settings of the neural network are shown in Table 1. The proposed algorithm was compared with the state-of-the-art technique [12, 9] in stereo-stereo fusion and also stereo vision inputs [2, 5]. The proposed method was compared with the supervised method in Sdf-MAN [12]. The supervised method in Sdf-MAN was trained on our synthetic garden dataset first and then fine-tuned on the Kitti2015. 150 labeled samples from ‘00050_10.png’ to ‘000199_10.png’ in the initial training dataset were used for Sdf-MAN fine-tuning. Compared with SGM [5] (0.78 pixels) (This is a more accurate disparity but is calculated only using more reliable pixels. On average only 40% of the ground truth pixels are used. If all the valid ground truth are used to calculate its error, it is 22.13 pixels), the fused results are much more dense and accurate. The performance of the proposed algorithm (0.83 pixels) (See Table 4) is better than Sdf-MAN (1.17 pixels). The reason is because Sdf-MAN can’t generalize in the real environment well. However, the proposed algorithm is not affected by such problems because the unsupervised method can use the unlabeled data directly.

Source Error Fused Algorithm Error
SGM [5] PSMNet [2] DSF [9] Sdf-MAN [12] Ours
0.78 1.22 1.20 1.17 0.83
Table 4: Average error (pixel) on Kitti2015
(a) Ground Truth
(b) Scene
(c) Input Disparity 1: SGM [5]
(d) Input Disparity 1 Error: SGM [5]
(e) Input Disparity 2: PSMNet [2]
(f) Input Disparity 2 Error: PSMNet [2]
(g) Refined Disparity
(h) Refined Disparity Error
Figure 5: The unsupervised adversarial network was trained to fuse the initial disparity maps (c), (e) into a refined disparity map (g) for the same scene (b) from the Kitti2015 dataset [20]. (a) is the corresponding ground truth. (d), (f), (h) are the errors of (c), (e), (g). The colorbars (from blue to white) corresponds to 0 - 1.6 pixels and the lighter pixel have bigger error in (d), (f), (h).

For qualitative results, see Figure 5. The proposed method compensates for the weaknesses of the inputs and refines the initial disparity maps effectively. Compared with SGM [5], the fused results are much more dense and accurate. Compared with PSMNet, the proposed method preserves the details better (eg: mountain). But the proposed method fully fails in the sky. The reason is that the pixels (disparity = 0) are treated as invalid (confidence = 0) in SGM. However, the disparity values of the pixels in the sky area from PSMNet are all larger than 0 (confidence >0). So, the PSMNet misleads the network to adopt their disparity value as the initialization. Thus, the wrong confidence measurement can bring big error to the refined disparity map. The problem can be solved by adding more cues, such as semantic meaning, to make the confidence measurement more accurate.

The efficiency and effectiveness of the proposed network’s structure on the Kitti2015 dataset have been explored. The same settings from Table 1 except the channel (, ) of the features in each layer were used. The results are shown in Table 5. When , our network achieves the best accuracy (0.83 pixels) compared with the rest. Additionally, the four groups of experiments with different achieve similar performance, which shows the robustness of the proposed method.

Structure Error (pixel) Time (s/frame)
1.00 0.009
0.83 0.011
0.92 0.014
0.87 0.017
Table 5: Average error (pixel) on Kitti2015

3.2.2 Stereo-Lidar Fusion

Our network can generalize to any sub-fusion task based on left-right consistency. A stereo-lidar fusion demo on Kitti2015 has also been done. The valid points in the initial ground truth were removed by half randomly to get the Lidar input to our network. The confidence of each valid Lidar data point was set as extremely large (100%). PSMNet was used as the stereo input again and set the confidence of each pixel as 50%. One hundred labeled images in the initial Kitti2015 training dataset were used to train (from ‘000100_10.png’ to ‘000199_10.png’ ) and the other 100 to test (from ‘000000_10.png’ to ‘000099_10.png’). The error of PSMNet [2] is 1.22 pixels and our error is 0.86 pixels after fusion.

4 Conclusion

We proposed an unsupervised method to fuse the disparity estimates of multiple state-of-art disparity/depth algorithms. The experiments have shown the effectiveness of the energy function design based on multiple cues and the efficiency of the network structure. The proposed network can be generalized to other fusion tasks based on left-right image consistency (In this paper, we only did stereo-stereo and stereo-lidar fusion). The work in this paper reduces the cost of acquiring labelled data necessary for use in a supervised method. Given the algorithm’s low computation cost, the combination of the proposed method and existing depth-acquisition algorithms is a good solution to obtaining high accuracy depth maps. Future work will investigate improved methods for setting the confidence values based on the initial disparity values and type of sensor.


  • [1] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, volume 2, page 7, 2017.
  • [2] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pages 5410–5418, 2018.
  • [3] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  • [4] Dominik Honegger, Torsten Sattler, and Marc Pollefeys. Embedded real-time multi-baseline stereo. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 5245–5250. IEEE, 2017.
  • [5] Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 807–814. IEEE, 2005.
  • [6] Carlo Dal Mutto, Pietro Zanuttigh, and Guido Maria Cortelazzo. Probabilistic tof and stereo data fusion based on mixed pixels measurement models. IEEE transactions on pattern analysis and machine intelligence, 37(11):2260–2272, 2015.
  • [7] Giulio Marin, Pietro Zanuttigh, and Stefano Mattoccia. Reliable fusion of tof and stereo depth driven by confidence measures. In European Conference on Computer Vision, pages 386–401. Springer, 2016.
  • [8] Gianluca Agresti, Ludovico Minto, Giulio Marin, and Pietro Zanuttigh. Deep learning for confidence information in stereo and tof data fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 697–705, 2017.
  • [9] Matteo Poggi and Stefano Mattoccia. Deep stereo fusion: combining multiple disparity hypotheses with deep-learning. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 138–147. IEEE, 2016.
  • [10] Will Maddern and Paul Newman. Real-time probabilistic fusion of sparse 3d lidar and dense stereo. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pages 2181–2188. IEEE, 2016.
  • [11] Kihong Park, Seungryong Kim, and Kwanghoon Sohn. High-precision depth estimation with the 3d lidar and stereo fusion. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2156–2163. IEEE, 2018.
  • [12] Can Pu, Runzi Song, Radim Tylecek, Nanbo Li, and Robert B Fisher. Sdf-man: Semi-supervised disparity fusion with multi-scale adversarial networks. Remote Sensing, 11(5):487, 2019.
  • [13] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [14] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [15] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5769–5779, 2017.
  • [16] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [17] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [19] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.

    Tensorflow: A system for large-scale machine learning.

    In OSDI, volume 16, pages 265–283, 2016.
  • [20] Moritz Menze, Christian Heipke, and Andreas Geiger. Joint 3d estimation of vehicles and scene flow. In ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
  • [21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.