UDFNet: Unsupervised Disparity Fusion with Adversarial Networks
Existing disparity fusion methods based on deep learning achieve state-of-the-art performance, but they require ground truth disparity data to train. As far as I know, this is the first time an unsupervised disparity fusion not using ground truth disparity data has been proposed. In this paper, a mathematical model for disparity fusion is proposed to guide an adversarial network to train effectively without ground truth disparity data. The initial disparity maps are inputted from the left view along with auxiliary information (gradient, left & right intensity image) into the refiner and the refiner is trained to output the refined disparity map registered on the left view. The refined left disparity map and left intensity image are used to reconstruct a fake right intensity image. Finally, the fake and real right intensity images (from the right stereo vision camera) are fed into the discriminator. In the model, the refiner is trained to output a refined disparity value close to the weighted sum of the disparity inputs for global initialisation. Then, three refinement principles are adopted to refine the results further. (1) The reconstructed intensity error between the fake and real right intensity image is minimised. (2) The similarities between the fake and real right image in different receptive fields are maximised. (3) The refined disparity map is smoothed based on the corresponding intensity image. The adversarial networks' architectures are effective for the fusion task. The fusion time using the proposed network is small. The network can achieve 90 fps using Nvidia Geforce GTX 1080Ti on the Kitti2015 dataset when the input resolution is 1242 * 375 (Width * Height) without downsampling and cropping. The accuracy of this work is equal to (or better than) the state-of-the-art supervised methods.READ FULL TEXT VIEW PDF
UDFNet: Unsupervised Disparity Fusion with Adversarial Networks
With the popularity of the disciplines related to 3D vision (eg: robotics, augmented and mixed reality, autonomous driving), how to get more accurate depth information using cheap devices in a 3D environment is important. Currently, there are many methods to obtain depth information, such as active illumination devices (eg: structured light cameras, Time of Flight (ToF) sensors), passive methods (monocular vision , stereo vision [2, 3, 4, 5]) and so on. However, none of these methods is perfect in all scenes. For example: ToF is sensitive to sunshine outdoors and reflectivity of the materials; Vision-based methods are sensitive to scene content (repetitive or textureless regions); Lidar-based devices are expensive and data is sparse and lacks color information. Thus, depth fusion from multiple sources is urgently needed, where different data sources can compensate for the weaknesses of each other.
In recent years, different kinds of depth fusion methods have emerged in different sub-tasks, such as stereo-ToF fusion ([6, 7, 8]), stereo-stereo fusion (), Lidar-stereo fusion ([10, 11]) and general depth fusion (). Additionally, deep-learning based methods perform much better than the rest. However, all of these algorithms are supervised and require much ground truth depth data to train. They suffer from two big problems: (1) Ground truth depth data is hard to get and expensive. (2) It is hard to generalize well with a limited amount of labeled data. As far as I know, the proposed algorithm in this paper is the first to develop a fully unsupervised depth fusion method, which solves the problems above and can fuse the depth inputs of different quality effectively without using ground truth depth data.
Unsupervised disparity fusion is hard because the algorithm requires to be able to produce an extremely accurate disparity map without any ground truth disparity data. The existing unsupervised strategy based on the left and right intensity consistency cannot guarantee a highly accurate disparity map. For example, Monodepth 
treated the left-right intensity consistency error as a global metric in their cost function to obtain the disparity value but slight intensity changes in the images will influence the global estimation strongly. However, the left-right intensity consistency is just one of our local refinement metrics, which increases the global robustness and accuracy in turn. Previous work, such as our Sdf-MAN, achieved top disparity fusion performance but these algorithms need the ground truth disparity data to train. By combining the global disparity initialization with local disparity refinement, we show that our network can be trained without any ground truth depth data.
In this paper, a fully unsupervised disparity fusion framework is proposed without the requirement of ground truth depth data. The initial disparity maps from the left view along with the auxiliary information (gradient, left & right intensity image) are input into the refiner (a network similar to the generator in GAN  but without noise input) and the refiner is made to output the refined disparity map registered to the left view. The refined left disparity map and left intensity image are used to reconstruct the fake right intensity image. Finally, the fake and real right intensity images (from the right stereo vision camera) are fed into the discriminator (See Fig. 1). In the model, the refiner is trained to output a refined disparity value close to the weighted sum of the disparity inputs for global initialization (Equation 1). Then, three refinement principles are adopted to refine the depth further. (1) The reconstructed intensity error between the fake and real right intensity image is minimized (Equation 2). (2) The similarities between the fake and real right image in different receptive fields are maximized (Equation 3). (3) The refined disparity map is smoothed based on the corresponding intensity image space (Equation 4).
A novel and efficient network structure has been designed as well (See Figure 2 for refiner, Figure 3 for discriminator). In the refiner, from the input layer to the bottleneck, the dense blocks and transition layers 
are used to increase local non-linearity to obtain more local detailed information. Long skip connections from previous layers to later layers are added to preserve the lost detail information after the bottleneck. The discriminator outputs the probability of the input image (real right image or reconstructed right image) being from the real distribution at different receptive field sizes (or different scales). Also, the dense blocks and transition layers have been adapted to increase the ability of the discriminator.
Section 2 presents the pipeline of the proposed work, the mathematical model used in the design of the objective function for the network and the architecture for the refiner and discriminator. Section 3 presents the experimental results.
An efficient unsupervised strategy by combining global disparity initialization and local refinement
An indirect method using an adversarial network to force the disparity Markov Random Field of the refined disparity map to be close to the real
An unsupervised end-to-end uncertainty-based pipeline to fuse any disparity input
First a pipeline using a refiner network (similar to the generator in GAN  without noise input) is proposed to realize disparity fusion and then a mathematical model is built to design the cost function for the fully unsupervised network. Finally, the refiner and discriminator architectures are introduced.
Similar to , a refiner network has been used to map coarse disparity inputs to a real disparity distribution deterministically using multi-modal information (disparity, intensity, intensity gradient). Unlike [9, 12, 8]
, a fully unsupervised method is used to train the neural network without ground truth. Inspired by, the disparity output from the left view is treated as a hidden layer and used to reconstruct the intensity image of the right view. The whole process is shown in Figure 1.
Using the reconstructed intensity error as a metric usually cannot guarantee a highly precise disparity output (eg: ), especially in environments with similar color and repetitive patterns (eg: garden). Additionally, the fusion accuracy may decrease compared with highly accurate disparity inputs if only the reconstructed intensity error is used as the decision metric. Thus, a more complex mathematical model is proposed as the cost function for disparity fusion.
The main ideas for the mathematical model:
The target is disparity fusion, whose accuracy should depend on the input disparity accuracy. The initial disparity maps should be used to provide the global initial value for the refined disparity map first. That is, the output of the refiner is encouraged to be similar to the input disparities (Equation 1).
The initialization based on Equation 1 can provide a coarse disparity map. The refinement will be realized by three local decision strategies discussed below together. The fake right intensity image is reconstructed from the left intensity image and disparity map. Thus, the accuracy of the refined disparity map can be assessed indirectly by comparing the reconstructed right image and real right image. The intensity error is designed based on the gradient in Equation 2 and the distance between the Markov Random Field of the refined disparity map and real disparity distribution is described in Equation 3
indirectly. A disparity smoothness term is also designed to reduce the outliers and strong noise in Equation4 using the gradient.
More specifically, the cost function has been designed as the following:
(1) Different from classical stereo vision algorithms without initial disparity inputs, disparity fusion is aimed at refinement of initial disparity inputs. Thus, a constraint is added that the output should be close to the weighted sum of the initial disparity inputs at each pixel. The output accuracy will increase with the accuracy enhancement of the initial disparity inputs.
In Equation 1, is the disparity value of a pixel of the initial disparity input (In Fig. 1, it is ‘disparity map 1’ or ‘disparity map 2’) corresponding to in the refined disparity output (In Fig. 1, it is ‘refined disparity map’). represents the distribution of the samples from the refiner. is the confidence of the pixel from the initial disparity input. If no prior knowledge is available, for all pixels. is the number of initial disparity inputs.
(2) To encourage disparity estimates at edges to be more accurate, gradient information is integrated as a weight into the distance to make the disparity edges clearer.
In Equation 2, is the real right intensity image from the right camera (In Fig. 1, it is ‘right intensity image’.) and is the reconstructed right intensity image from the refiner (In Fig. 1, it is ‘reconstructed right intensity image’). is the gradient of the gray image in the right view (In Fig. 1, it is ‘right gradient image’ calculated from Sobel operator). weights the gradient value. is distance. represents the distribution of the samples reconstructed from the left intensity image and corresponding refined disparity map. represents the distribution of the samples from the right camera in the stereo vision setting. The goal is to encourage disparity estimates at edges (larger gradients) to be more accurate with less reconstructed intensity error.
(3) The right intensity image is reconstructed from the left intensity image using the refined disparity output. Unlike , the reconstructed right intensity image and real right intensity image are input into the discriminator in this paper, which also gives indirect feedback about whether the refined disparity distribution is close to the ground truth. By making the discriminator output the probabilities at different receptive fields or scales (In Fig. 3, please refer to D1, D2,..,D5), the refiner will be forced to make the disparity distribution in the refined disparity map close to the real.
To alleviate training difficulties and avoid GAN mode collapse, the Improved WGAN loss function is adopted. In Equation 3, is the probability at the scale that the input image patch to the discriminator is from the real distribution at the scale. is the penalty coefficient. is the random sample and is its corresponding distribution. (For more details, please read ).
(4) After the strategies above, the neural network may produce outliers (black holes) in the texture-less areas. In order to suppress the outliers and noise in the refined disparity map, a gradient-based smoothness term is used to propagate more accurate disparity values to the areas with similar color by the assumption that the disparity in the neighborhood should be similar if the intensity is similar. However, this term tends to produce blurry edge in the refined disparity map.
In Equation 4, is the disparity of a pixel in the refined disparity map from the refiner. is the disparity value of a pixel in the neighborhood of pixel . is the gradient in the left intensity image (the refined disparity map is produced on the left view) from pixel to pixel . It is calculated from the left intensity image considering the diagonal, left and right directions. and are responsible for how close the disparities are if the intensities in the neighborhood are similar.
(5) Finally, our object function for the fully unsupervised approach is:
A fully convolutional neural network is adopted and also the partial architectures from [17, 14, 12] are adapted here for the refiner and discriminator. The refiner and discriminator use dense blocks  to increase local non-linearity. Transition layers 
change the size of the feature maps to reduce the time and space complexity. In each dense block and transition layer, modules of the form ReLu-BatchNorm-convolution are used. Two modules in the refiner and four modules in the discriminator in each dense block are used, where the filter size is 3
3 and stride is 1. In each transition layer, only one module is used, where the filter size is 44 and the stride is 2 (except that in Tran.3 of the discriminator the stride is 1).
Figure 2 shows the main architecture of the refiner. c1 initial disparity inputs (the experiments below use for 2 disparity maps) and c2 pieces of information (the experiments below use for the left intensity image, the right intensity image and a gradient of the right intensity image) are concatenated as input into the refiner. The batch size is b and resolution is 32m*32n. is the number of the feature map channels after the first convolution. The refined disparity map is treated as a hidden layer in the network and used to reconstruct the intensity image in the right view. To reduce the computational complexity and increase the extraction ability of local details, each dense block contains only 2 internal layers. Additionally, the skip connections  from the previous layers to the latter layers preserve the local details in order not to lose information after the network bottleneck. During training, a dropout strategy has been added into the layers in the refiner after the bottleneck to avoid overfitting and the dropout part is cancelled during testing to produce a deterministic result.
is for the discriminator. The discriminator will only be used during training and abandoned during testing. Thus, the architecture of the discriminator will only influence the computational complexity during training. The initial disparity maps, information and real or reconstructed right images are concatenated and fed into the discriminator. Each dense block contains 4 internal layers. The sigmoid function (function tf.sigmoid in Tensorflow platform) outputs the probability map () that the input is real or fake at different scales to force the Markov Random Field of the refined disparity map to get closer to the real distribution at different receptive field sizes.
The network is implemented using TensorFlow and trained & tested using an Intel Core i7-7820HK processor (quad-core, 8MB cache, up to 4.4GHZ) and Nvidia Geforce GTX 1080Ti. First, an ablation study with initial disparity inputs ( [3, 4]) is conducted using a synthetic garden dataset to analyze the influence of each factor in the energy function. Secondly, a real test on the Kitti2015 dataset  is done with two initial inputs ( [2, 5]). All the results show the proposed algorithm’s superiority compared with the state-of-art or classical stereo vision algorithms ( [2, 4, 3, 5]), and the state-of-the-art stereo-stereo fusion algorithms ( [12, 9]).
In the following experiments, the inputs to the neural network were first padded to 32M * 32N (M, N are integers) using 0 and normalized to [-1, 1]. After that, the input was flipped vertically with a 50% chance to double the number of training samples. Weights of all the neurons were initialized from a Gaussian distribution (standard deviation 0.02, mean 0). Each model is trained for 100 epochs on a synthetic garden dataset111It is not available to the public now. and 500 epochs on Kitti2015, with a batch size 4 using Adam  with a momentum of 0.5. The learning rate is changed from 0.005 to 0.0001 gradually. The method in  is used to optimize the refiner and discriminator by alternating between one step on the discriminator and then one step on the refiner. The parameters , , , in Equation 5 were set to make those four terms contribute differently to the energy function in the training process. If the difference of two initial disparity values on the same pixel is small (<0.3 pixels), a large value (0.99) is assigned to their confidence weight in Equation 1. If not, they were set to a medium value (0.5). Besides the confidence estimation above, some other special empirical confidence estimation for some disparity inputs were adopted in the following experiments (For more details, see the corresponding experiments). Additionally, I didn’t do any post-processing on the occlusion area and the other areas. The distance between the estimated value and ground truth is used as the error. The unit is pixel. For more details about the network settings and computational complexity, please see Table 1. To highlight the real test, the network can do the disparity fusion (up to 384*1248 pixels) directly at 90 HZ without any cropping or downsampling.
|Ablation Study with Synthetic Garden Dataset|
|Para.||Train time||Test time||b||32m||32n||,||lg||ld|
|Value||0.22 (h/epoch)||0.010 (s/frame)||4||384||480||2, 3, 2||32||32||10||0.1||0.001||1||3||650||5|
|Real Test with Kitti2015 Dataset|
|Para.||Train time||Test time||b||32m||32n||,||lg||ld|
|Value||0.03 (h/epoch)||0.011 (s/frame)||4||384||1248||2, 3, 2||10||10||1||20 to 1||0.0001 to 1K||1||3||1 to 3000||5|
|Experiment||Input1 ||Input2 ||Baseline|
|Tuning experience||overall accuracy||overall accuracy||contrast influence||smooth influence||initialized disparity||disparity MRF||accuracy at edge||smooth effect||smooth effect|
The synthetic garden dataset contains 4600 training samples and 421 test samples under outdoor environments. Each sample has one pair of rectified stereo images and dense ground truth with resolution 480*640 (height * width) pixels. The reason using a synthetic dataset is that the real dataset (eg: Kitti2015) does not have dense ground truth, which will influence the evaluation of the network. Dispnet  and FPGA-stereo  were used as inputs. The authors of [3, 4] helped us get the best performance on the dataset as the input to the network. Besides the confidence estimation strategy above, another special rule to Dispnet was added because the disparity map from Dispnet is very inaccurate in remote areas. That is, the pixels whose disparity is less than 4 (pixels) have a small confidence (0.1) and, otherwise, have a big confidence (0.9) because it is more accurate for close scenes. The default settings for the baseline network is shown in Table 1. In the ablation study, one of the following factors (, , , , , , ) in our energy function is changed to see the influence of each cue. The performance results are listed in Table 2. One example is Figure 4. As said in the introduction, the disparity constraint term (Equation 1) encourages the network to produce disparities close to the initial disparity inputs. It is the most important factor (228.31 pixels errors) because the other factors do not provide an accurate global initialization but mainly refine the disparity value in the local area using pixel-pixel intensity (,), smoothness (, , ) and Markov Random Field () . When tuning the network, the experience also followed the theory in the methodology section (Table 2).
To explore the influence of the input accuracy and the number of initial inputs, one or two initial disparity maps with different noise levels were input into the networks. The disparity inputs were got by adding different levels of Gaussian noise to the normalized ground truth (ground truth divided by 480). The confidence for each pixel in the experiments is proportional to the noise value under the assumption that accurate confidence estimation in the future can be obtained. The fusion accuracy (Table 3) increases with the number and accuracy of the disparity inputs. When there is only one input with , the proposed method still can refine the input disparity (0.77 <1 pixel ) because there are accurate confidence estimates. The proposed method sometimes fails to refine only one input disparity map when there are no confidence estimates (all confidences are 100%) and the mean error (<1 pixel) is too small. Otherwise, it works robustly and effectively.
|Fuse one input||Fuse two inputs|
The network was tested on the real Kitti2015 dataset, which used a Velodyne HDL-64E Lidar scanner to get the sparse ground truth and a 1242*375 resolution stereo camera to get stereo image pairs. The training dataset contains 400 unlabeled and labeled samples in all. There are another 400 samples in the test dataset. 50 samples from ‘000000_10.png’ to ‘000049_10.png’ in the Kitti2015 training dataset were used as our test dataset. 50 samples from ‘000050_50.png’ to ‘000099_10.png’ in the Kitti2015 training dataset were used as our validation dataset. The rest 700 samples were used as our training set. By flipping the training samples vertically, it doubled the number of training samples. The state-of-art stereo vision algorithm PSMNet  was used as one of our inputs. Their released pre-trained model222PSMNet : https://github.com/JiaRenChang/PSMNet on the Kitti2015 dataset was used to get the disparity maps. A traditional stereo vision algorithm SGM  was used as the second input to the network. Because the sparsity of SGM is not so important, its parameters were set to produce more reliable disparity maps. Thus, big confidence values (0.8) were assigned to its valid pixels and 0 to its invalid pixels’ confidences. More specifically, the implementation (‘disparity’ function) from Matlab2016b was used. The relevant parameters are: ’DisparityRange’ [0, 160], ‘BlockSize’ 5, ‘ContrastThreshold’ 0.99, ‘UniquenessThreshold’ 70, ‘DistanceThreshold’ 2. The settings of the neural network are shown in Table 1. The proposed algorithm was compared with the state-of-the-art technique [12, 9] in stereo-stereo fusion and also stereo vision inputs [2, 5]. The proposed method was compared with the supervised method in Sdf-MAN . The supervised method in Sdf-MAN was trained on our synthetic garden dataset first and then fine-tuned on the Kitti2015. 150 labeled samples from ‘00050_10.png’ to ‘000199_10.png’ in the initial training dataset were used for Sdf-MAN fine-tuning. Compared with SGM  (0.78 pixels) (This is a more accurate disparity but is calculated only using more reliable pixels. On average only 40% of the ground truth pixels are used. If all the valid ground truth are used to calculate its error, it is 22.13 pixels), the fused results are much more dense and accurate. The performance of the proposed algorithm (0.83 pixels) (See Table 4) is better than Sdf-MAN (1.17 pixels). The reason is because Sdf-MAN can’t generalize in the real environment well. However, the proposed algorithm is not affected by such problems because the unsupervised method can use the unlabeled data directly.
|Source Error||Fused Algorithm Error|
|SGM ||PSMNet ||DSF ||Sdf-MAN ||Ours|
For qualitative results, see Figure 5. The proposed method compensates for the weaknesses of the inputs and refines the initial disparity maps effectively. Compared with SGM , the fused results are much more dense and accurate. Compared with PSMNet, the proposed method preserves the details better (eg: mountain). But the proposed method fully fails in the sky. The reason is that the pixels (disparity = 0) are treated as invalid (confidence = 0) in SGM. However, the disparity values of the pixels in the sky area from PSMNet are all larger than 0 (confidence >0). So, the PSMNet misleads the network to adopt their disparity value as the initialization. Thus, the wrong confidence measurement can bring big error to the refined disparity map. The problem can be solved by adding more cues, such as semantic meaning, to make the confidence measurement more accurate.
The efficiency and effectiveness of the proposed network’s structure on the Kitti2015 dataset have been explored. The same settings from Table 1 except the channel (, ) of the features in each layer were used. The results are shown in Table 5. When , our network achieves the best accuracy (0.83 pixels) compared with the rest. Additionally, the four groups of experiments with different achieve similar performance, which shows the robustness of the proposed method.
|Structure||Error (pixel)||Time (s/frame)|
Our network can generalize to any sub-fusion task based on left-right consistency. A stereo-lidar fusion demo on Kitti2015 has also been done. The valid points in the initial ground truth were removed by half randomly to get the Lidar input to our network. The confidence of each valid Lidar data point was set as extremely large (100%). PSMNet was used as the stereo input again and set the confidence of each pixel as 50%. One hundred labeled images in the initial Kitti2015 training dataset were used to train (from ‘000100_10.png’ to ‘000199_10.png’ ) and the other 100 to test (from ‘000000_10.png’ to ‘000099_10.png’). The error of PSMNet  is 1.22 pixels and our error is 0.86 pixels after fusion.
We proposed an unsupervised method to fuse the disparity estimates of multiple state-of-art disparity/depth algorithms. The experiments have shown the effectiveness of the energy function design based on multiple cues and the efficiency of the network structure. The proposed network can be generalized to other fusion tasks based on left-right image consistency (In this paper, we only did stereo-stereo and stereo-lidar fusion). The work in this paper reduces the cost of acquiring labelled data necessary for use in a supervised method. Given the algorithm’s low computation cost, the combination of the proposed method and existing depth-acquisition algorithms is a good solution to obtaining high accuracy depth maps. Future work will investigate improved methods for setting the confidence values based on the initial disparity values and type of sensor.
Tensorflow: A system for large-scale machine learning.In OSDI, volume 16, pages 265–283, 2016.