1 Introduction
Understanding the confidence level of a prediction is a critical part of deep learning. Since a model has numerous parameters, it is often hard to tell whether a trained model is making sensible predictions or just guessing at random. With recent engineering advances in machine learning, systems that were once applied only to toy data are now being deployed in real-life settings. Among these settings are scenarios in which control is handed over to automated systems, including automated decision making, recommendation systems in the medical domain, autonomous control of drones and self-driving cars, and control of critical systems.
In particular, for deep regression tasks, state-of-the-art performance [12, 8] is achieved with Deep Neural Networks (DNNs). However, without confidence measures, the predictions are often assumed to be accurate, which is not always the case. Taking the stereo matching problem as an example, even the best trained model on KITTI Stereo 2015 [11] has a nonzero error rate, measured by the percentage of outliers over all ground-truth pixels, so some pixels are still wrongly predicted. If those pixels fall on critical objects, such as a thin rail, this could be dangerous for depth-assisted obstacle avoidance systems or advanced driver-assistance systems. Moreover, worse or unreasonable predictions can be observed when the model is tested on a different dataset or on degraded inputs.
While confidence can be manually designed from a list of handcrafted rules for classical methods, it is not straightforward to design such rules in deep learning. Both noisy data and an unsuited model can degrade the performance of stereo matching with deep learning. Noisy data includes falsely labeled ground truth, noise-corrupted input, and ill-posed regions (occluded regions, repeated patterns, featureless regions, etc.). An unsuited model includes a poorly designed network structure or a poorly trained model (underfitting or overfitting).
In this paper, we start from a probabilistic interpretation of the L1 loss used in stereo matching, which inherently assumes an independent and identically distributed (i.i.d.) Laplacian distribution. Intuitively, there is a strong correlation between the variance of the Laplacian distribution and the confidence level of an arbitrary pixel: the variance is large for low-confidence pixels and small for high-confidence pixels. By introducing the confidence as an additional variable with a parameterized distribution in our new formulation, we show that the identical distribution assumption is relaxed. Interestingly, this leads to a new loss function, where 1) the original loss is attenuated in low-confidence regions, reducing their influence on other pixels during backpropagation; and 2) the confidence is penalized for being low by a confidence regularization term. The network structure is shown in Figure 1. In practice, the network learns to attenuate low-confidence pixels (e.g., noisy input, occlusions, featureless regions), as the attenuated loss produces a relatively low cost. Meanwhile, the network focuses more on high-confidence pixels, as the confidence regularization term trades off the cost. Moreover, when the network is deployed to another stereo matching dataset, experiments show that the focused learning helps find a better convergence state of the trained model, reducing overfitting on a given dataset. Different from [6], which implicitly treats the confidence distribution as a uniform distribution (a special case of our formulation), we also study how different confidence distributions affect the performance of a model. The main contributions of this paper can be summarized as follows:
We propose a confidence inference module which does not require ground-truth confidence labels. The inferred confidence has a physical meaning and can be employed to facilitate decision-making or post-processing tasks.

We show that with the newly introduced confidence, the identical Laplacian distribution assumption is relaxed. In particular, the variance of the Laplacian distribution is large for low-confidence pixels and small for high-confidence pixels.

We observe from experiments that the proposed method helps find a better convergence state of the trained model, reducing overfitting on a given dataset.
2 Related Works
A taxonomy of confidence measures has been proposed by Hu and Mordohai [4]. They categorize 17 confidence prediction methods in stereo matching into 6 categories according to the cues exploited in confidence prediction. Moreover, they proposed an effective metric to assess confidence prediction based on Area Under the Curve (AUC) analysis. More recently, Poggi et al. [15] presented an updated review and quantitative evaluation of 52 state-of-the-art confidence prediction methods, including some deep learning methods. In this section, we review the most related works from a different viewpoint, based on their methodology.
2.1 Confidence prediction with handcrafted cues
In this category, the confidence is predicted by a defined metric based on expert knowledge. Egnal et al. [1] proposed to use the negative cost as a simple confidence measure, so that large values correspond to higher confidence. They also proposed to use the shape of the cost curve around the minimum as an indication of confidence. The winner margin, proposed by Scharstein and Szeliski [17], compares the two smallest local minima of the matching cost. Left-right consistency, proposed by Egnal et al. [1], is defined as the absolute difference between the disparity in the left image and the corresponding disparity in the right image. Seki and Pollefeys [18] proposed to infer a confidence measure by processing features extracted from the left and right disparity maps. Distinctiveness [9] can also be used as a cue for confidence prediction, as done by Yoon et al. [23]. Their assumption is that distinctive points (e.g., edges and corners) are less likely to be falsely matched between reference and target images. Xu et al. [22] treat occlusion regions as low-confidence regions and exclude them from the stereo matching process for better performance.
2.2 Confidence learning with inference learning
Handcrafted cues apply mostly to classical stereo matching methods for confidence prediction. However, it is not easy to apply such rules directly in deep learning. Instead, regression and classification methods for confidence prediction have been proposed. From the perspective of machine learning, Haeusler et al. [3] proposed a confidence prediction method that classifies feature vectors built from 23 handcrafted confidence measures. Similarly, Park et al. [13] trained regression forests to predict the correctness (confidence) of a match using selected confidence measures. Spyropoulos et al. [20] proposed to predict confidence by using a random forest to classify a set of handcrafted features for each pixel. Meanwhile, in the domain of deep learning, Poggi and Mattoccia [14] classified the predicted disparity map into confident and unconfident patches. Similarly, Shaked et al. [19] proposed a reflective loss which is used to classify disparity predictions into binary labels. Gurevich et al. [2] simultaneously trained two neural networks with a joint loss function. One of the networks performs predictions, and the other quantifies the uncertainty of those predictions by estimating the locally averaged loss of the first one. Ummenhofer et al. [21] proposed a supervised approach to regress a confidence, where the supervision is generated during training based on the disparity prediction error.
2.3 Confidence prediction with probabilistic modeling
Another category of methods addresses the problem from a probabilistic point of view. Based on the work of Zhang and Shan [24], which calculates costs with a similarity function by treating the value assigned to each potential disparity as a probability for that disparity, Hu and Mordohai [4] obtained a simple confidence measure from normalized cost values, as they do not attempt to convert cost to an exact confidence. Recently, Kendall and Gal [6] proposed to learn epistemic and aleatoric uncertainty for Bayesian Neural Networks, which relates to confidence measures.
Our proposed method is a deep learning approach and belongs to the confidence prediction methods with probabilistic modeling. In particular, we assume that the predicted disparity has a large variance when it is less confident and a smaller variance when it is more confident. By maximizing the likelihood of the predicted disparity and confidence, we derive a simple yet very effective loss to simultaneously learn disparity and its confidence.
3 Methodology
Deep learning is not perfect. It is often observed that a model well trained on one dataset may fail easily on another. In order to make ourselves or the control systems aware of when and where the deployed model would fail, we aim to infer a dense confidence map for the predicted disparity in stereo matching.
3.1 What is Confidence?
Predicting a confidence is not straightforward, given that a) the definition of confidence is subjective; and b) there is no ground truth available for supervised confidence training. Thus, before estimating a dense confidence map, let us first make clear what target confidence we want to generate:

The confidence should be high for correct regions and low for error regions.

The confidence values are in the range [0, 1].
3.2 Probabilistic Interpretation
To infer the confidence, let us start from a probabilistic interpretation of the L1 loss used in stereo matching [10, 12], which inherently assumes an independent and identically distributed (i.i.d.) Laplacian distribution. Given the network input, the network predicts a disparity map, usually with one disparity value per pixel of the input images. The optimal model parameter is found by maximizing the following likelihood function,
(1) 
Assuming the observed disparity values follow an identical Laplacian distribution,
(2) 
and the model parameter is independent of the input and follows another Laplacian distribution with zero mean and variance equal to one,
(3) 
Substituting (2) and (3) into (1) and taking the negative log-likelihood,
(6)  
where the coefficient of the parameter term is a scalar corresponding to the weight decay used during model training. Conventionally, a normalizer is added to rescale the loss function and gradient. The loss function is now defined as,
(7) 
which corresponds to the commonly used L1 loss function in stereo matching, with weight regularization.
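As a sketch of the derivation in this section, the steps can be written out under assumed notation: $x_i$ and $d_i$ for the input and disparity at pixel $i$, $N$ for the number of pixels, $f(x_i; W)$ for the network prediction with parameters $W$, $b$ and $b_W$ for the Laplacian scales of the likelihood and of the parameter prior, and $\lambda$ for the weight-decay scalar. These symbols and the way constants are absorbed are assumptions made for illustration, and need not match equations (1) to (7) exactly.

```latex
% Assumed notation; a sketch of the standard argument, not the exact equations (1)-(7).
\begin{align*}
W^{*} &= \arg\max_{W} \; P(W) \prod_{i=1}^{N} P(d_i \mid x_i, W), \\
P(d_i \mid x_i, W) &\propto \exp\!\left(-\frac{\lvert d_i - f(x_i; W)\rvert}{b}\right),
\qquad
P(W) \propto \exp\!\left(-\frac{\lVert W \rVert_1}{b_W}\right), \\
-\log\!\Big( P(W) \prod_{i=1}^{N} P(d_i \mid x_i, W) \Big)
 &= \frac{1}{b} \sum_{i=1}^{N} \lvert d_i - f(x_i; W)\rvert
  + \frac{1}{b_W} \lVert W \rVert_1 + \text{const}, \\
\mathcal{L}(W) &= \frac{1}{N} \sum_{i=1}^{N} \lvert d_i - f(x_i; W)\rvert
  + \lambda \lVert W \rVert_1 .
\end{align*}
```

In this sketch the last line follows from the third by multiplying by $b/N$ and dropping the constant, which gives $\lambda = b / (N\, b_W)$.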
3.3 Confidence Learning
Let us introduce the confidence map, with one confidence value in [0, 1] per pixel, as a new random variable. Based on the confidence properties discussed in Section 3.1, the confidence should be high for correct regions and low for error regions. We therefore assume that the variance of the Laplacian distribution is large for low-confidence pixels and small for high-confidence pixels. For simplicity, we set the variance as a linearly decreasing function of the confidence,
(8)
where the two coefficients of this linear function are positive constants, constrained such that the variance always stays within a fixed range. With the newly introduced confidence, the likelihood function in (1) changes to,
(10)  
Intuitively, it is preferable that the confidence follows a non-decreasing distribution; we elaborate on this in Sec. 3.4. For simplicity, let us define the probability density function of the confidence as follows, though other non-decreasing functions also apply.
(11) 
where the density is controlled by a single parameter. Taking the negative log-likelihood of (10),
(12)  
(14)  
(16)  
Similarly, we multiply by a normalizer, and the new loss function is defined as,
(19)  
Compared to (7), there are two key differences. First, the L1 loss becomes a focused loss: for high-confidence pixels the loss is unchanged, while for low-confidence pixels the loss is attenuated. Therefore, the first term focuses more on confident pixels. Second, a new regularization term, called confidence regularization, is introduced, which penalizes low confidence. In addition, the loss function is fully differentiable with respect to the confidence, so the confidence is learned inherently.
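One way to make these two differences concrete is to write a per-pixel negative log-likelihood under explicit assumptions. The sketch below assumes a Laplacian scale $b_i = a - k c_i$ that decreases linearly with the confidence $c_i \in (0, 1]$, with $a - k = 1$ so that the loss is unchanged at $c_i = 1$, and a non-decreasing confidence density $p(c_i) \propto c_i^{\gamma}$ with $\gamma \ge 0$. The symbols $a$, $k$, $\gamma$, $e_i$ and the power-law density are illustrative assumptions, not necessarily the forms used in equations (8) to (19).

```latex
% Assumed forms for illustration only.
\begin{align*}
e_i &= \lvert d_i - f(x_i; W) \rvert, \qquad
b_i = a - k c_i, \qquad
p(c_i) \propto c_i^{\gamma}, \\
-\log\!\big( p(d_i \mid x_i, c_i, W)\, p(c_i) \big)
  &= \underbrace{\frac{e_i}{a - k c_i}}_{\text{focused loss}}
   + \underbrace{\log(a - k c_i) - \gamma \log c_i}_{\text{confidence-dependent terms}}
   + \text{const}.
\end{align*}
```

Under these assumptions, setting $\gamma = 0$ makes the confidence density uniform, and the $-\gamma \log c_i$ term is what grows when the confidence becomes low.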
In practice, as shown in Figure 1, we add at the end of the network an additional convolution followed by a Sigmoid layer to bound the output between 0 and 1. The new loss function is used instead of the default L1 loss.
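The following is a minimal sketch of how a sigmoid-bounded confidence could enter such a loss. The per-pixel form `|error| / b + log(b) - gamma * log(c)` with `b = a - k * c`, the constants `k`, `a`, `gamma`, and the function names are all hypothetical choices made for illustration, not the exact loss or settings of this paper.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def confidence_loss(pred, target, conf_logits, k=4.0, a=5.0, gamma=1.0, eps=1e-6):
    """Mean per-pixel loss under an assumed form (not the paper's exact loss).

    pred, target: sequences of predicted and ground-truth disparities.
    conf_logits:  raw outputs of the confidence branch, before the sigmoid.
    k, a, gamma:  hypothetical constants; a - k = 1 keeps the loss unchanged at c = 1.
    """
    total = 0.0
    for p, t, z in zip(pred, target, conf_logits):
        c = min(max(sigmoid(z), eps), 1.0)   # confidence, clamped away from 0 for the log
        b = a - k * c                        # assumed linear scale, between a - k and a
        total += abs(p - t) / b + math.log(b) - gamma * math.log(c)
    return total / len(pred)
```

With very large logits the confidence saturates at 1, the scale becomes 1, and the value reduces to the plain mean absolute error; with very negative logits the error term is divided by a scale close to `a`, which is the attenuation described above.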
3.4 Discussion on the formulation
Kendall and Gal [6] discussed this problem from a different point of view and arrived at a similar loss formulation. However, a key difference is that the formulation in [6] inherently assumes that the confidence follows a uniform distribution, which corresponds to a special case of our formulation.
Note that with the introduction of confidence, the loss defined in (19) is reweighted by the confidence. The optimal solution is achieved with high confidence in regions of accurate predictions and low confidence in regions of inaccurate predictions. For ease of understanding, we plot sample loss curves in Figure 2, where the horizontal axis is the confidence, the vertical axis is the total loss, the red curve corresponds to a large prediction error, and the blue curve corresponds to a small prediction error. Let us look at Figure 2(a) first. For the red curve with large prediction error, the optimal loss is achieved at a low confidence value. For the blue curve with small prediction error, the optimal loss is achieved at a high confidence value. Next, let us look at Figure 2(b), which uses a different setting of the confidence distribution parameter. For the red curve with the same large error as in Figure 2(a), the optimal confidence is no longer at the lowest value. In this case, the network does not fully give up on these hard regions. This is good for achieving strong performance on the training dataset, but may result in an overfitted model if the network cannot infer correct values in those regions. Again, it is important to stress that the confidence learning is not an ad hoc construction, but a consequence of maximum likelihood estimation (MLE).
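The behaviour described for Figure 2 can be explored numerically with a small sketch. It scans the confidence over a grid and reports where an assumed per-pixel loss, `error / b + log(b) - gamma * log(c)` with `b = a - k * c`, is smallest. The loss form, the constants `k = 4`, `a = 5`, the parameter name `gamma`, and the example errors are hypothetical and are not the values used to draw Figure 2.

```python
import math

def per_pixel_loss(error, c, k=4.0, a=5.0, gamma=0.0):
    """Assumed per-pixel loss as a function of confidence c (illustrative form only)."""
    b = a - k * c
    reg = -gamma * math.log(c) if gamma > 0 else 0.0
    return error / b + math.log(b) + reg

def best_confidence(error, gamma, steps=1000, k=4.0, a=5.0):
    """Grid search for the confidence that minimizes the assumed loss."""
    # c = 0 is excluded when gamma > 0 because log(0) diverges.
    start = 0 if gamma == 0 else 1
    grid = [i / steps for i in range(start, steps + 1)]
    return min(grid, key=lambda c: per_pixel_loss(error, c, k=k, a=a, gamma=gamma))

for error, gamma in [(20.0, 0.0), (0.5, 0.0), (20.0, 1.0)]:
    print(error, gamma, best_confidence(error, gamma))
```

Under these assumed constants, the large-error pixel is driven to the lowest confidence and the small-error pixel to the highest confidence when `gamma` is 0, while a positive `gamma` moves the optimum for the large-error pixel to an interior confidence, which mirrors the qualitative argument above.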
4 Experiments
We present experimental results of our confidence learning approach on the stereo matching task. First, we conduct an ablation study on the FlyingThings3D dataset [10] to verify the effectiveness of the proposed confidence learning. Then, we show that the confidence learning approach helps obtain a model with better generalization ability when a model pretrained on a synthetic dataset is deployed to a real-world dataset with different characteristics. Finally, we test our method on the KITTI Stereo 2015 dataset [11] to validate its effectiveness in a real-world scenario.
4.1 Evaluation Metrics
To evaluate the confidence map, we adopt the Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC) proposed by Hu and Mordohai [4]. Specifically, we calculate the disparity error rate of the most confident pixels at increasing densities, where the error rate is defined as the percentage of pixels with disparity error larger than a threshold in pixels. In our experiments, we set this threshold as suggested in [4] when plotting the ROC curve. The overall ROC curve is plotted with its values calculated from the entire dataset. The AUC is the area under the ROC curve; a lower AUC indicates a better ability of the confidence map to identify correct disparity pixels. Additionally, according to [4], for a given error rate $\varepsilon$ at full density, there exists an optimal AUC, which is calculated as,
$$A_{opt} = \varepsilon + (1 - \varepsilon)\ln(1 - \varepsilon) \qquad (20)$$
Since the error rate at full density differs among the compared methods, it is not fair to compare their AUC scores directly. As done in [4], we further calculate the ratio between the average AUC score and the optimal AUC score to compare the confidence generated by different methods.
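The evaluation procedure above can be sketched as follows. The density sampling (5% steps), the trapezoidal integration starting at the first sampled density, the default threshold of 1 pixel, and the function names are assumptions made for illustration; only the closed-form optimal AUC follows the expression from [4].

```python
import math

def roc_error_rates(conf, err, threshold=1.0, densities=None):
    """Error rate among the most confident pixels at each density (fraction kept)."""
    if densities is None:
        densities = [i / 20 for i in range(1, 21)]   # 5%, 10%, ..., 100% (assumed sampling)
    # Pixel indices ordered from most to least confident.
    order = sorted(range(len(conf)), key=lambda i: -conf[i])
    rates = []
    for d in densities:
        n = max(1, round(d * len(order)))
        bad = sum(1 for i in order[:n] if err[i] > threshold)
        rates.append(bad / n)
    return densities, rates

def auc(densities, rates):
    """Trapezoidal area under the error-rate-versus-density curve."""
    return sum((densities[j] - densities[j - 1]) * (rates[j] + rates[j - 1]) / 2
               for j in range(1, len(densities)))

def optimal_auc(eps):
    """Optimal AUC for an error rate eps at full density."""
    if eps <= 0.0:
        return 0.0
    if eps >= 1.0:
        return 1.0
    return eps + (1.0 - eps) * math.log(1.0 - eps)
```

For instance, `optimal_auc(0.1420)` evaluates to roughly 0.0106. Because the trapezoidal sum here starts at the first sampled density rather than at zero, it slightly underestimates the area compared with an integral over the full density range.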
To evaluate the disparity map, the first metric that we use is the commonly adopted end-point error (EPE), which is calculated as the absolute difference between the predicted disparity and the ground-truth disparity. The second metric is the error rate, which is calculated as the percentage of disparities whose EPE is larger than a threshold in pixels (1, 3, and 5 pixels in our tables). Note that at the threshold used for the ROC curve, the error rate equals that of the ROC curve at full density.
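Both disparity metrics can be sketched in a few lines. The function names and the convention of marking pixels without ground truth as `None` are assumptions made for illustration.

```python
def _abs_errors(pred, gt):
    """Absolute disparity errors over pixels that have ground truth (None = missing)."""
    return [abs(p - g) for p, g in zip(pred, gt) if g is not None]

def end_point_error(pred, gt):
    """Mean end-point error over valid pixels."""
    errs = _abs_errors(pred, gt)
    return sum(errs) / len(errs)

def error_rate(pred, gt, threshold):
    """Fraction of valid pixels whose end-point error exceeds the threshold (in pixels)."""
    errs = _abs_errors(pred, gt)
    return sum(1 for e in errs if e > threshold) / len(errs)

pred = [10.0, 20.0, 30.0, 40.0]
gt = [10.5, 24.0, None, 40.0]
print(end_point_error(pred, gt), error_rate(pred, gt, 1.0), error_rate(pred, gt, 5.0))
```

Skipping pixels without ground truth matters for datasets with sparse annotations, where only a subset of pixels can be scored.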
4.2 Ablation Studies
We conducted several ablation experiments on the FlyingThings3D [10] dataset to justify the proposed approach. FlyingThings3D is a synthetic dataset containing training image pairs and test image pairs. Dense ground-truth disparities are provided for both training and test images. In addition, the dataset contains a large portion of hard regions (e.g., occlusions, featureless regions), which makes it suitable for evaluating confidence inference approaches.
Table 1
Methods  Conf. param.  Optimal AUC  AUC  Ratio  EPE  1px Error  3px Error  5px Error
Our Approach  0  0.0106  0.0478  0.2218  1.4436  0.1420  0.06488  0.04609
Our Approach  0.5  0.0104  0.0520  0.2003  1.4387  0.1409  0.06423  0.04547
Our Approach  1  0.0094  0.0588  0.1599  1.3945  0.1337  0.06077  0.04284
Our Approach  2  0.0113  0.0612  0.1846  1.4322  0.1466  0.06418  0.04495
Our Approach  5  0.0118  0.0782  0.1509  1.4274  0.1499  0.06431  0.04467
DispFullNet [12]  -  -  -  -  1.6689  0.1960  0.07836  0.05235
4.2.1 Model Architecture
We use DispFullNet [12] as our baseline model. The network architecture is similar to DispNetC [10] but generates the disparity map at the original resolution. To estimate the confidence map, we add a confidence inference module for each output scale to suit the multi-scale disparity training scheme in DispFullNet, as shown in Figure 1. The output confidence map with the largest resolution is used for evaluation.
All of our models are finetuned from the pretrained weights given in [12], using the Caffe deep learning framework [5]. All models are optimized using the Adam method [7] with β1 = 0.9 and a batch size of 8. A multistep learning rate is adopted during training: the learning rate is reduced by half at three scheduled iterations, and training is stopped after a fixed number of iterations. The input images are randomly cropped during training and resized during testing.
4.2.2 Parameters in Confidence Learning
As mentioned in Section 3.3, we assume that the confidence follows a non-decreasing distribution, as described in (11). In particular, we evaluate performance with the confidence distribution parameter set to 0, 0.5, 1, 2, and 5, and with fixed values of the two constants in (8). We choose this formulation for several reasons. First, a linear function is simple, so we do not introduce additional complexity into the experiments. Second, by restricting the variance to a finite range, stability is guaranteed during optimization.
From the disparity error metrics (EPE, 1px Error, 3px Error, 5px Error) in Table 1, the best disparity results are all achieved when the confidence distribution parameter equals 1. A possible explanation is that a relatively large parameter value prevents the aggressive diminishing of gradients in large-error regions during backpropagation. Thus, in terms of disparity estimation, we observe a decreased EPE and, consequently, a lower error rate. However, a larger value (i.e., more weight on the confidence regularization) comes at a cost: it makes the model tend to assign high confidence in order to reduce the confidence regularization, and consequently larger gradients may be included during backpropagation.
From the confidence error metrics (the ratio of the optimal AUC to the AUC) in Table 1, we observe that the best confidence inference result is achieved when the confidence distribution parameter equals 0. This can be explained by the previous observation that a small parameter value aggressively rejects large-error regions.
Finally, compared to the baseline model, our methods have significantly lower EPE and lower error rates by a large margin on all metrics, which suggests that our approach helps the model reach a better convergence state.
Table 2
Method  Conf. param.  Optimal AUC  AUC  Ratio  EPE  1px Error  3px Error  5px Error
Ours  0  0.1103  0.2850  0.3870  6.6605  0.4366  0.2008  0.1449
Ours  1  0.1154  0.3067  0.3763  6.8425  0.4402  0.2047  0.1457
DispFullNet [12]  -  -  -  -  9.5944  0.4727  0.2686  0.2064
4.3 Middlebury Dataset
We use the Middlebury 2014 [16] dataset to test the generalization ability of our approach. We use the models trained in Section 4.2 directly, without further finetuning. Considering the search range of the models and the available computation capacity, we resize the images in our evaluations. The results are summarized in Table 2. The ROC curves for the two evaluated settings of the confidence distribution parameter (0 and 1) are presented in Fig. 6. Compared to the original DispFullNet model, our approach achieves significantly lower EPE and error rates. This indicates that our confidence learning approach reduces overfitting to the given dataset and thus generalizes better. Meanwhile, notice that the setting 0 is better than the setting 1 in both the confidence measure (higher ratio) and disparity estimation (lower EPE), which indicates that the model is more robust. Some visual results are presented in Fig. 5. A possible explanation is that by focusing more on normal regions, the CNN is able to learn with less perturbation from noise, such as occlusions and textureless regions, which results in better generalization when the model is deployed directly to a different domain without any domain adaptation techniques, such as finetuning.
4.4 KITTI Dataset
We also applied our approach to the KITTI Stereo 2015 [11] dataset. Note that the ground truth of this dataset is sparse. There is no ground-truth information in some areas, such as object edges and occlusion regions, which is undesirable for confidence evaluation and training, as the regions most important for confidence learning are largely missing. Moreover, the sparse ground truth prevents us from evaluating the confidence map, so only disparity is assessed in this section.
In detail, we deploy the models pretrained solely on the FlyingThings3D [10] dataset in two ways: without finetuning and with finetuning.
Table 3
Method  Ensemble  Conf. param.  Val EPE (no finetuning)  Val D1-all (no finetuning)  Val EPE (finetuned)  Val D1-all (finetuned)  Test D1-bg (finetuned)  Test D1-fg (finetuned)  Test D1-all (finetuned)
DispFullNet [12]  -  -  1.512  10.14  0.727  2.34  3.25  4.21  3.41
Ours  No  0  1.440  8.38  0.703  2.23  3.01  6.78  3.65
Ours  No  1  1.466  8.76  0.695  2.22  2.88  5.96  3.39
Ours  Yes  0  1.313  7.68  0.671  2.01  2.93  4.97  3.27
Ours  Yes  1  1.331  7.84  0.668  2.00  2.83  4.64  3.13
4.4.1 Confidence-guided ensemble scheme
Theoretically, our approach forces the model to focus more on high-confidence regions, so it is possible that the model makes bad predictions in low-confidence regions, which degrades the overall performance. To overcome this shortcoming and show that our confidence is reasonable, one straightforward solution is to adopt a confidence-guided ensemble scheme, in which we replace the low-confidence predictions with the corresponding estimates from the baseline model (DispFullNet [12]). Specifically, the disparities with the lowest confidence are replaced in our experiments. The evaluations are summarized in Table 3.
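A minimal sketch of such a confidence-guided ensemble is given below. The function name and the `replace_fraction` argument are hypothetical; the fraction of pixels to replace is left as a free parameter.

```python
def confidence_guided_ensemble(ours, baseline, conf, replace_fraction):
    """Replace the least-confident fraction of our disparities with baseline disparities.

    ours, baseline: per-pixel disparities from the two models (same length).
    conf:           per-pixel confidence predicted alongside `ours`.
    replace_fraction: hypothetical share of pixels, in [0, 1], to take from the baseline.
    """
    n = len(ours)
    n_replace = int(replace_fraction * n)
    # Pixel indices ordered from least to most confident.
    order = sorted(range(n), key=lambda i: conf[i])
    fused = list(ours)
    for i in order[:n_replace]:
        fused[i] = baseline[i]
    return fused

fused = confidence_guided_ensemble(
    ours=[1.0, 2.0, 3.0, 4.0],
    baseline=[10.0, 20.0, 30.0, 40.0],
    conf=[0.9, 0.1, 0.5, 0.2],
    replace_fraction=0.5,
)
print(fused)
```

In this toy call the two pixels with the lowest confidence (0.1 and 0.2) take the baseline values, while the other two keep the original predictions.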
4.4.2 Experiment observation
Without finetuning: The model is directly deployed to the validation dataset, with the input image resized. Our method produces better results than DispFullNet [12] without finetuning. Meanwhile, as shown in the "Without finetuning" columns of Table 3, the confidence distribution parameter 0 slightly outperforms 1, which is consistent with our observations on the Middlebury [16] dataset in Sec. 4.3.
With finetuning: For the finetuning experiments, we use a two-stage learning rate schedule. The input images are randomly cropped as data augmentation, with a batch size of 6 during training. In the testing stage, the images are resized. A visual example is presented in Fig. 7. Interestingly, after finetuning, the parameter value 1 becomes better than 0, and on the test dataset the value 0 is even outperformed by DispFullNet [12]. This supports our formulation of confidence in Sec. 3.3: to train a good model, a proper value of the confidence distribution parameter is important in determining to what extent the aggressive gradients should be diminished in hard regions.
With confidence-guided ensemble scheme: By adopting the confidence-guided ensemble scheme described above, an evident decrease in EPE and D1-all is observed in all cases. This suggests that our method locates reasonable low-confidence regions and, at the same time, predicts better disparity in the high-confidence regions.
5 Discussion and Conclusion
In this work, we propose a confidence inference method with a probabilistic interpretation. We show, both analytically and experimentally, that a proper confidence can be inferred. At the same time, the model can reach an even better convergence state. The inferred confidence can be employed to facilitate decision-making or post-processing tasks. Though the proposed method is applied to stereo matching, we believe the same theory can be extended to other regression problems.
References
 [1] G. Egnal, M. Mintz, and R. P. Wildes. A stereo confidence metric using single view imagery with comparison to five alternative approaches. Image and vision computing, 22(12):943–957, 2004.
 [2] P. Gurevich and H. Stuke. Learning uncertainty in regression tasks by artificial neural networks. arXiv preprint arXiv:1707.07287, 2017.
 [3] R. Haeusler, R. Nair, and D. Kondermann. Ensemble learning for confidence measures in stereo vision. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 305–312. IEEE, 2013.
 [4] X. Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. IEEE transactions on pattern analysis and machine intelligence, 34(11):2121–2133, 2012.
 [5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
 [6] A. Kendall and Y. Gal. What uncertainties do we need in bayesian deep learning for computer vision? In Advances in Neural Information Processing Systems, pages 5580–5590, 2017.
 [7] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [8] Z. Liang, Y. Feng, Y. Guo, H. Liu, L. Qiao, W. Chen, L. Zhou, and J. Zhang. Learning deep correspondence through prior and posterior feature constancy. arXiv preprint arXiv:1712.01039, 2018.
 [9] R. Manduchi and C. Tomasi. Distinctiveness maps for image matching. In Image Analysis and Processing, 1999. Proceedings. International Conference on, pages 26–31. IEEE, 1999.
 [10] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [11] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

 [12] J. Pang, W. Sun, J. Ren, C. Yang, and Q. Yan. Cascade residual learning: A two-stage convolutional neural network for stereo matching. In International Conf. on Computer Vision Workshop on Geometry Meets Deep Learning (ICCVW 2017), volume 3, 2017.
 [13] M.G. Park and K.J. Yoon. Leveraging stereo matching with learning-based confidence measures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 101–109, 2015.
 [14] M. Poggi and S. Mattoccia. Learning a general-purpose confidence measure based on O(1) features and a smarter aggregation strategy for semi global matching. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 509–518. IEEE, 2016.
 [15] M. Poggi, F. Tosi, and S. Mattoccia. Quantitative evaluation of confidence measures in a machine learning world. In International Conference on Computer Vision (ICCV 2017), 2017.
 [16] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, pages 31–42. Springer, 2014.
 [17] D. Scharstein and R. Szeliski. Stereo matching with nonlinear diffusion. International journal of computer vision, 28(2):155–174, 1998.
 [18] A. Seki and M. Pollefeys. Patch based confidence prediction for dense disparity map. In Proceedings of the 27th British Conference on Machine Vision, BMVC, 2016.
 [19] A. Shaked and L. Wolf. Improved stereo matching with constant highway networks and reflective confidence learning. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4641–4650, 2017.

 [20] A. Spyropoulos and P. Mordohai. Correctness prediction, accuracy improvement and generalization of stereo matching using supervised learning. International Journal of Computer Vision, 118(3):300–318, 2016.
 [21] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 5, 2017.
 [22] L. Xu and J. Jia. Stereo matching: An outlier confidence approach. In European Conference on Computer Vision, pages 775–787. Springer, 2008.
 [23] K.J. Yoon and I. S. Kweon. Distinctive similarity measure for stereo matching under point ambiguity. Computer Vision and Image Understanding, 112(2):173–183, 2008.
 [24] Z. Zhang and Y. Shan. A progressive scheme for stereo matching. In European Workshop on 3D Structure from Multiple Images of Large-Scale Environments, pages 68–85. Springer, 2000.