Deep neural networks have achieved remarkable performance in various tasks, but they have critical limitations in the reliability of their predictions. For example, inference results are often overly confident even for unseen or ambiguous examples. Since many practical applications, including medical diagnosis, autonomous driving, and machine inspection, require accurate uncertainty estimation as well as high prediction accuracy for each inference, such overconfidence makes deep neural networks inappropriate for deployment in real-world problems despite their impressive accuracy.
Regularization is a common technique in training deep neural networks to avoid overfitting and improve generalization performance (Srivastava et al., 2014; Huang et al., 2016; Ioffe & Szegedy, 2015). Although regularization is effective for learning robust models, its objective is not directly related to producing score distributions aligned with the uncertainty of predictions. Hence, existing deep neural networks are often poor at calibrating prediction confidence against accuracy.
Our goal is to learn deep neural networks that are able to estimate the uncertainty of each prediction while maintaining accuracy. In other words, we propose a generic framework to calibrate prediction scores (confidence) with accuracy in deep neural networks. Our algorithm starts from the observation that the variance of prediction scores measured over multiple stochastic inferences is highly correlated with the accuracy and confidence of the prediction given by the average score. Based on its Bayesian interpretation, we employ stochastic regularization such as stochastic depth or dropout to obtain multiple stochastic inference results. By exploiting this empirical observation together with its theoretical interpretation, we design a novel loss function that enables a deep neural network to predict confidence-calibrated scores based only on a single prediction, without multiple stochastic inferences. Our contributions are summarized below:
We provide a generic framework to estimate uncertainty of a prediction based on stochastic inferences in deep neural networks, which is supported by empirical observations and theoretical analysis.
We propose a novel variance-weighted confidence-integrated loss function in a principled way, which enables deep neural networks to produce confidence-calibrated predictions even without performing stochastic inferences and introducing hyper-parameters.
The proposed framework achieves outstanding performance in reducing the overconfidence issue and estimating accurate uncertainty across various combinations of network architectures and datasets.
The rest of the paper is organized as follows. We first discuss prior research related to our algorithm and describe the theoretical background for the Bayesian interpretation of our approach in Sections 2 and 3, respectively. Section 4 presents our confidence calibration algorithm through stochastic inferences, and Section 5 illustrates experimental results.
2 Related Work
Uncertainty estimation is a critical problem in deep neural networks and receives growing attention from the machine learning community. The Bayesian approach is a common tool that provides a mathematical framework for uncertainty estimation in deep neural networks. However, exact Bayesian inference is not tractable in deep neural networks due to its high computational cost, and various approximate inference techniques have been proposed, including MCMC (Neal, 1996), Laplace approximation (MacKay, 1992) and variational inference (Barber & Bishop, 1998; Graves, 2011; Hoffman et al., 2013). Recently, Bayesian interpretations of multiplicative noise have been employed to estimate uncertainty in deep neural networks (Gal & Ghahramani, 2016; McClure & Kriegeskorte, 2016). There are also several approaches outside Bayesian modeling, including post-processing (Niculescu-Mizil & Caruana, 2005; Platt, 2000; Zadrozny & Elkan, 2001; Guo et al., 2017) and deep ensembles (Lakshminarayanan et al., 2017). All the post-processing methods require a hold-out validation set to adjust prediction scores after training, and the ensemble-based technique employs multiple models to estimate uncertainty.
Stochastic regularization is a common technique to improve generalization performance by injecting random noise into deep neural networks. The most notable method is dropout (Srivastava et al., 2014), which randomly drops hidden units by multiplying them with Bernoulli random noise. There exist several variants, for example, dropping weights (Wan et al., 2013) or skipping layers (Huang et al., 2016). Most stochastic regularization methods exploit stochastic inference during training but perform deterministic inference using the whole network during testing. In contrast, we also use stochastic inference at test time to obtain diverse and reliable outputs.
Although the following works do not address uncertainty estimation, their main ideas are related to our objective. Label smoothing (Szegedy et al., 2016) encourages models to be less confident by preventing a network from assigning the full probability to a single class. A similar loss function is discussed for training confidence-calibrated classifiers in Lee et al. (2018), but that work focuses on how to discriminate in-distribution from out-of-distribution examples, rather than on estimating uncertainty or alleviating the miscalibration of in-distribution examples. On the other hand, Pereyra et al. (2017) claims that blind label smoothing and entropy penalization, which integrate loss functions based on the same idea as Szegedy et al. (2016) and Lee et al. (2018), enhance accuracy, but the improvement is marginal in practice.
3 Bayesian Interpretation of Stochastic Regularization
This section describes a Bayesian interpretation of stochastic regularization in deep neural networks and discusses the relation between stochastic regularization and uncertainty modeling.
3.1 Stochastic Methods for Regularizations
One popular class of regularization techniques is stochastic regularization, which introduces random noise into a network to perturb its inputs or weights. We focus on multiplicative binary noise injection, where random binary noise is applied to the inputs or weights by elementwise multiplication, since such stochastic regularization techniques are widely used (Srivastava et al., 2014; Wan et al., 2013; Huang et al., 2016). Note that input perturbation can be reformulated as weight perturbation. For example, dropout, i.e., binary noise injection to activations, is interpretable as a weight perturbation that masks out all the weights associated with the dropped inputs. Therefore, if a classification network modeling $p(y \mid x, \omega)$ with parameters $\omega$ is trained with stochastic regularization by minimizing the cross entropy, the loss function can be defined by
$$\mathcal{L}_{\text{SR}}(\omega) = -\frac{1}{N}\sum_{i=1}^{N} \log p(y_i \mid x_i, \hat{\omega}_i), \qquad (1)$$
where $\hat{\omega}_i = \omega \odot \epsilon_i$ is the set of parameters perturbed by elementwise multiplication with a random noise sample $\epsilon_i$, and $(x_i, y_i) \in \mathcal{D}$ is a pair of input and output in the training dataset $\mathcal{D}$.
At inference time, the network is parameterized by the expectation of the perturbed parameters, $\mathbb{E}[\hat{\omega}]$, to predict an output $\hat{y}$ for an input $x$, i.e.,
$$\hat{y} = \operatorname*{argmax}_{y}\, p\big(y \mid x, \mathbb{E}[\hat{\omega}]\big). \qquad (2)$$
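The reformulation of input perturbation as weight perturbation can be checked numerically. The following NumPy sketch (illustrative shapes, not the paper's code) shows that dropping inputs is identical to masking the corresponding weight columns:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))            # weights mapping 3 inputs to 4 outputs
x = rng.normal(size=3)                 # input activations
eps = rng.binomial(1, 0.5, size=3)     # Bernoulli noise sample

# Dropout on the input: multiply the activations by the binary noise.
out_input_noise = W @ (x * eps)

# Equivalent weight perturbation: mask the columns of W tied to dropped inputs.
out_weight_noise = (W * eps[None, :]) @ x
```

The two outputs agree exactly, since each column of $W$ multiplies exactly one input unit.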
3.2 Bayesian Modeling
Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with $N$ examples, the Bayesian objective is to estimate the posterior distribution of the model parameter $\omega$, denoted by $p(\omega \mid \mathcal{D})$, to predict a label $y$ for an input $x$, which is given by
$$p(y \mid x, \mathcal{D}) = \int_{\omega} p(y \mid x, \omega)\, p(\omega \mid \mathcal{D})\, d\omega. \qquad (3)$$
A common technique for posterior estimation is variational approximation, which introduces an approximate distribution $q_\theta(\omega)$ and minimizes the Kullback-Leibler (KL) divergence with the true posterior, $\mathrm{KL}(q_\theta(\omega) \,\|\, p(\omega \mid \mathcal{D}))$, leading to the following objective:
$$\mathcal{L}_{\text{VA}}(\theta) = -\sum_{i=1}^{N} \int_{\omega} q_\theta(\omega) \log p(y_i \mid x_i, \omega)\, d\omega + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big). \qquad (4)$$
The intractable integral and summation over the entire dataset in Equation 4 are approximated by the Monte Carlo method and mini-batch optimization, resulting in
$$\hat{\mathcal{L}}_{\text{VA}}(\theta) = -\frac{N}{MT}\sum_{i=1}^{M}\sum_{t=1}^{T} \log p(y_i \mid x_i, \hat{\omega}_{i,t}) + \mathrm{KL}\big(q_\theta(\omega) \,\|\, p(\omega)\big), \qquad (5)$$
where $\hat{\omega}_{i,t} \sim q_\theta(\omega)$ is a sample from the approximate distribution, $T$ is the number of samples, and $M$ is the size of a mini-batch. Note that the first term is the data likelihood and the second term is the divergence of the approximate distribution with respect to the prior distribution.
3.3 Interpreting Stochastic Regularization as a Bayesian Model
Suppose that we train a classifier with $\ell_2$ regularization by a stochastic gradient descent method. Then, the loss function in Equation 1 is rewritten as
$$\hat{\mathcal{L}}_{\text{SR}}(\omega) = -\frac{1}{M}\sum_{i=1}^{M} \log p(y_i \mid x_i, \hat{\omega}_i) + \lambda \|\omega\|_2^2, \qquad (6)$$
where $\ell_2$ regularization is applied to the deterministic parameters $\omega$ with weight $\lambda$. Optimizing this loss function is equivalent to optimizing Equation 5 if there exists a proper prior $p(\omega)$ and $q_\theta(\omega)$ is approximated as a Gaussian mixture distribution (Gal & Ghahramani, 2016). Note that Gal & Ghahramani (2016) cast dropout training as approximate Bayesian inference; we can interpret training with stochastic depth (Huang et al., 2016) within the same framework by a simple modification. (See Appendix A and B for details.) Then, the predictive distribution of a model trained with stochastic regularization is approximately given by
$$p(y \mid x) \approx \frac{1}{T}\sum_{t=1}^{T} p(y \mid x, \hat{\omega}_t), \qquad \hat{\omega}_t \sim q_\theta(\omega), \qquad (8)$$
where $p(y \mid x, \hat{\omega}_t)$ denotes a score vector over the class labels. Equation 8 means that the average prediction and its variance can be computed directly from multiple stochastic inferences.
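The average prediction and its per-class variance from multiple stochastic inferences can be sketched as follows (a toy linear classifier with inverted dropout on the weights; the shapes and keep probability are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_inference(x, W, rng, T=30, p_keep=0.5):
    """Average prediction and per-class variance over T noisy forward passes
    of a single linear classifier with multiplicative binary weight noise."""
    scores = []
    for _ in range(T):
        noise = rng.binomial(1, p_keep, size=W.shape) / p_keep  # inverted dropout
        scores.append(softmax((W * noise) @ x))
    scores = np.stack(scores)               # (T, C)
    return scores.mean(axis=0), scores.var(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))                 # 5 classes, 8 features (illustrative)
mean_pred, var_pred = stochastic_inference(rng.normal(size=8), W, rng)
```

The mean over the $T$ softmax outputs is itself a valid distribution, and the per-class variance is the quantity correlated with reliability in Section 4.1.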
4 Confidence Calibration through Stochastic Inference
We present a novel confidence calibration technique for prediction in deep neural networks, which is given by a variance-weighted confidence-integrated loss function. We describe our observation that the variance of multiple stochastic inferences is closely related to the accuracy and confidence of predictions, and provide an end-to-end training framework for confidence self-calibration. Prediction accuracy and uncertainty are then directly accessible from the scores obtained in a single forward pass. This section presents our observations from stochastic inferences and the technical details of our confidence calibration technique.
4.1 Empirical Observations
Equation 8 suggests that the variation across sampled models yields the variance of multiple stochastic predictions for a single example. Figure 1 presents how the variance of multiple stochastic inferences given by stochastic depth or dropout is related to the accuracy and confidence of the corresponding average prediction, where the confidence is measured by the maximum score of the average prediction. In the figure, the accuracy and score of each bin are computed over the examples belonging to the corresponding bin of the normalized variance. We present results on CIFAR-100 with ResNet-34 and VGGNet with 16 layers. The histograms illustrate a strong correlation between the predicted variance and the reliability, i.e., accuracy and confidence, of a prediction; hence we can effectively estimate the accuracy and uncertainty of an example based on the variance of its predictions from multiple stochastic inferences.
4.2 Variance-Weighted Confidence-Integrated Loss
The strong correlation of accuracy and confidence with the predicted variance observed in Figure 1 shows great potential for confidence-calibrated prediction via stochastic inferences. However, computing the variance requires multiple forward passes, which incurs additional computational cost and may produce inconsistent results.
To overcome these limitations, we propose a generic framework for training accuracy-score-calibrated networks whose prediction scores from a single forward pass directly provide the confidence of a prediction. This objective is achieved by designing a loss function that augments the standard cross-entropy loss with a confidence-calibration term, where the two terms are balanced by the variance measured from multiple stochastic inferences. Specifically, our variance-weighted confidence-integrated loss for the whole training dataset is defined by a linear interpolation of the standard cross-entropy loss with the ground-truth label and the cross-entropy with the uniform distribution $\mathcal{U}(y)$, which is formally given by
$$\mathcal{L}_{\text{VWCI}}(\omega) = \frac{1}{T}\sum_{i=1}^{N}\sum_{t=1}^{T} \Big[ -(1-\alpha_i) \log p(y_i \mid x_i, \hat{\omega}_{i,t}) + \alpha_i\, \mathrm{KL}\big(\mathcal{U}(y) \,\|\, p(y \mid x_i, \hat{\omega}_{i,t})\big) + \xi_i \Big], \qquad (9)$$
where $\alpha_i$ is a normalized variance, $\hat{\omega}_{i,t}$ is a sampled model parameter with binary noise for a stochastic prediction, $T$ is the number of stochastic inferences, and $\xi_i$ is a constant.
The two terms in our variance-weighted confidence-integrated loss push the network in opposite directions; the first term encourages the network to fit the ground-truth label, while the second term forces the network to make predictions close to the uniform distribution. The terms are linearly interpolated by an instance-specific balancing coefficient $\alpha_i$, obtained by normalizing the prediction variance of each example over multiple stochastic inferences. Note that the normalized variance $\alpha_i$ is distinct for each training example and is used to measure model uncertainty. Therefore, optimizing our loss function produces gradient signals that push predictions toward the uniform distribution for examples with high uncertainty, indicated by high variance, while intensifying the prediction confidence of examples with low variance.
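The interpolation above can be sketched in a few lines of NumPy. This is our reading of the objective, with cross-entropy against the uniform distribution standing in for the KL term (the two differ only by the constant $\log C$); shapes and names are illustrative:

```python
import numpy as np

def vwci_loss(stoch_probs, labels, alpha):
    """Variance-weighted confidence-integrated loss (sketch).

    stoch_probs: (N, T, C) class probabilities from T stochastic inferences
    labels:      (N,)     ground-truth class indices
    alpha:       (N,)     normalized per-example variance in [0, 1]
    """
    N, T, C = stoch_probs.shape
    log_p = np.log(stoch_probs + 1e-12)
    # Cross-entropy with the ground truth, averaged over the T samples.
    ce_gt = -log_p[np.arange(N)[:, None], np.arange(T)[None, :],
                   labels[:, None]].mean(axis=1)
    # Cross-entropy with the uniform distribution over C classes.
    ce_uniform = -log_p.mean(axis=(1, 2))
    # Instance-specific interpolation by the normalized variance alpha.
    return float(((1.0 - alpha) * ce_gt + alpha * ce_uniform).mean())
```

With $\alpha_i = 0$ the loss reduces to the ordinary cross-entropy, and with $\alpha_i = 1$ it only pulls the prediction toward uniformity.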
By training deep neural networks with the proposed loss function, we can estimate the uncertainty of each test example with a single forward pass. Unlike ordinary models, the prediction score of our model is well calibrated and represents the confidence of the prediction, which means that we can rely more on predictions with high scores.
4.3 Confidence-Integrated Loss
Our claim is that an adaptive combination of the cross-entropy losses with respect to the ground truth and the uniform distribution is a reasonable method for learning uncertainty. As a special case of our variance-weighted confidence-integrated loss, we also present a blind version of the combination, which can serve as a baseline uncertainty estimation technique. This baseline loss function is referred to as the confidence-integrated loss, which is given by
$$\mathcal{L}_{\text{CI}}(\omega) = \sum_{i=1}^{N} \Big[ -\log p(y_i \mid x_i, \omega) + \beta\, \mathrm{KL}\big(\mathcal{U}(y) \,\|\, p(y \mid x_i, \omega)\big) \Big], \qquad (10)$$
where $p(y \mid x_i, \omega)$ is the predicted distribution with model parameter $\omega$ and $\beta$ is a constant. The main idea of this loss function is to regularize with the uniform distribution, expecting the score distributions of uncertain examples to flatten first while the distributions of confident ones remain intact, where the impact of the regularization term is controlled by the global hyper-parameter $\beta$.
A loss function of this form is also employed in Pereyra et al. (2017) to regularize deep neural networks and improve classification accuracy. However, Pereyra et al. (2017) does not discuss confidence calibration and reports only marginal accuracy improvement. On the other hand, Lee et al. (2018) discusses a similar loss function but focuses on differentiating between in-distribution and out-of-distribution examples by measuring the loss of each example with only one of the two terms, depending on its origin.
Contrary to the existing approaches, we employ the loss function in Equation 10 to estimate prediction confidence in deep neural networks. Although the confidence-integrated loss makes intuitive sense, the blind selection of the hyper-parameter $\beta$ limits its generality compared to our variance-weighted confidence-integrated loss.
4.4 Relation to Other Calibration Approaches
There are several score calibration techniques (Guo et al., 2017; Zadrozny & Elkan, 2002; Naeini et al., 2015; Niculescu-Mizil & Caruana, 2005) that adjust confidence scores through post-processing. Among them, Guo et al. (2017) propose a method to calibrate the confidence of predictions by scaling the logits of a network with a global temperature $\tau$. The scaling is performed before applying the softmax function, and $\tau$ is trained on a validation dataset. As discussed in Guo et al. (2017), this simple technique is equivalent to maximizing the entropy of the output distribution $p(y \mid x_i)$. It is also identical to minimizing the KL-divergence from the uniform distribution because
$$\mathrm{KL}\big(p(y \mid x_i) \,\|\, \mathcal{U}(y)\big) = -\mathcal{H}\big(p(y \mid x_i)\big) + \xi', \qquad (11)$$
where $\xi'$ is a constant. We can therefore formulate another confidence-integrated loss with the entropy as
$$\mathcal{L}_{\text{CI}'}(\omega) = \sum_{i=1}^{N} \Big[ -\log p(y_i \mid x_i, \omega) - \gamma\, \mathcal{H}\big(p(y \mid x_i, \omega)\big) \Big], \qquad (12)$$
where $\gamma$ is a constant balancing the two terms.
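Temperature scaling itself is simple to sketch; the snippet below (illustrative, with a plain grid search standing in for the NLL optimizer used by Guo et al.) selects a single temperature on held-out logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the temperature minimizing NLL on a held-out set (grid search)."""
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        probs = softmax(logits / t)
        nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

For an overconfident model, the fitted temperature exceeds 1, flattening the softmax output without changing the predicted class.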
5 Experiments
5.1 Experimental Setting and Implementation Details
We choose four of the most widely used deep neural network architectures to test our framework: ResNet (He et al., 2016), VGGNet (Simonyan & Zisserman, 2015), WideResNet (Zagoruyko & Komodakis, 2016) and DenseNet (Huang et al., 2017). We employ stochastic depth in ResNet, as proposed in Huang et al. (2016), and dropout (Srivastava et al., 2014) before every fully connected layer except the classification layer in the other architectures. Note that, as discussed in Section 3.3, since both stochastic depth and dropout inject multiplicative binary noise into within-layer activations or residual blocks, they are equivalent to noise injection into network weights. Hence, training with an $\ell_2$ regularization term enables us to interpret stochastic depth and dropout as Bayesian models.
We evaluate the proposed framework on two benchmarks, Tiny ImageNet and CIFAR-100. Tiny ImageNet contains 64×64 images from 200 object classes, whereas CIFAR-100 consists of 32×32 images from 100 object classes. There are 500 training images per class in both datasets. For testing, we use the validation set of Tiny ImageNet and the test set of CIFAR-100, which contain 50 and 100 images per class, respectively. To test the two benchmarks with the same architecture, we resize the images in Tiny ImageNet to 32×32.
All networks are trained by stochastic gradient descent with momentum 0.9 for 300 epochs. We set the initial learning rate to 0.1 and decay it by a factor of 0.2 at epochs 60, 120, 160, 200 and 250. Each batch consists of 64 training examples for ResNet, WideResNet and DenseNet, and 256 for the VGG architecture. To train networks with the proposed variance-weighted confidence-integrated loss, we draw $T$ samples of network parameters for each input image and compute the normalized variance $\alpha_i$ by running $T$ forward passes; the number of samples $T$ is set to 5. The normalized variance is estimated from the Bhattacharyya coefficients between the individual predictions and the average prediction.
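A Bhattacharyya-based normalized variance can be sketched as follows. The exact normalization used in the paper is not spelled out in this section, so the mapping below (one minus the mean Bhattacharyya coefficient between each stochastic prediction and the average prediction) is one plausible reading:

```python
import numpy as np

def normalized_variance(stoch_probs):
    """stoch_probs: (T, C) probabilities from T stochastic inferences.
    Returns a scalar in [0, 1]: 0 when all inferences agree exactly,
    larger when they disagree (assumed normalization)."""
    mean_p = stoch_probs.mean(axis=0)                          # (C,)
    # Bhattacharyya coefficient of each inference with the average prediction.
    bc = np.sqrt(stoch_probs * mean_p[None, :]).sum(axis=1)    # (T,)
    return float(1.0 - bc.mean())
```

The Bhattacharyya coefficient of a distribution with itself is 1, so identical stochastic predictions give a normalized variance of 0.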
5.2 Evaluation Metric
We measure classification accuracy and four calibration scores, namely expected calibration error (ECE), maximum calibration error (MCE), negative log likelihood (NLL) and Brier score, of the trained models. Let $B_m$ be the set of indices of test examples whose scores for the ground-truth labels fall into the interval $\left(\frac{m-1}{M}, \frac{m}{M}\right]$, where $M$ is the number of bins. ECE and MCE are formally defined by
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N'} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|, \qquad \mathrm{MCE} = \max_{m \in \{1, \dots, M\}} \big| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \big|,$$
where $N'$ is the number of test samples. The accuracy and confidence of each bin are given by
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i), \qquad \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i,$$
where $\mathbb{1}$ is an indicator function, $\hat{y}_i$ and $y_i$ are the predicted and true labels of the $i$-th example, and $\hat{p}_i$ is its predicted confidence. NLL and Brier score are two additional calibration metrics, defined as
$$\mathrm{NLL} = -\sum_{i=1}^{N'} \log p(y_i \mid x_i, \omega), \qquad \mathrm{Brier} = \frac{1}{N'} \sum_{i=1}^{N'} \sum_{j=1}^{C} \big( p(y_j \mid x_i, \omega) - \mathbb{1}(y_j = y_i) \big)^2,$$
where $C$ is the number of classes. Note that lower values of all these calibration scores indicate a better-calibrated network.
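The binned calibration errors above translate directly into code; the following sketch computes ECE and MCE from top-label confidences and correctness indicators:

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected and maximum calibration error from top-label confidences.

    confidences: (N',) predicted confidence of each example
    correct:     (N',) 1.0 if the prediction was correct, else 0.0
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # Gap between empirical accuracy and mean confidence in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap
            mce = max(mce, gap)
    return ece, mce
```

A perfectly calibrated model (confidence equal to empirical accuracy in every bin) gives zero for both scores.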
Table 1 (excerpt): average CI results, reported as mean ± standard deviation over the five choices of $\beta$.

| Dataset | Architecture | Method | Accuracy (%) | ECE | MCE | NLL | Brier |
|---|---|---|---|---|---|---|---|
| Tiny ImageNet | ResNet-34 | CI | 50.09 ± 1.08 | 0.134 ± 0.079 | 0.257 ± 0.098 | 2.270 ± 0.212 | 0.665 ± 0.037 |
| Tiny ImageNet | VGG-16 | CI | 46.82 ± 0.81 | 0.226 ± 0.095 | 0.435 ± 0.107 | 3.224 ± 0.468 | 0.761 ± 0.054 |
| Tiny ImageNet | WideResNet-16-8 | CI | 55.80 ± 0.44 | 0.115 ± 0.040 | 0.288 ± 0.100 | 1.980 ± 0.114 | 0.594 ± 0.017 |
| Tiny ImageNet | DenseNet-40-12 | CI | 40.18 ± 1.68 | 0.059 ± 0.061 | 0.152 ± 0.082 | 2.606 ± 0.208 | 0.748 ± 0.035 |
| CIFAR-100 | ResNet-34 | CI | 77.56 ± 0.60 | 0.134 ± 0.131 | 0.251 ± 0.128 | 1.064 ± 0.217 | 0.360 ± 0.057 |
| CIFAR-100 | VGG-16 | CI | 73.75 ± 0.35 | 0.183 ± 0.079 | 0.489 ± 0.214 | 1.526 ± 0.175 | 0.436 ± 0.034 |
| CIFAR-100 | WideResNet-16-8 | CI | 77.35 ± 0.21 | 0.133 ± 0.091 | 0.297 ± 0.108 | 1.062 ± 0.180 | 0.356 ± 0.044 |
| CIFAR-100 | DenseNet-40-12 | CI | 64.72 ± 1.46 | 0.070 ± 0.040 | 0.138 ± 0.055 | 1.312 ± 0.125 | 0.482 ± 0.028 |
For CI, we present the mean and standard deviation of results from models trained with five different $\beta$'s. In addition, we also show results from the oracle CI loss, CI[Oracle], which are the most optimistic values out of the results from all $\beta$'s in the individual columns. Note that the numbers corresponding to CI[Oracle] may come from different $\beta$'s. Refer to Appendix C for the full results.
Table 2: comparison with temperature scaling (TS) in the two scenarios; columns are accuracy (%), ECE, MCE, NLL and Brier score.

| Dataset | Architecture | Method | Accuracy (%) | ECE | MCE | NLL | Brier |
|---|---|---|---|---|---|---|---|
| Tiny ImageNet | ResNet-34 | TS (case 1) | 50.82 | 0.162 | 0.272 | 2.241 | 0.660 |
| Tiny ImageNet | ResNet-34 | TS (case 2) | 47.20 | 0.021 | 0.080 | 2.159 | 0.661 |
| Tiny ImageNet | VGG-16 | TS (case 1) | 46.58 | 0.358 | 0.604 | 4.425 | 0.855 |
| Tiny ImageNet | VGG-16 | TS (case 2) | 46.53 | 0.028 | 0.067 | 2.361 | 0.671 |
| Tiny ImageNet | WideResNet-16-8 | TS (case 1) | 55.92 | 0.200 | 0.335 | 2.259 | 0.627 |
| Tiny ImageNet | WideResNet-16-8 | TS (case 2) | 53.95 | 0.027 | 0.224 | 1.925 | 0.595 |
| Tiny ImageNet | DenseNet-40-12 | TS (case 1) | 42.50 | 0.037 | 0.456 | 2.436 | 0.717 |
| Tiny ImageNet | DenseNet-40-12 | TS (case 2) | 41.63 | 0.024 | 0.109 | 2.483 | 0.728 |
| CIFAR-100 | ResNet-34 | TS (case 1) | 77.67 | 0.133 | 0.356 | 1.162 | 0.354 |
| CIFAR-100 | ResNet-34 | TS (case 2) | 77.40 | 0.036 | 0.165 | 0.886 | 0.323 |
| CIFAR-100 | VGG-16 | TS (case 1) | 73.66 | 0.197 | 0.499 | 1.770 | 0.445 |
| CIFAR-100 | VGG-16 | TS (case 2) | 72.69 | 0.031 | 0.074 | 1.193 | 0.389 |
| CIFAR-100 | WideResNet-16-8 | TS (case 1) | 77.52 | 0.144 | 0.400 | 1.285 | 0.361 |
| CIFAR-100 | WideResNet-16-8 | TS (case 2) | 76.42 | 0.028 | 0.101 | 0.891 | 0.332 |
| CIFAR-100 | DenseNet-40-12 | TS (case 1) | 65.91 | 0.095 | 0.165 | 1.274 | 0.468 |
| CIFAR-100 | DenseNet-40-12 | TS (case 2) | 64.96 | 0.082 | 0.163 | 1.306 | 0.481 |
Table 1 presents accuracy and calibration scores for several combinations of network architectures and benchmark datasets. The models trained with the VWCI loss consistently outperform both the models with the CI loss, which are special cases of VWCI, and the baseline in terms of classification accuracy and confidence calibration performance. Performance of CI is given by the average and variance over 5 different values of $\beta$, which are selected favorably to CI based on our preliminary experiments, and CI[Oracle] denotes the most optimistic value among the 5 cases in each column. Note that VWCI presents outstanding results in most cases even when compared with CI[Oracle], and that the performance of CI is sensitive to the choice of $\beta$ (see Appendix C for details). These results imply that the proposed loss function balances the two conflicting loss terms effectively using the variance of multiple stochastic inferences, whereas the performance of CI varies with the hyper-parameter setting in each dataset.
We also compare the proposed framework with the state-of-the-art post-processing method, temperature scaling (TS) (Guo et al., 2017). The main distinction between post-processing methods and our work is the need for a held-out dataset: our method calibrates scores during training without additional data, while Guo et al. (2017) requires a held-out validation set to calibrate scores. To illustrate the effectiveness of our framework, we compare our approach with TS in two scenarios: 1) using the entire training set for both training and calibration, and 2) using 90% of the training set for training and the remaining 10% for calibration. Table 2 shows that case 1 suffers from poor calibration performance and that case 2 loses accuracy substantially due to the reduction of training data, although it achieves calibration scores comparable to VWCI. VWCI, on the other hand, presents consistently good results in terms of both classification accuracy and calibration performance.
Figure 2: coverage of ResNet-34 models with respect to the confidence interval on Tiny ImageNet (left) and CIFAR-100 (right). Coverage is computed as the portion of examples with accuracy and confidence higher than the thresholds shown on the x-axis. To compare VWCI to CI[Oracle], we present results from multiple CI models with the oracle $\beta$'s for the individual metrics, which are shown in the graph legends.
A critical benefit of the variance-driven weight in the VWCI loss is its capability to maintain examples with high accuracy and high confidence. This is an important property for building real-world decision-making systems with confidence intervals, where decisions should be both highly accurate and confident. Figure 2 illustrates the portion of test examples that have higher accuracy and confidence than varying thresholds in ResNet-34, where VWCI presents better coverage than CI[Oracle] by effectively controlling the weights of the two loss terms based on the variance of multiple stochastic inferences. Note that the coverage of CI often depends significantly on the choice of $\beta$, as demonstrated in Figure 2 (right), while VWCI maintains higher coverage than CI using accurately calibrated prediction scores. These results imply that using predictive uncertainty to balance the loss terms is preferable to using a constant coefficient.
6 Conclusion
We presented a generic framework for uncertainty estimation of predictions in deep neural networks by calibrating accuracy and score based on stochastic inferences. Based on the Bayesian interpretation of stochastic regularization and our empirical observations, we claim that the variation of multiple stochastic inferences for a single example is a crucial factor for estimating the uncertainty of the average prediction. Motivated by this fact, we design the variance-weighted confidence-integrated loss to learn confidence-calibrated networks and enable uncertainty to be estimated with a single prediction. The proposed algorithm is also useful for understanding existing confidence calibration methods in a unified way, and we compared our algorithm with other variations within our framework to analyze their characteristics.
References
- Barber & Bishop (1998) D. Barber and Christopher Bishop. Ensemble learning for multi-layer networks. In NIPS, 1998.
- Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
- Graves (2011) Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Hoffman et al. (2013) Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
- Huang et al. (2016) Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
- Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pp. 2261–2269, 2017.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, pp. 6405–6416, 2017.
- Lee et al. (2018) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In ICLR, 2018.
- MacKay (1992) David J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
- McClure & Kriegeskorte (2016) Patrick McClure and Nikolaus Kriegeskorte. Representation of uncertainty in deep neural networks through sampling. CoRR, abs/1611.01639, 2016.
- Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In AAAI, 2015.
- Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, 1996. ISBN 0387947248.
- Niculescu-Mizil & Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, 2005.
- Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
- Platt (2000) John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, 2000.
- Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CVPR, 2016.
- Teye et al. (2018) Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. arXiv preprint arXiv:1802.06455, 2018.
- Wan et al. (2013) Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, and Rob Fergus. Regularization of neural networks using dropconnect. In ICML, 2013.
- Zadrozny & Elkan (2001) Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, 2001.
- Zadrozny & Elkan (2002) Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In KDD, 2002.
- Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
Appendix A Stochastic depth as approximate Bayesian inference
ResNet (He et al., 2016) adds skip connections to the network. If $H_\ell$ denotes the output of the $\ell$-th layer and $f_\ell$ represents a typical convolutional transformation, the forward propagation is
$$H_\ell = \mathrm{ReLU}\big(f_\ell(H_{\ell-1}) + H_{\ell-1}\big), \qquad (13)$$
and $f_\ell$ is commonly defined by
$$f_\ell(H_{\ell-1}) = B\Big(W_\ell^{(2)} * \sigma\big(B(W_\ell^{(1)} * H_{\ell-1})\big)\Big), \qquad (14)$$
where $W_\ell^{(1)}$ and $W_\ell^{(2)}$ are weight matrices, $*$ denotes convolution, and $B(\cdot)$ and $\sigma(\cdot)$ indicate batch normalization and the ReLU function, respectively.
ResNet with stochastic depth (Huang et al., 2016) randomly drops a subset of residual blocks and bypasses them with shortcut connections. Let $b_\ell$ denote a Bernoulli random variable which indicates whether the $\ell$-th residual block is active or not. The forward propagation is extended from Equation 13 to
$$H_\ell = \mathrm{ReLU}\big(b_\ell f_\ell(H_{\ell-1}) + H_{\ell-1}\big). \qquad (15)$$
Now we can transfer the stochasticity from the layers to the parameter space as follows:
$$b_\ell f_\ell(H_{\ell-1}) = f_\ell\big(H_{\ell-1};\, b_\ell W_\ell^{(1)},\, b_\ell W_\ell^{(2)}\big),$$
since $b_\ell$ is a binary Bernoulli random variable. All stochastic parameters in the block are therefore dropped at once or kept together.
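The equivalence between gating the block output and gating the weights can be verified numerically for binary gates. The sketch below uses fully connected weights and omits batch normalization for brevity, so it is an illustration of the identity rather than the actual residual block:

```python
import numpy as np

def block_output_noise(h, W1, W2, b):
    """Residual block with the Bernoulli gate b on the block output
    (batch normalization omitted for brevity)."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu(b * (W2 @ relu(W1 @ h)) + h)

def block_weight_noise(h, W1, W2, b):
    """The same block with the gate folded into both weight matrices;
    for binary b in {0, 1} the two formulations coincide."""
    relu = lambda z: np.maximum(z, 0.0)
    return relu((b * W2) @ relu((b * W1) @ h) + h)
```

Note that the identity relies on $b_\ell \in \{0, 1\}$: for fractional gate values the ReLU inside the block would break the equivalence.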
Appendix B Approximation of KL-Divergence
Let $p(x) = \mathcal{N}(x; 0, I)$ and $q(x) = \sum_{i} \pi_i\, \mathcal{N}(x; \mu_i, \sigma^2 I)$ with a probability vector $\pi$, where $\pi_i \ge 0$, $\sum_i \pi_i = 1$, and $x \in \mathbb{R}^L$. In our work, the $\mu_i$'s correspond to the deterministic model parameter masked by the binary noise, and $\sigma$ is small. The KL-divergence between $q$ and $p$ is
$$\mathrm{KL}\big(q(x)\,\|\,p(x)\big) = \int q(x) \log \frac{q(x)}{p(x)}\, dx = -\mathcal{H}\big(q(x)\big) - \int q(x) \log p(x)\, dx. \qquad (23)$$
We can re-parameterize the first entropy term with $x = \mu_i + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. Using the fact that the mixture components barely overlap for large enough $\sigma^{-1}\|\mu_i - \mu_j\|$, the entropy is approximated by
$$\mathcal{H}\big(q(x)\big) \approx \sum_i \pi_i\, \frac{L}{2}\big(\log 2\pi\sigma^2 + 1\big).$$
For the second term of Equation 23,
$$-\int q(x) \log p(x)\, dx = \frac{1}{2} \sum_i \pi_i \big(\mu_i^\top \mu_i + L\sigma^2\big) + \frac{L}{2} \log 2\pi.$$
Then we can approximate
$$\mathrm{KL}\big(q(x)\,\|\,p(x)\big) \approx \frac{1}{2} \sum_i \pi_i\, \mu_i^\top \mu_i + \text{const},$$
which corresponds to the $\ell_2$ regularization on the deterministic parameters in Equation 6.
For a more general proof, see Gal & Ghahramani (2016).
Appendix C Full Experimental Results