A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification

07/04/2021 ∙ by Geir K. Nilsen, et al. ∙ 0

We validate the recently introduced adaptation of the Delta method to deep learning classification by comparing it with the classical Bootstrap. We show that there is a strong linear relationship between the quantified predictive epistemic uncertainty levels obtained from the two methods when applied to two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets. Furthermore, we demonstrate that the Delta method offers a fivefold reduction in computation time compared to the Bootstrap.

1 Introduction

It can be beneficial to distinguish between epistemic and aleatoric uncertainty in machine learning models [5]. Bayesian statistics provides a coherent framework for representing epistemic uncertainty in neural networks [9], but has not yet gained widespread use in deep learning [3], presumably because of the high computational cost that traditionally comes with Fisher information based methods. In particular, the Delta method [4, 6] depends on the empirical Fisher information matrix, whose size grows quadratically with the number of neural network parameters, so its direct application in modern deep learning is prohibitively expensive. To mitigate this, [11] proposed a low-cost variant of the Delta method, applicable to $L_2$-regularized deep neural networks, based on the top $K$ eigenpairs of the Fisher information matrix.

In this paper, we validate the methodology introduced in [11] by a comparison with the classical Bootstrap [2, 6, 8, 13, 12]. We show that there is a strong linear relationship between the quantified epistemic uncertainty levels obtained from the two methods when applied to two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets.

The paper is organized as follows: in Section 2 we review the Bootstrap and the Delta method in a deep learning classification context. In Section 3 we introduce two LeNet-based classifiers which will be used in the comparison in Section 4, and finally, in Section 5 we summarize the paper and give some concluding remarks.

2 Introduction to the Methodologies

In the following, we consider a training set of $N$ examples, a test set, and an arbitrary input example $x_0$. The parameters are collected in the vector $\omega \in \mathbb{R}^P$, where $P$ is the number of parameters (weights and biases) in the model. The parameter values after training are denoted by the vector $\hat{\omega}$. A prediction for $x_0$ is denoted by $\hat{y}_0 = f(x_0, \hat{\omega})$, where $f$ is a deep neural network model function [3] with one output per class. Finally, it is assumed that the cost function, denoted by $C(\omega)$, is $L_2$-regularized with a regularization-rate factor $\lambda$.

2.1 The Bootstrap in Deep Learning Classification

In the context of deep learning classification, the classical Bootstrap method starts by creating $B$ datasets from the original dataset by sampling with replacement. Subsequently, $B$ networks are trained separately, one on each of the bootstrapped datasets. The epistemic uncertainty (in standard deviations) of each class prediction for $x_0$ is obtained as the sample standard deviation over the ensemble of $B$ predictions,

$$\sigma_{\text{boot}}(x_0) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{y}_b(x_0) - \bar{y}(x_0) \right)^2} \qquad (1)$$

where the square and the square root are taken elementwise, the vector $\hat{y}_b(x_0)$ represents the predictions for $x_0$ (one probability per class) obtained from the $b$th bootstrapped network, and $\bar{y}(x_0)$ is the sample mean,

$$\bar{y}(x_0) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x_0) \qquad (2)$$

The method is easy to implement efficiently in practice. Training $B$ networks is an ‘embarrassingly’ parallel problem, and the space complexity for the bootstrapped datasets is just $O(BN)$, for $N$ training examples, when an indexing scheme is used for the sampling with replacement. The experiments conducted in this paper are based on the example pydeepboot.py provided in the pydeepdelta repository [14].
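The index-based resampling and the ensemble statistics in (1) and (2) can be sketched as follows. This is a minimal NumPy illustration, not the pydeepboot.py implementation: the function names, the toy sizes and the toy probabilities are all assumptions made for the example.

```python
import numpy as np


def bootstrap_indices(n_examples, n_replicates, seed=0):
    """Draw one row of example indices per replicate, sampling with replacement."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_examples, size=(n_replicates, n_examples))


def bootstrap_std(ensemble_probs):
    """Per-class sample standard deviation over an ensemble of predictions.

    ensemble_probs has shape (B, M): one row of class probabilities per network.
    """
    probs = np.asarray(ensemble_probs, dtype=float)
    mean = probs.mean(axis=0)                            # sample mean, as in (2)
    squared_dev = ((probs - mean) ** 2).sum(axis=0)
    return np.sqrt(squared_dev / (probs.shape[0] - 1))   # as in (1)


# Index arrays for 4 replicates of a 10-example dataset (toy sizes).
idx = bootstrap_indices(n_examples=10, n_replicates=4)

# Toy predictions for one input from 3 hypothetical networks and 3 classes.
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.8, 0.1, 0.1]])
sigma_boot = bootstrap_std(toy_probs)
```

Storing only the index array, rather than copies of the data, is what keeps the memory footprint proportional to the number of replicates times the number of examples.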

2.2 The Delta Method in Deep Learning Classification

The Delta method was adapted to the deep learning classification context by [11]. The adaptation addresses several fundamental difficulties that arise when the method is applied in deep learning. In essence, it is shown that an approximation of the eigendecomposition of the Fisher information matrix utilizing only $K$ eigenpairs allows for an efficient implementation with bounded worst-case approximation errors. We briefly review the standard method here for convenience.

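An approximation of the epistemic component of the uncertainty associated with the prediction of $x_0$ can be found by the formula

$$\sigma_{\text{delta}}(x_0) \approx \sqrt{\operatorname{diag}\left( F \, \hat{\Sigma} \, F^{T} \right)} \qquad (3)$$

where the sensitivity matrix $F$ in (3) is defined as

$$F = \left. \frac{\partial f(x_0, \omega)}{\partial \omega} \right|_{\omega = \hat{\omega}} \qquad (4)$$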

The covariance matrix in (3) can be estimated by several alternative estimators. In [11] it was demonstrated that the Hessian estimator, the Outer-Products of Gradients (OPG) estimator and the Sandwich estimator lead to nearly perfectly correlated results for two different deep learning models. Since the models discussed in this paper are identical to those in [11], we focus on only one of the estimators, namely the OPG estimator defined by

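$$\hat{\Sigma}^{\text{OPG}} = \frac{1}{N} \left[ \frac{1}{N} \sum_{n=1}^{N} \nabla_{\omega} C_n(\hat{\omega}) \, \nabla_{\omega} C_n(\hat{\omega})^{T} + \lambda I \right]^{-1} \qquad (5)$$

where $C_n$ is the cost contribution of the $n$th training example, and the summation part of (5) corresponds to the empirical covariance of the gradients of the cost function evaluated at $\hat{\omega}$. As discussed in [11], the term $\lambda I$ is explicitly added in order to make the OPG estimator asymptotically equal to the Hessian estimator, which is the primary motivation for using the former as a plug-in replacement for the latter in the first place.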
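To make the roles of the sensitivity matrix and the OPG covariance estimator concrete, the following NumPy sketch evaluates sqrt(diag(F Σ F^T)) for a tiny softmax-linear model. It rests on several assumptions that are not taken from the paper: the OPG estimator is written as (1/N) [ (1/N) Σ_n g_n g_n^T + λ I ]^{-1} for per-example gradients g_n, the model is a single linear layer followed by softmax, and the weights and gradients are random stand-ins for a trained network. All function names are hypothetical.

```python
import numpy as np


def opg_covariance(per_example_grads, lam):
    """(1/N) * [ (1/N) * sum_n g_n g_n^T + lam * I ]^{-1} for gradients g_n."""
    grads = np.asarray(per_example_grads, dtype=float)   # shape (N, P)
    n, p = grads.shape
    outer_mean = grads.T @ grads / n
    return np.linalg.inv(outer_mean + lam * np.eye(p)) / n


def softmax_linear_jacobian(weights, x):
    """Sensitivity matrix of softmax(weights @ x) w.r.t. the flattened weights."""
    logits = weights @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    dprobs_dlogits = np.diag(probs) - np.outer(probs, probs)   # (M, M)
    return np.kron(dprobs_dlogits, x[None, :])                 # (M, M * D)


def delta_std(sensitivity, covariance):
    """Per-class standard deviation sqrt(diag(F Sigma F^T))."""
    variances = np.einsum("ip,pq,iq->i", sensitivity, covariance, sensitivity)
    return np.sqrt(variances)


rng = np.random.default_rng(0)
n_classes, n_inputs, n_train, lam = 3, 4, 50, 1e-2
weights = rng.normal(size=(n_classes, n_inputs))           # stand-in for trained weights
grads = rng.normal(size=(n_train, n_classes * n_inputs))   # stand-in per-example gradients
x0 = rng.normal(size=n_inputs)

F = softmax_linear_jacobian(weights, x0)
sigma = opg_covariance(grads, lam)
sigma_delta = delta_std(F, sigma)
```

Forming and inverting the full P-by-P matrix as done here is only feasible for toy sizes, which is precisely the cost that the eigenpair-based approximation of [11] is designed to avoid.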

When the Delta method is implemented under the framework of [11], it has several desirable properties: a) requires only space and time, b) fits well with deep learning software frameworks based on automatic differentiation, c) works with any $L_2$-regularized neural network architecture, and d) does not interfere with the training process as long as the norm of the gradient of the cost function is approximately equal to zero after training.
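One simple way to see how a handful of eigenpairs can stand in for the full matrix is sketched below. The sketch assumes the covariance has the form (1/N) [ G + λ I ]^{-1}, with G the mean outer product of per-example gradients, and it treats every eigenvalue of G outside the top K as zero. That truncation rule is an assumption of this example and not necessarily the treatment in [11]; the explicit construction of G is for the toy comparison only, and all names are hypothetical.

```python
import numpy as np


def top_k_eigenpairs(matrix, k):
    """Largest k eigenvalues and eigenvectors of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)      # ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]


def delta_variance_top_k(sensitivity, eigvals, eigvecs, lam, n_train):
    """diag(F Sigma F^T) using only K eigenpairs of the gradient outer-product mean.

    Eigenvalues outside the top K are treated as zero (an assumption of this sketch).
    """
    projected = sensitivity @ eigvecs                        # (M, K)
    baseline = (sensitivity ** 2).sum(axis=1) / lam          # all eigenvalues set to zero
    correction = (projected ** 2) @ (1.0 / (eigvals + lam) - 1.0 / lam)
    return (baseline + correction) / n_train


rng = np.random.default_rng(1)
n_train, n_params, n_classes, lam = 5, 12, 3, 1e-2
grads = rng.normal(size=(n_train, n_params))
F = rng.normal(size=(n_classes, n_params))
outer_mean = grads.T @ grads / n_train                       # rank <= n_train

# Exact reference using the full inverse (toy sizes only).
exact_cov = np.linalg.inv(outer_mean + lam * np.eye(n_params)) / n_train
exact = np.einsum("ip,pq,iq->i", F, exact_cov, F)

vals5, vecs5 = top_k_eigenpairs(outer_mean, 5)
vals2, vecs2 = top_k_eigenpairs(outer_mean, 2)
approx_full_rank = delta_variance_top_k(F, vals5, vecs5, lam, n_train)
approx_k2 = delta_variance_top_k(F, vals2, vecs2, lam, n_train)
```

In this toy setting the matrix G has rank at most 5, so five eigenpairs reproduce the exact variances, while two eigenpairs give an upper bound on them. Only the K eigenvectors and the sensitivity matrix need to be stored.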

3 The Neural Network Classifiers

We deploy two LeNet-based neural network architectures that differ only in the number of neurons in two of the layers, in order to individually match the formats of the MNIST and CIFAR-10 datasets. Our TensorFlow code for the Delta method is based on the pydeepdelta Python module [14], and is fully deterministic [10]. The corresponding Bootstrap implementation can be found in the same repository.

3.1 MNIST

There are layers, layer is the input layer represented by the input vector. Layer is a convolutional layer followed by max pooling with stride equal to and with a ReLU activation function. Layer is a convolutional layer followed by max pooling with a stride equal to , and with ReLU activation function. Layer is a convolutional layer with ReLU activation function. Layer is a dense layer with ReLU activation function, and the output layer is a dense layer with softmax activation function, where the number of classes (outputs) is . The total number of parameters is .

3.2 CIFAR-10

There are layers, layer is the input layer represented by the input vector. Layer is a convolutional layer followed by max pooling with stride equal to and with a ReLU activation function. Layer is a convolutional layer followed by max pooling with a stride equal to , and with ReLU activation function. Layer is a convolutional layer with ReLU activation function. Layer is a dense layer with ReLU activation function, and the output layer is a dense layer with softmax activation function, where the number of classes (outputs) is . The total number of parameters is .

3.3 Training Details

For the Bootstrap networks, we test two different weight initialization variants: dynamic random normal weight initialization (DRWI) and static random normal weight initialization (SRWI). The former uses a different (i.e. dynamic) seed across the replicates, meaning that each network in the DRWI Bootstrap ensemble will start out with different random weight values. The latter uses the same (i.e. static) seed across the replicates, and hence all the networks in the SRWI Bootstrap ensemble receive the same random initial weight values. For all networks, we use zero bias initialization. Furthermore, to investigate the impact of random weight initialization on the Delta method, we apply the Delta method 16 times on a set of 16 networks distinguished only by DRWI.
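The difference between the two initialization variants comes down to how seeds are assigned to the replicates. The following sketch illustrates this with NumPy; the seed arithmetic, the function names and the standard deviation of the initial weights are hypothetical choices for the example and are not taken from pydeepdelta.

```python
import numpy as np


def replicate_seeds(n_replicates, variant, base_seed=0):
    """Seeds for the ensemble: one per replicate (DRWI) or a shared one (SRWI)."""
    if variant == "DRWI":
        return [base_seed + b for b in range(n_replicates)]
    if variant == "SRWI":
        return [base_seed] * n_replicates
    raise ValueError(f"unknown variant: {variant!r}")


def initial_weights(shape, seed, stddev=0.05):
    """Random normal initial weights; stddev is a hypothetical value."""
    return np.random.default_rng(seed).normal(0.0, stddev, size=shape)


# Three replicates of a toy 2-by-3 weight matrix under each variant.
drwi = [initial_weights((2, 3), s) for s in replicate_seeds(3, "DRWI")]
srwi = [initial_weights((2, 3), s) for s in replicate_seeds(3, "SRWI")]
```

Under SRWI the replicates differ only through their bootstrapped datasets, whereas under DRWI they differ through both the data and the starting point.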

We use the cross-entropy cost function with an $L_2$-regularization rate , and utilize the Adam [7, 1] optimizer with a batch size of , and no form of randomized data shuffling. To ensure convergence (i.e. a norm of the gradient of the cost function approximately equal to zero), we apply two slightly different learning rate schedules given by the following (step, rate) pairs: MNIST = and CIFAR-10 = . For MNIST, we stop training after steps, while for CIFAR-10 we stop after steps, corresponding to the overall training statistics shown in Table 1.
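A schedule given as (step, rate) pairs can be read as a piecewise-constant function of the training step. The sketch below assumes that interpretation, namely that each rate applies from its step until the next pair takes over. The schedule values are hypothetical and are not the ones used for MNIST or CIFAR-10.

```python
from bisect import bisect_right


def learning_rate(step, schedule):
    """Rate of the last (start_step, rate) pair whose start_step <= step."""
    starts = [start for start, _ in schedule]
    position = bisect_right(starts, step) - 1
    if position < 0:
        raise ValueError("step precedes the first schedule entry")
    return schedule[position][1]


# Hypothetical values for illustration only; not the schedules used in the paper.
toy_schedule = [(0, 1e-3), (1000, 1e-4), (5000, 1e-5)]
rate_at_start = learning_rate(0, toy_schedule)
rate_late = learning_rate(6000, toy_schedule)
```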

Networks Dataset Training Set Accuracy Test Set Accuracy
DRWI Bootstrap B=100 MNIST
CIFAR-10
SRWI Bootstrap B=100 MNIST
CIFAR-10
Delta 16 reps (DRWI) MNIST
CIFAR-10
Table 1: Training statistics for the Delta and Bootstrap networks. The DRWI and SRWI Bootstrap ensembles each consist of $B = 100$ bootstrapped networks, while the Delta method is applied repeatedly on 16 networks distinguished only by DRWI. Averages ± two standard deviations are calculated across the networks for the Bootstrap, and across the 16 repetitions for the Delta method.

4 Comparison

The basic comparison design entails a set of 16 linear regressions on the predictive uncertainty estimates obtained from the two methods using test sets as input data

(6)

Accounting for the two variants of the Bootstrap (SRWI/DRWI), this leads to two sets of squared correlation coefficients, intercepts, slopes and Delta method approximation errors, respectively denoted by . Furthermore, as we wish to analyze the impact of the number of Bootstrap replicates and the number of Delta method eigenpairs, we generate these sets for various and . An outline of the setup is shown in Figure 1.

Figure 1: Regression (6) of onto .
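Each regression reduces to an ordinary least-squares line fit together with a squared correlation coefficient. The sketch below shows one such fit with NumPy. Regressing the Delta method estimates onto the Bootstrap estimates is an assumption about the direction of the regression, and the data are toy values, not results from the experiments.

```python
import numpy as np


def regress(y, x):
    """Least-squares fit y = intercept + slope * x, plus squared correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return r_squared, intercept, slope


# Toy uncertainty levels (standard deviations); not results from the paper.
sigma_boot = np.array([0.01, 0.02, 0.05, 0.10, 0.20])
sigma_delta = 1.2 * sigma_boot
r_squared, intercept, slope = regress(sigma_delta, sigma_boot)
```

In this toy case the two sets of estimates are perfectly correlated but differ in absolute level, which shows up as a slope different from one with a zero intercept.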

Figure 2 shows scatter plots of the regression results for the first repetition () of the Delta method against the DRWI Bootstrap ensemble. These plots are based on bootstrap replicates, and we have selected eigenpairs for MNIST and eigenpairs for CIFAR-10. Clearly, there is a strong linear relationship between the two methods: the squared correlation coefficients are for MNIST and for CIFAR-10. On the other hand, the absolute uncertainty level differs between the methods and datasets. This can be seen from the slope coefficients, where the Delta method is overestimating () on MNIST, and underestimating () on CIFAR-10. Further, since the estimated intercepts () are zero, there are no offsets between the methods. Finally, we see that the maximum across examples and class outputs of the Delta method approximation errors () is zero, so there is nothing to be achieved by increasing $K$. As we will see later, $K$ has here been selected unnecessarily high and can be significantly reduced with no loss of accuracy.

(a) MNIST
(b) CIFAR-10
Figure 2: Predictive uncertainty estimates obtained from the Delta method (first repetition, ) against the DRWI Bootstrap for (a) MNIST using replicates and eigenpairs, and (b) CIFAR-10 using replicates and eigenpairs.

4.1 Discussion of the Regression Results as a Function of $K$ and $B$

The results from the full set of regressions () holding a fixed $B$ are shown in Figure 3. The primary observations are as follows: The mean squared correlation coefficients are generally high for MNIST and CIFAR-10, meaning that there is a strong linear relationship between the uncertainty levels obtained by the Bootstrap and the Delta method. For the lowest $K$, the squared correlation coefficient starts out at 90% for MNIST, and at 81% for CIFAR-10. As $K$ grows, an increase of only % is observed for MNIST, and of 8% for CIFAR-10. The major difference observed as $K$ increases lies in the absolute uncertainty levels expressed by the slope: for MNIST, the slope stabilizes at around while at about for CIFAR-10. The same trend is reflected in the maximum approximation errors, which approach zero at the same values of $K$. Although not shown in the plots, the regression intercepts are always zero, meaning that there is no offset between the uncertainty estimates of the two methods.

(a) MNIST
Delta vs. SRWI Bootstrap
(b) MNIST
Delta vs. DRWI Bootstrap
(c) CIFAR-10
Delta vs. SRWI Bootstrap
(d) CIFAR-10
Delta vs. DRWI Bootstrap
Figure 3: Summaries of the regressions given by (6), for different values of $K$ and a fixed $B$. The solid lines and the associated confidence intervals represent the mean and the variation of the regression results across the 16 repetitions of the Delta method.

The main difference found from applying DRWI as opposed to SRWI for the Bootstrap ensembles is that the absolute level of uncertainty increases with DRWI. This is expected, since the DRWI version of the Bootstrap will be more prone to reaching different local minima, and therefore also captures this additional variance. Supporting evidence for this hypothesis is provided by CIFAR-10’s wider confidence intervals. A more pronounced geometry difference across various local minima will ultimately lead to higher variability in the and . A slightly higher mean (+1-2%) is also observed for the DRWI version of the Bootstrap. This is reasonable given that the Delta method networks are also more prone to reaching different local minima across the 16 repetitions because of DRWI.

Figure 4 shows the same type of comparison when the number of Bootstrap replicates varies, and the number of eigenpairs is fixed ( for MNIST and for CIFAR-10). The main observation from this experiment is that there is very little to achieve by selecting a larger ensemble size than about 50, as this is the point where the mean slope and squared correlation coefficient stabilize.

(a) MNIST
Delta vs. SRWI Bootstrap
(b) MNIST
Delta vs. DRWI Bootstrap
(c) CIFAR-10
Delta vs. SRWI Bootstrap
(d) CIFAR-10
Delta vs. DRWI Bootstrap
Figure 4: Summaries of the regressions given by (6), for different values of $B$ and a fixed number of eigenpairs $K$. The solid lines and the associated confidence intervals represent the mean and the variation of the regression results across the 16 repetitions of the Delta method.

4.2 Computation Time

Table 2 shows the computation time for the two methods when executed on an Nvidia RTX 2080 Ti based GPU. For MNIST, the smallest $K$ leading to acceptable approximation errors and stable absolute uncertainty levels for the Delta method is $K = 600$, while for CIFAR-10 the same applies at $K = 1000$. Furthermore, the smallest acceptable $B$ leading to stable correlation and absolute uncertainty levels for the Bootstrap is $B = 50$. We conclude that in these experiments the Delta method outperforms the Bootstrap in terms of computation time by a factor of about 4.6 on MNIST, and a factor of about 5.8 on CIFAR-10, as given by the ratios of the totals in Table 2.

Method     Classifier  B    K     Initial Phase  Prediction Phase [mm:ss]  Total
                                  [h:mm:ss]      Training Set  Test Set    [h:mm:ss]
Bootstrap  MNIST       50   N/A   4:08:28        00:19         00:03       4:08:50
Bootstrap  CIFAR-10    50   N/A   7:37:16        00:40         00:07       7:38:04
Delta      MNIST       N/A  600   0:42:33        09:52         01:37       0:54:02
Delta      CIFAR-10    N/A  1000  1:00:54        14:44         02:56       1:18:35
Table 2: Computation time for the Bootstrap and Delta method. For the Bootstrap, the ‘initial phase’ accounts for the parallelized training of $B$ networks, while the ‘prediction phase’ accounts for the predictive epistemic uncertainty estimation (1), which is further divided into the training and test sets. For the Delta method, the ‘initial phase’ accounts for the approximate eigendecomposition of the covariance matrix (5), while the ‘prediction phase’ accounts for the predictive epistemic uncertainty estimation (3), further divided into the training and test sets.
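The speed-up implied by Table 2 can be checked by converting the totals to seconds and taking ratios. The short sketch below does this using only the totals listed in the table; the helper name is hypothetical.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# (Bootstrap total, Delta total) per classifier, copied from Table 2.
totals = {
    "MNIST": ("4:08:50", "0:54:02"),
    "CIFAR-10": ("7:38:04", "1:18:35"),
}

speedup = {
    name: to_seconds(bootstrap) / to_seconds(delta)
    for name, (bootstrap, delta) in totals.items()
}
```

Rounded to one decimal, the ratios are 4.6 for MNIST and 5.8 for CIFAR-10, in line with the roughly fivefold reduction stated in the abstract.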

5 Concluding Remarks

We have shown that there is a strong linear relationship between the predictive epistemic uncertainty estimates obtained by the Bootstrap and the Delta method when applied to two different deep learning classification models. Firstly, we find that the number of eigenpairs $K$ in the Delta method can be selected orders of magnitude lower than the number of model parameters with no loss of correspondence between the methods. This coincides with the fact that when the Delta method approximation errors are sufficiently close to zero, there is nothing to achieve by a further increase in $K$, and therefore the correspondence will stabilize at this point.

Secondly, we find that the DRWI version of the Bootstrap yields the best correspondence, and that there is little to achieve by using more than about 50 replicates. Thirdly, we observe that the most complex model (CIFAR-10) yields a high variability in the correspondence across multiple DRWI Delta method runs. We interpret this effect as being caused by multi-modality of the cost function, with the Delta method failing to capture the additional variance tied to reaching local minima of different geometric characteristics. Finally, in our experiments we have seen that the Delta method outperforms the Bootstrap in terms of computation time by a factor of about 4.6 on MNIST and about 5.8 on CIFAR-10, as given by the ratios of the totals in Table 2.

References