1. Introduction
Deep neural networks (DNNs) have achieved impressive performance and gained wide adoption in a variety of tasks in recent years. However, it is known that, given the same model configuration and training dataset, DNNs can give different predictions for the same input example when slight differences are introduced in training, such as in model parameters or data ordering (Lakshminarayanan et al., 2017; Ovadia et al., 2019; Chen et al., 2021; Wen et al., 2020b; D’Amour et al., 2020; Summers and Dinneen, 2021). As a result, it has become increasingly important to study prediction variation in neural network models (Jiang et al., 2018; Schulam and Saria, 2019). We use prediction variation (PV) to describe the variation of the prediction results on a given input example.
Ensemble and dropout are two popular methods for prediction variation estimation (Lakshminarayanan et al., 2017; Gal and Ghahramani, 2016), but both are expensive to deploy in practice. The ensemble approach (Lakshminarayanan et al., 2017) trains and serves multiple copies of the same model, which normally differ in parameter initialization or training data order. These copies, or ensemble members, produce a set of predictions for a given example, and these predictions can be used to compute the prediction variation. Unlike the ensemble approach, the dropout approach (Gal and Ghahramani, 2016; Laptev et al., 2017) trains one copy of the model using a given dropout rate. At inference time, the dropout approach requires running inference with the dropout rate multiple times to obtain a collection of predictions for prediction variation estimation. This step is expensive, as it requires multiple predictions at inference time to calculate the dropout prediction variation (DPV) for a given example.
Many researchers have worked on reducing the cost of estimating ensemble-based prediction variation (Valdenegro-Toro, 2019; Malinin et al., 2019; Achrack et al., 2020; Wen et al., 2020a; Chen et al., 2021). For example, (Chen et al., 2021) demonstrates that it is possible to use neuron activation strength features to infer ensemble prediction variation as an auxiliary task. Neuron activation strength was used to represent a DNN’s neuron output strength, e.g., the neuron output after activation. However, as far as we know, little work has been done to reduce the cost of estimating dropout prediction variation.
Our Goal — In this paper we study the problem of estimating dropout prediction variation in a cheaper way using neuron activation strength. We focus on reducing the serving cost for dropout prediction variation estimation for the following reasons.
First, the dropout approach is a cheaper and more practical solution to prediction variation estimation than the ensemble approach in real-world applications. While (Chen et al., 2021) demonstrates it is possible to deploy the ensemble prediction variation estimation task as an auxiliary task, it still requires training ensembles of multiple model copies to generate the PV labels, in order to train the auxiliary task. On the other hand, the dropout approach simply trains one copy of the model using dropout, and thus requires fewer training resources and less memory and storage. As a result, it is cheaper to obtain the PV labels using the dropout approach to train the auxiliary task.
Second, dropout prediction variation has a wide variety of real-world applications. For example, (Wen and Tadmor, 2020) uses DPV to detect untrustworthy configurations for molecular simulations to aid quantitative design of materials and devices. (Gal et al., 2017b) uses DPV for active learning applications, as active learning methods generally rely on an uncertainty score to learn and update models from small amounts of data. (Laves et al., 2019) uses DPV to improve computer-aided diagnoses and their robustness for patient safety.
Finally, there are many different ways to construct ensembles, and dropout is generally viewed as one type of ensemble (more discussion in Section 5.4). Therefore, despite their differences, both ensemble and dropout methods are used in a variety of downstream applications for PV estimation. In this paper we do not focus on studying the difference between PV estimated by ensembles and by dropout. Instead, we aim to provide a more cost-efficient way to estimate DPV given a successful downstream application using DPV.
Challenges — We face the following challenges:
Various tasks — Dropout prediction variation has a wide range of reallife application scenarios. Thus, we attempt to study the DPV estimation problem on a variety of tasks.
DPV estimation — There are many different ways to set up dropout prediction variation estimation, such as using different dropout rates and choosing whether the application or target task is trained using dropout. We attempt to study the DPV estimation problem in all these different scenarios.
Activation strength features — It is demonstrated that using all the neuron activation strength features can infer ensemble PV (Chen et al., 2021). However, using all the activation features might be too expensive or infeasible in practice especially when the target task has a large neural network (He et al., 2016; Huang et al., 2017). As a result, we want to study whether it is possible to use a small subset of all the activation features to estimate DPV.
Our Approach — In this paper, we demonstrate that we are able to use neuron activation strength to infer dropout prediction variation on a variety of tasks and under different dropout settings. Our approach provides a more resource-friendly way to estimate dropout prediction variation. As far as we know, we are the first to use neuron activation strength to infer dropout prediction variation, and the results are shown using three large datasets: MovieLens, Criteo, and EMNIST.
First, we set up the dropout prediction variation estimation framework using neuron activation strength on three datasets. The framework consists of two components: a) the target task represents the original prediction task; b) the variation estimation task, which takes in neuron activation strength features and produces the estimated dropout prediction variation given an example. We consider a variety of target tasks: the movie rating prediction task as a regression task or a multiclass classification task on the recommender system benchmark dataset MovieLens; Ads clickthrough rate prediction task as a binary classification task using Criteo; and image recognition task as a multiclass classification task using the benchmark digit recognition dataset EMNIST.
Second, we describe the dropout prediction variation estimation task in detail for two configurations by considering whether we train the target task with dropout. We discuss how to collect input features and labels for these two configurations. We further discuss the training and serving setup of the variation estimation model. Moreover, we propose additional resource savings by using activation features from a subset of layers for variation estimation.
Lastly, we describe in detail the empirical experiments conducted and demonstrate our findings on the above three large benchmark datasets. We show that neuron activation strength features can be used to estimate dropout prediction variation on a variety of tasks and under different dropout settings. In almost all the settings, the estimated DPVs using neuron activation strength and the DPV labels show strong correlations. In particular, our approach is good at identifying whether an example belongs to the top and bottom buckets of the dropout prediction variation, with an accuracy of 0.8 and even 0.9 in many settings. Moreover, we demonstrate that using a subset of the activation strength features from the target task can achieve variation estimation performance comparable to using all the activation features.
Contributions — Our contributions include:

Introduce the neuron activation strength based framework to estimate dropout prediction variation (Section 3).

Propose two configuration setups for the dropout prediction variation estimation task by considering whether the target task is trained with dropout (Section 4).

Demonstrate that our neuron activation strength based DPV estimation approach provides a resource-efficient way to estimate DPV on three large public datasets. Moreover, demonstrate that using a fraction of the neuron activations as the DPV estimation features can achieve performance comparable to using all the neuron activations (Section 5).
2. Related Work
Uncertainty estimation (Gawlikowski et al., 2021) is an important topic in machine learning and has a wide range of applications (Yang et al., 2015; Kahn et al., 2017; Feng et al., 2018; Laves et al., 2019; Atencia et al., 2020; Wen and Tadmor, 2020). For example, in (Yang et al., 2015), uncertainty sampling was employed to increase the diversity of selected training data. (Kahn et al., 2017) incorporated uncertainty into the cost function for reinforcement learning and reduced the number of the robot’s dangerous collisions. (Eaton-Rosen et al., 2018) examined uncertainty estimation for brain tumor segmentation to calibrate volume estimates. One important type of uncertainty is due to model prediction variations or model prediction disagreements (Eaton-Rosen et al., 2018; Fort et al., 2019; Gal and Ghahramani, 2016). We focus on this type of uncertainty in this work.
A large volume of uncertainty estimation studies fall under the Bayesian umbrella (Paisley et al., 2012; Hoffman et al., 2013; Kingma and Welling, 2013; Blundell et al., 2015). Direct Bayesian inference is usually impossible as posterior distributions over model parameters are generally intractable. Some classical Bayesian approaches, such as Markov Chain Monte Carlo, rely on sampling and generally do not scale well with large DNNs (Papamarkou et al., 2019). Therefore, many researchers turn to approximate Bayesian inference methods, such as variational inference (Paisley et al., 2012; Hoffman et al., 2013; Kingma and Welling, 2013; Blundell et al., 2015). For example, Bayes by Backprop (Blundell et al., 2015) proposed to learn model weight distributions through variational Bayes.
Ensemble is another class of benchmark methods for uncertainty estimation (Lakshminarayanan et al., 2017; Valdenegro-Toro, 2019; Wen et al., 2020a) and can be conceptually viewed as Bayesian as well (Wilson and Izmailov, 2020). Ensemble estimates uncertainty based on predictions of multiple copies of the trained neural networks. Although its effectiveness has been demonstrated, the complexity of deploying and serving ensemble models makes it less practical. Researchers have proposed various methods to reduce the computational cost (Valdenegro-Toro, 2019; Wen et al., 2020a; Mariet et al., 2020; Chen et al., 2021; Havasi et al., 2020). For example, (Valdenegro-Toro, 2019) proposed to ensemble only the last several layers to approximate a deep model ensemble, and (Wen et al., 2020a) proposed BatchEnsemble to reduce ensemble complexity by sharing weights among ensemble members. The most relevant work to our paper is (Chen et al., 2021). In particular, the authors proposed to use activation features to estimate ensemble prediction variation, which can be deployed as an auxiliary task at inference time to save serving cost. However, the auxiliary task training procedure is still expensive: ensemble models are still required for training in order to provide labels for the auxiliary task to learn.
Compared to the ensemble method, uncertainty estimation using dropout is less resource intensive as it does not require training multiple ensemble members. There are abundant studies (Gal and Ghahramani, 2016; Gal et al., 2017a; Kendall et al., 2015) on dropout uncertainty estimation and its applications. For example, (Gal and Ghahramani, 2016) proposed to use Monte Carlo (MC) dropout to estimate model uncertainty and established the connection to Bayesian inference. Further, they extended dropout to work with convolutions (Gal and Ghahramani, 2015) based on the Bayesian connection. Concrete dropout (Gal et al., 2017a) was introduced to automatically tune the dropout probability in large neural networks, avoiding a costly search for the optimal dropout hyperparameters. (Mukhoti and Gal, 2018) evaluated MC-dropout and Concrete dropout for uncertainty estimation on semantic segmentation with proposed metrics. (Mobiny et al., 2021) studied MC-DropConnect and MC-Dropout as Bayesian methods in classification and segmentation settings with new uncertainty evaluation metrics. Further, (Kendall et al., 2015) presented Bayesian SegNet with MC-Dropout to produce pixel-wise uncertainty estimates in image semantic segmentation tasks. In this work, we focus on further reducing the cost of estimating dropout prediction variation. We propose using neuron activation strength to estimate dropout prediction variation, which can significantly save serving resources.
3. Variation Estimation Framework
Similar to the setup in (Chen et al., 2021) and as shown in Figure 1, the prediction variation estimation framework consists of two components: the target task and the variation estimation task. In this section, we introduce the datasets, and the setup of the two components.
3.1. Datasets
We use three datasets throughout the paper:

MovieLens (Harper and Konstan, 2015). The MovieLens 1M dataset (http://files.grouplens.org/datasets/movielens/ml1mREADME.txt) is a benchmark dataset for recommender systems evaluation. It features the task of using user and movie related features to predict movie ratings. This dataset contains 1M movie ratings from 6000 users on 4000 movies.

Criteo. The Criteo Display Advertising challenge (https://www.kaggle.com/c/criteodisplayadchallenge) features a binary classification task to predict Ads click-through rate: the label of a clicked event is 1; otherwise 0. The Criteo data consists of around 40M examples with 13 numerical and 26 categorical features.
3.2. Target Tasks
As shown in Figure 1, the target task represents the original prediction problem, such as the rating prediction task on MovieLens, the clickthrough prediction task on Criteo, and the digit recognition task on EMNIST. For MovieLens and Criteo, the target task contains an embedding layer on top of the input features, while for EMNIST, the embedding layer is not necessary.
We define four target tasks on MovieLens, Criteo, and EMNIST, as follows. Note that all the target tasks use Batch Normalization (Ioffe and Szegedy, 2015), and use ReLU as the activation function. We choose larger layer sizes for fully connected layers in the target tasks than (Chen et al., 2021), to make it easier to apply dropout with different dropout rates on the neurons. We use a batch size of 1024 on the MovieLens and Criteo tasks (Chen et al., 2021), and a batch size of 128 on EMNIST (Wan et al., 2013).
MovieLens Regression (MovieLensR) — Similar to (Chen et al., 2021), the target task takes in user-related features (i.e., id, gender, age, and occupation) and movie-related features (i.e., id, title, and genres), and predicts the movie rating as a regression task. The movie ratings are integers from 1 to 5. We use mean squared error (MSE) as the loss function.
We use multilayer perceptron (MLP) models with fully connected layer sizes [250, 100, 50] for the rating prediction task. Each model trains for 20 epochs to converge. We only use the observed ratings in MovieLens as training data. We set the user id and item id embedding size to 8 (He et al., 2017), the user age embedding size to 3, and the user occupation embedding size to 5.
MovieLens Classification (MovieLensC) — We use the same setting here as the MovieLensR task except for the prediction objective. We predict movie ratings as 5 integer values from 1 to 5 and model this problem as a multiclass classification task with Softmax cross entropy as the loss function.
Criteo — This target task uses a set of numerical and categorical features to predict the click-through rate. The label is 1 if the ad is clicked and 0 otherwise. We model this problem as a binary classification task with the Sigmoid cross entropy loss function. The trained model outputs a float between 0 and 1 representing the predicted click-through probability.
We use multilayer perceptron (MLP) models with layer sizes [250, 100, 50]. The models are trained for 1 epoch to converge.
EMNIST — This task is a digit recognition task. It takes a grayscale image representing a digit from 0 to 9, and predicts the digit in the image. We use the LeNet-5 (LeCun et al., 1998) model as a 10-category classification model: it is composed of two convolutional layers, each followed by a max-pooling layer, and two fully-connected layers (FCL) of sizes [120, 84]. It is trained for 50 epochs to converge.
3.3. Variation Estimation Task
As shown in Figure 1, the variation estimation task takes in the activation strength features from all the neurons in the target task model, concatenates all the activation features, and feeds them into a DNN to estimate the prediction variation for each input example.
4. Dropout Variation Estimation Task
The variation prediction task model is trained with neuron activation strength features and dropout prediction variation (DPV) labels. The DPV labels are collected by running multiple inferences on the target task with dropout enabled. After the training is complete, we no longer need to make multiple dropout inferences for prediction variation estimation. Instead, we can simply let the trained variation prediction model estimate the DPV using neuron activation strength. As a result, the variation estimation task serves as a cheap auxiliary task alongside the target task to provide the dropout prediction variation estimate for a given input example.
In this section, we first introduce the variation estimation task setup of two configurations that mimic real-world deployment scenarios. Then we formally define the prediction variation labels and discuss the input features of neuron activation strength.
4.1. Variation Estimation Task Configurations
As shown in Figure 2, we consider two configurations for the dropout variation estimation task depending on whether the deployed target task is trained with dropout.

Configuration 1: represents the scenario when the deployed target task is trained with dropout enabled. Under this scenario, we train the target task with dropout, and during its inference time, we disable dropout and collect neuron activation values. These neuron activation values serve as input features to the variation estimation task. Prediction variation (PV) labels are collected by obtaining target task predictions multiple times with dropout enabled.

Configuration 2: represents the scenario when the deployed target task is not trained with dropout. This might happen in real-life applications. For example, many production models are trained without dropout in order to achieve the best accuracy performance (Covington et al., 2016). Similar to Configuration 1, we collect the neuron activation values on the target task during its inference time as the input features to the variation estimation task. To obtain the labels for dropout prediction variations, we set up a side target task with dropout enabled and gather PV labels by drawing multiple predictions from this side target task. After the training of the variation estimation task model is complete, the side target task in Configuration 2 is no longer needed.
For all the datasets, we only enable neuron dropout on the fully connected layers. In particular, on EMNIST, we do not enable dropout on the convolutional layers. Note that we only consider the most typical dropout setup, which is to apply dropout on fully connected layers (Gal and Ghahramani, 2016; Laptev et al., 2017), instead of less typical dropout setups such as embedding dropout (Yao et al., 2021).
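The label and feature collection of Configuration 1 can be sketched as follows. This is a minimal illustration, not the setup used in the experiments: the tiny randomly initialized MLP stands in for a trained target model, and the names `forward`, `init_layers`, and `drop_rate`, the layer sizes, and the use of inverted dropout are all assumptions made for the example.

```python
import math
import random


def dense(x, weights, bias):
    # weights[j] holds the incoming weights of output neuron j.
    return [sum(w * a for w, a in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x, layers, drop_rate=0.0, rng=None):
    """Run a small ReLU MLP; return (prediction, hidden activations).

    With drop_rate > 0, every hidden neuron is zeroed with that
    probability and the survivors are rescaled (inverted dropout).
    """
    hidden_activations = []
    h = x
    for weights, bias in layers[:-1]:
        h = [max(0.0, a) for a in dense(h, weights, bias)]
        if drop_rate > 0.0:
            keep = 1.0 - drop_rate
            h = [a / keep if rng.random() < keep else 0.0 for a in h]
        hidden_activations.extend(h)
    out_weights, out_bias = layers[-1]
    return dense(h, out_weights, out_bias)[0], hidden_activations


def init_layers(sizes, rng):
    # Random weights stand in for a trained target model (toy assumption).
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


rng = random.Random(0)
layers = init_layers([4, 8, 6, 1], rng)
x = [0.5, -1.0, 2.0, 0.1]

# Features: one deterministic pass with dropout disabled.
_, features = forward(x, layers)
# Labels: 100 stochastic passes with dropout enabled on the same model.
mc_predictions = [forward(x, layers, drop_rate=0.2, rng=rng)[0]
                  for _ in range(100)]
```

Under Configuration 2, the deterministic pass would run on the deployed model without dropout, while the 100 stochastic passes would run on a separately trained side model.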
4.2. Prediction Variation Labels
In this section, we define the labels for the dropout prediction variation estimation task. We collect the dropout prediction variation labels as follows:
Given a dropout rate $r$, we train the target task or the side target task (as shown in Figure 2) with neuron dropout until it converges. Then, we run inference on the trained dropout model $N$ times (we use 100 here) with the same neuron dropout rate of $r$ to obtain a set of predictions $\{\hat{y}_1, \dots, \hat{y}_N\}$ for each example $x$.
Following (Chen et al., 2021), for the binary classification task (i.e., Criteo) and the regression task (i.e., MovieLensR), we define the prediction standard deviation as the prediction variation. We formally define the prediction variation $\mathrm{PV}(x)$ for each example $x$ and its predictions $\{\hat{y}_1, \dots, \hat{y}_N\}$ as follows.
$$\mathrm{PV}(x) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - \bar{y}\right)^2} \qquad (1)$$
where $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} \hat{y}_i$.
For the multiclass classification task (i.e., MovieLensC and EMNIST), the predictions for each input $x$ over the $N$ dropout inference runs are a set of prediction distributions $\{P_1, \dots, P_N\}$, where each prediction distribution $P_i$ represents the prediction distribution on the target classes. We adopt the KL divergence based prediction variation (Chen et al., 2021), and formally define the prediction variation $\mathrm{PV}(x)$ as follows.
$$\mathrm{PV}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\left(P_i \,\|\, \bar{P}\right) \qquad (2)$$
where $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$ represents the averaged prediction distribution over the $N$ dropout inference runs.
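Table 1. Dropout variation estimation performance as a regression task. C1 and C2 denote Configuration 1 and Configuration 2; the number in each column header is the dropout rate.
Dataset | C1, 0.1 | C1, 0.2 | C1, 0.3 | C1, 0.4 | C1, 0.5 | C2, 0.1 | C2, 0.2 | C2, 0.3 | C2, 0.4 | C2, 0.5
EMNIST | 0.513 | 0.574 | 0.554 | 0.542 | 0.456 | 0.450 | 0.494 | 0.445 | 0.441 | 0.307
Criteo | 0.935 | 0.944 | 0.946 | 0.944 | 0.942 | 0.839 | 0.853 | 0.859 | 0.862 | 0.850
MovieLensC | 0.879 | 0.900 | 0.912 | 0.922 | 0.925 | 0.477 | 0.545 | 0.595 | 0.640 | 0.667
MovieLensR | 0.853 | 0.900 | 0.931 | 0.945 | 0.949 | 0.575 | 0.630 | 0.722 | 0.753 | 0.771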
4.3. Neuron Activation Strength Features
We use ReLU (Nair and Hinton, 2010) as the activation function in target task models, and consider two types of input features on neuron activation strength.

Binary — This binary feature represents whether a neuron is activated, i.e. whether the neuron output is greater than 0.

Value — The raw value of a neuron’s output which directly represents the strength of an activated neuron. We normalize the neuron outputs according to the neuron output mean and standard deviation collected from the training data, and use the normalized activation value as the input feature.
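The two feature types can be illustrated with a short sketch. The helper names `fit_normalizer` and `activation_features` are hypothetical, and the handling of neurons with zero variance on the training data (their normalized value is set to 0) is an assumption not specified in the text.

```python
import math


def fit_normalizer(training_activations):
    """Per-neuron mean and standard deviation over training examples.

    training_activations is a list of rows, one row of neuron outputs
    per training example.
    """
    n = len(training_activations)
    num_neurons = len(training_activations[0])
    means = [sum(row[j] for row in training_activations) / n
             for j in range(num_neurons)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2
                          for row in training_activations) / n)
            for j in range(num_neurons)]
    return means, stds


def activation_features(activations, means, stds, eps=1e-8):
    """Concatenate binary and normalized-value features for one example."""
    binary = [1.0 if a > 0.0 else 0.0 for a in activations]
    # Assumption: constant neurons (std ~ 0) get a normalized value of 0.
    value = [(a - m) / s if s > eps else 0.0
             for a, m, s in zip(activations, means, stds)]
    return binary + value


means, stds = fit_normalizer([[0.0, 2.0], [2.0, 2.0]])
example_features = activation_features([2.0, 0.0], means, stds)
```

With both feature types enabled, each neuron contributes two inputs to the variation estimation model, which is why the input width is twice the number of neurons used.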
Given the two types of features on neuron activation strength, we can simply take the features on all the neurons of the target task as the input features for the variation estimation task. However, this simple approach could lead to a potential training problem on the variation estimation task. For example, the layer sizes of the EMNIST target task model are [3456, 1024, 120, 84]. Taking all the neurons’ activation strength as the input features for the variation estimation task leads to 4684 raw activation values. If we use both raw and binary input features, the number doubles. A neural network built on these 4684 features to estimate prediction variation would have a large number of parameters. Unfortunately, the total size of the EMNIST dataset is only 280k, so the variation estimation model easily overfits. Experiments using activations from all EMNIST layers achieve better performance than those using the last three layers on the training set, but lower performance on the validation set. Therefore, for EMNIST, by default only activations of the fully-connected layers (FCL) are used as variation estimation model input features.
As deep neural networks are getting larger and larger, the target task might still contain too many neurons and it can be a problem to use all the FCL neurons’ activation strength as input features for the variation estimation task. As a result, we are interested in further reducing the complexity and cost of variation estimation task by selecting a subset of neuron activation features. As different layers in a DNN reflect different representations of input data [42,51], it’s plausible that some layers may be more critical in capturing prediction variations. Therefore, one simple and natural way to select the subset of neurons is by layers. We demonstrate our findings in Section 5.3.
5. Experiments
In this section, we evaluate the performance of dropout prediction variation estimation using neuron activation strength.
5.1. Experiment Setup
In this section, we discuss how we set up the evaluation of our dropout prediction variation estimation task.
Training/Testing Data — We use the three datasets (mentioned in Section 3.1), including EMNIST (280K digit images), MovieLens (1M user-movie ratings) and Criteo (45.8M examples). We set up the training and testing data for the target task and the variation estimation task as follows:
First we build four target tasks on the three datasets as discussed in Section 3.2. We split the data into target task training data and target task testing data: we randomly divide EMNIST into 50% and 50% as training and testing, to make sure we have enough images (140K) for testing; for MovieLens, we use a 60% and 40% split, the same as the setting in (Chen et al., 2021); for Criteo, we use the same setting as in (Ovadia et al., 2019): 37M training data, 4.4M validation data, and 4.4M testing data.
We further split the target task testing data into two parts as the training data and testing data for the variation estimation task. For all three datasets, we simply randomly split the target task testing data into 50% and 50% as the variation estimation training data and the variation estimation testing data, respectively.
Evaluation — During the evaluation, we first use the target task training data to train the target task. We use the Adam optimizer with initial learning rate 0.01 for Criteo, 0.1 for EMNIST, and 0.001 for MovieLensC and MovieLensR. More detail of the target task model configurations is discussed in Section 3.2.
We then set up the variation estimation task evaluation of two configurations as discussed in Section 4. We use the variation estimation training data to train the variation estimation task. We collect the dropout prediction variation labels from the target task or the side target task with dropout enabled for training and inference according to the configurations discussed in Section 4.1. We show the target task accuracy performance under different dropout settings in Appendix A.1. Then we use the variation estimation testing data to test the variation estimation performance. We repeat this process 20 times and report the averaged performance. All the models are trained for 50 epochs. We use the Adam optimizer with initial learning rate 0.001, decaying by 0.1 at the 30th and 40th epochs.
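The learning rate schedule of the variation estimation model can be written as a small step function. This sketch reads "decaying by 0.1" as multiplying the rate by 0.1, and treats epochs as 0-indexed with the drop taking effect once 30 and 40 epochs have been completed; both readings are assumptions, as is the function name.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(30, 40), factor=0.1):
    """Step schedule: multiply the base rate by `factor` at each milestone.

    `epoch` is 0-indexed (an assumption about how epochs are counted).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops


# Learning rate used in each of the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(50)]
```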
Task Objectives — We define the prediction variation estimation task in the following two ways:

Regression — The model directly estimates the prediction variation as a regression task, using Mean Squared Error (MSE) as the loss function. However, by directly optimizing for MSE, this regression task’s output range could be huge in some extreme cases. As a result, we clip the outputs of the model to a certain range. Concretely, at the inference phase, the model outputs are clipped to the range between the minimum and maximum prediction variation values observed in the training examples. Moreover, as the prediction variation distribution is extremely skewed on EMNIST (more detail in Section 5.2), we transform the raw prediction variation values to log scale as labels. For Criteo and MovieLens, we directly use the raw prediction variation values as labels.
Classification — We evenly divide the prediction variation values into multiple buckets according to their percentiles, and then predict which variation bucket an example belongs to. We set the number of buckets to 5, and use cross entropy as the loss function for the prediction variation classification model. Bucket 0 represents the most certain class, and bucket 4 represents the most uncertain class.
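The label preparation for both objectives can be sketched as follows. The helper names are hypothetical, the percentile cut points are taken directly from the sorted training values (one of several reasonable conventions), and the small `eps` added before the logarithm is an assumption to keep the log-scale label defined when a variation is exactly 0.

```python
import math


def bucket_edges(train_pv, num_buckets=5):
    """Cut points that split the training PV values into equal-sized buckets."""
    ordered = sorted(train_pv)
    n = len(ordered)
    return [ordered[(n * k) // num_buckets] for k in range(1, num_buckets)]


def to_bucket(pv, edges):
    """Bucket index: 0 is the most certain class, len(edges) the most uncertain."""
    return sum(1 for edge in edges if pv >= edge)


def clip_to_training_range(estimate, train_pv):
    """Clip a regression output to the [min, max] of the training labels."""
    return min(max(estimate, min(train_pv)), max(train_pv))


def log_label(pv, eps=1e-8):
    # eps is an assumption; it avoids log(0) for examples with zero variation.
    return math.log(pv + eps)


train_pv = [i / 10 for i in range(10)]
edges = bucket_edges(train_pv)
train_buckets = [to_bucket(v, edges) for v in train_pv]
```

Because the cut points come from percentiles, every bucket holds the same share of the training examples regardless of how skewed the raw variation values are.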
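Table 2. Dropout variation estimation accuracy as a classification task of 5 buckets. C1 and C2 denote Configuration 1 and Configuration 2; the number in each column header is the dropout rate.
Dataset | C1, 0.1 | C1, 0.2 | C1, 0.3 | C1, 0.4 | C1, 0.5 | C2, 0.1 | C2, 0.2 | C2, 0.3 | C2, 0.4 | C2, 0.5
EMNIST | 0.530 | 0.497 | 0.463 | 0.429 | 0.463 | 0.471 | 0.447 | 0.418 | 0.384 | 0.411
Criteo | 0.808 | 0.793 | 0.769 | 0.731 | 0.689 | 0.714 | 0.703 | 0.686 | 0.656 | 0.619
MovieLensC | 0.639 | 0.660 | 0.674 | 0.688 | 0.690 | 0.442 | 0.450 | 0.458 | 0.467 | 0.473
MovieLensR | 0.604 | 0.646 | 0.670 | 0.683 | 0.695 | 0.494 | 0.503 | 0.511 | 0.518 | 0.521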
5.2. Variation Estimation Performance
In this section, we conduct experiments to demonstrate the dropout prediction variation estimation performance on the two configurations shown in Figure 2. Throughout this paper, we consider the dropout rates {0.1, 0.2, 0.3, 0.4, 0.5}.
Dropout Prediction Variation Distributions — We first show the data distribution for the dropout prediction variations. On each dataset, for a fixed dropout rate, we divide the prediction variation range into 500 buckets. Then we plot the histogram of prediction variation as shown in Figure 3.
As we can see from Figure 3, the distribution of dropout prediction variations differs considerably across datasets. In particular, the EMNIST distribution is more skewed compared to the other tasks. We show more detail of the distributions in Appendix A.2.
Dropout Variation Estimation Performance — Table 1 shows the dropout variation estimation performance as a regression task. For each setting, we estimate the dropout prediction variation as a regression task, and calculate the score between the model-estimated DPV and the DPV labels. Table 2 shows the dropout variation estimation accuracy as a classification task of 5 buckets. In addition, we show the confusion matrix for the classification performance using Configuration 1 when the dropout rate is 0.1 in Figure 4. Results are similar when using other dropout rates. We run 20 models for each setting, and report the averaged performance.
As shown in Table 1, we are able to infer the dropout prediction variation using activation strength features in almost all the settings: we obtain a score of 0.5 or even higher for almost all the configurations, except on EMNIST. The reason is that, different from MovieLens and Criteo, the prediction variation distribution on EMNIST is extremely skewed, as shown in Figure 3. In EMNIST, most of the images can be recognized as a number without much confusion, and the prediction variations of those images are very close to 0. We show several example digit-wise prediction variation distributions on EMNIST in Appendix A.2.
Table 2 shows that we can achieve overall fairly good accuracy performance on the prediction variation classification task for almost all the settings. Figure 4 shows that the classification performance for the top and bottom buckets is much better than the rest of buckets, with the accuracy close to 0.8 for MovieLens and even close to 0.9 for Criteo. On EMNIST, we observe the accuracy for bucket 0 (the most certain bucket) is around 0.8, but much lower on bucket 4 (the most uncertain bucket). The reason is still due to the extremely skewed PV distribution on EMNIST. A significant portion of images in bucket 4 do not have high PV.
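A per-bucket accuracy of this kind can be read off the diagonal of a row-normalized confusion matrix. The sketch below assumes that the accuracy for a bucket is the fraction of examples whose true bucket is that bucket and that are predicted correctly; the function names and the toy labels are illustrative only.

```python
def confusion_matrix(true_buckets, predicted_buckets, num_buckets=5):
    """matrix[t][p] counts examples with true bucket t predicted as bucket p."""
    matrix = [[0] * num_buckets for _ in range(num_buckets)]
    for t, p in zip(true_buckets, predicted_buckets):
        matrix[t][p] += 1
    return matrix


def per_bucket_accuracy(matrix):
    """Fraction of each true bucket that is predicted correctly.

    Buckets with no examples yield NaN.
    """
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]


# Toy labels: bucket 0 is half right, bucket 4 fully right, bucket 2 wrong.
matrix = confusion_matrix([0, 0, 4, 4, 2], [0, 1, 4, 4, 3])
accuracy = per_bucket_accuracy(matrix)
```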
The dropout prediction variation of different dropout rates might be different. In fact, the DPV correlations of different dropout rates might vary by dropout rates and datasets (more details in Appendix A.3). As a result, the dropout settings (e.g. dropout rate) might need to be tuned for the downstream applications for better performance. Note that in this paper, we do not focus on finding the optimal dropout settings for the downstream applications. Instead, given the optimal dropout setting, we provide a cheap way to produce the DPV score using neuron activation strength.
In summary, we have demonstrated that we can infer dropout prediction variation using neuron activation strength fairly well. In particular, we are able to generate the DPV with a simple forward propagation path using the variation estimation task model, and the estimated DPV strongly correlates with the DPV collected from running dropout inference 100 times.
5.3. Activation Feature Selection by Layers
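Table 3. Number of parameters of the variation estimation model when using activation features from different FCLs. Percentages give the reduction relative to using all FCLs.
Dataset | All FCLs | Bottom 2 FCLs | Bottom FCL
EMNIST | 45,850 | - | 29,050 (36.6%)
Criteo | 85,050 | 75,050 (11.8%) | 55,050 (35.3%)
MovieLensC | 85,050 | 75,050 (11.8%) | 55,050 (35.3%)
MovieLensR | 85,050 | 75,050 (11.8%) | 55,050 (35.3%)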
In this section, we investigate whether we can further reduce the resource cost of variation estimation by examining whether a subset of the neuron activation strength features can achieve on-par performance. Following the discussion in Section 4.3, we measure the effect of using the neuron activation strength from different layers as the input features to the variation estimation task.
Due to the space limit, we only show the prediction variation regression task results using configuration 1 (as shown in Figure 2), and the results are similar for configuration 2.
In Figure 5, we show the dropout prediction variation estimation performance using activation strength features from the bottom N FCLs of the target task. The figure shows that using the activation strength features of the bottom two FCLs achieves almost the same performance as using the features of all the FCLs, for all dropout rates tested. Using only the bottom FCL also does not significantly reduce the estimation performance in most cases. We quantify the capacity of the variation estimation model for different FCL choices by its number of parameters, as shown in Table 3. For example, on EMNIST, the number of parameters when using all the FCLs is 2*204*100 + 100*50 + 50*1 = 45,850, since the variation estimation model takes in 2*204 features (a binary and a value feature for each of the 204 neurons) and feeds them into a neural network of sizes [100, 50, 1]. As shown in Table 3, using only the bottom FCL achieves around a 35% reduction in model complexity compared to using all the FCLs for prediction variation estimation.
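The parameter counts in Table 3 can be reproduced with a short sketch. It follows the arithmetic stated above (two features per neuron, hidden sizes [100, 50, 1], bias terms not counted). The function names are hypothetical, and the EMNIST bottom-FCL width of 120 neurons is an assumption inferred from the 29,050 figure rather than a value stated in the text.

```python
def estimator_param_count(num_neurons, hidden=(100, 50, 1), features_per_neuron=2):
    """Weight count of the variation estimation model (biases ignored, as in the text)."""
    sizes = [features_per_neuron * num_neurons, *hidden]
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))


def reduction(subset_neurons, all_neurons):
    """Fractional parameter reduction from using a subset of the neurons."""
    return 1.0 - estimator_param_count(subset_neurons) / estimator_param_count(all_neurons)


# EMNIST: 204 neurons across all FCLs (stated in the text).
print(estimator_param_count(204))  # 45850

# Criteo: FCLs of 250, 100, and 50 neurons (widths mentioned in Section 5.3).
print(estimator_param_count(250 + 100 + 50))  # all FCLs
print(estimator_param_count(250 + 100))       # bottom 2 FCLs
print(estimator_param_count(250))             # bottom FCL
print(round(100 * reduction(250, 400), 1))    # percent reduction, bottom FCL

# EMNIST bottom FCL: 120 neurons is an inferred assumption, not from the text.
print(estimator_param_count(120))
```

Because the first layer dominates the count, the saving scales almost linearly with the number of neurons removed from the input features.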
In addition, instead of using the activation strength features from multiple bottom FCLs, we also tried other combinations of layers. The results are similar: different combinations of layers achieve similar performance on the dropout prediction variation regression task. For example, we experimented with using the neuron activation features of a single FCL on Criteo for DPV estimation. The correlation metric between the estimated DPV and the DPV labels is 0.882, 0.875, or 0.837 when using the 250-neuron FCL, the 100-neuron FCL, or the 50-neuron FCL as features, respectively.
In summary, we demonstrate that neuron activation strength features from a subset of FCLs reach very similar performance on dropout prediction variation estimation with a 12% compute reduction, and show only small degradations in some cases with a 35% compute reduction, compared to using all the activation features. We believe that, in general, a subset of FCLs is enough for variation estimation. Which layers to select for the desired tradeoff between performance and resources depends on the dataset and the task.
5.4. Discussion on Dropout vs Ensemble Prediction Variation
In this paper, we focus on using neuron activations to estimate dropout prediction variations. Here, we extend the discussion of dropout prediction variations and their relationship with ensemble prediction variations.
Deep neural networks have complex loss landscapes with many local optima, and convergence to different local optima leads to prediction variation. A variety of factors affect model convergence. Different random initializations of DNN parameters allow models with the same specification to converge to different local optima. Different training data samples or orderings may affect model convergence too. For example, the training data may be sampled through bootstrap or jackknife, thus producing different training data samples. In addition, distributed training exacerbates small differences in hardware or software libraries. In short, deviations during training make models converge to different local optima, thus contributing to prediction variations (Chen et al., 2021; Summers and Dinneen, 2021; D’Amour et al., 2020; Snapp and Shamir, 2021).
We collect dropout prediction variations by training the target task with neuron dropout and then running inference with dropout enabled many times. During training, probabilistically dropping out neurons of a neural network can be viewed as creating a different submodel at each training step. The differences among these submodels depend on the dropout rate: a higher dropout rate introduces more diversity among the submodels. Table 5 in the Appendix shows that, as the dropout rate increases, both the mean and the spread of the prediction variation increase. At the end of training, the final model contains a collection of submodels. At inference time, if no neurons are dropped out, the prediction can be viewed as the average of the predictions from all submodels. Dropout prediction variations quantify the prediction differences among these submodels. In this sense, dropout training can be viewed as an ensemble of these submodels. The view of dropout as one type of ensemble is also widely held in the literature (Srivastava et al., 2014; Hara et al., 2016; Gal and Ghahramani, 2015). As a result, we hypothesize that dropout prediction variations may be more correlated with some forms of ensemble prediction variations than with others. For example, dropout prediction variations may not strongly correlate with PV derived from ensembles that differ only in random parameter initialization. This hypothesis may be data dependent and remains to be verified in the future.
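As a minimal sketch of the collection procedure described above, the following NumPy code runs a small ReLU network with dropout enabled at inference time and takes the standard deviation of the repeated predictions as the per-example DPV (the std-based definition used for regression targets). The network, its random untrained weights, the layer sizes, and all function names are hypothetical and serve only to illustrate the mechanics; they are not the models used in this paper.

```python
import numpy as np


def dropout_forward(x, weights, rate, rng):
    """One stochastic forward pass of a ReLU MLP with inverted dropout on hidden layers."""
    h = x
    last = len(weights) - 1
    for i, (w, b) in enumerate(weights):
        h = h @ w + b
        if i < last:  # hidden layer: apply ReLU, then dropout
            h = np.maximum(h, 0.0)
            if rate > 0.0:
                keep = rng.random(h.shape) >= rate  # each mask is one "submodel"
                h = h * keep / (1.0 - rate)
    return h[:, 0]


def dropout_prediction_variation(x, weights, rate, n_runs=100, seed=0):
    """Std of predictions over n_runs dropout-enabled inference passes, per example."""
    rng = np.random.default_rng(seed)
    preds = np.stack([dropout_forward(x, weights, rate, rng) for _ in range(n_runs)])
    return preds.std(axis=0)


# Hypothetical toy network with random (untrained) weights and random inputs.
init_rng = np.random.default_rng(42)
sizes = [8, 16, 8, 1]
weights = [(init_rng.normal(0.0, 0.5, (m, n)), np.zeros(n))
           for m, n in zip(sizes[:-1], sizes[1:])]
x = init_rng.normal(size=(5, 8))

pv_with_dropout = dropout_prediction_variation(x, weights, rate=0.1)
pv_without_dropout = dropout_prediction_variation(x, weights, rate=0.0)
print(pv_with_dropout)
print(pv_without_dropout)  # no randomness at inference, so no variation
```

The 100 forward passes per example are exactly the inference cost that the variation estimation task is meant to avoid.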
6. Conclusion and Future Work
In this paper, we study the problem of using neuron activation strength to estimate dropout prediction variation as a low-cost auxiliary task. We demonstrated that we can infer dropout prediction variation using neuron activation strength. In particular, we can use a subset of FCLs as the neuron activation strength features to achieve good performance with further resource reduction.
In the future, we are interested in understanding the composition of the sources of prediction variation, especially the difference between ensemble-derived and dropout-derived prediction variation, which could differ under different settings. Decomposing the sources of prediction variation would help us understand this difference and how to use both kinds of prediction variation in real-life applications. In addition, we are interested in exploring our neuron activation strength based method for dropout prediction variation estimation in different applications, such as reinforcement learning, active learning, and curriculum learning.
References
 Multi-loss sub-ensembles for accurate classification with uncertainty estimation. arXiv preprint arXiv:2010.01917. Cited by: §1.
 Uncertainty quantification through dropout in time series prediction by echo state networks. Mathematics 8 (8), pp. 1374. Cited by: §2.
 Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §2.
 Beyond point estimate: inferring ensemble prediction variation from neuron activation strength in recommender systems. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 76–84. Cited by: §1, §1, §1, §1, §2, §3.2, §3.2, §3.3, §3, §4.2, §4.2, §5.1, §5.4.
 EMNIST: extending mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: 3rd item.
 Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198. Cited by: 2nd item.
 Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395. Cited by: §1, §5.4.

 Towards safe deep learning: accurately quantifying biomarker uncertainty in neural network predictions. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 691–699. Cited by: §2.
 Towards safe autonomous driving: capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3266–3273. Cited by: §2.
 Deep ensembles: a loss landscape perspective. arXiv preprint arXiv:1912.02757. Cited by: §2.

 Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. Cited by: §2, §5.4.
 Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §1, §2, §2, §4.1.
 Concrete dropout. arXiv preprint arXiv:1705.07832. Cited by: §2.
 Deep bayesian active learning with image data. In International Conference on Machine Learning, pp. 1183–1192. Cited by: §1.
 A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342. Cited by: §2.
 Analysis of dropout learning regarded as ensemble learning. In International Conference on Artificial Neural Networks, pp. 72–79. Cited by: §5.4.
 The movielens datasets: history and context. Acm transactions on interactive intelligent systems (tiis) 5 (4), pp. 1–19. Cited by: 1st item.
 Training independent subnetworks for robust prediction. arXiv preprint arXiv:2010.06610. Cited by: §2.

 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
 Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182. Cited by: §3.2.
 Stochastic variational inference.. Journal of Machine Learning Research 14 (5). Cited by: §2.
 Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §3.2.

 To trust or not to trust a classifier. In Advances in neural information processing systems, pp. 5541–5552. Cited by: §1.
 Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182. Cited by: §2.

 Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680. Cited by: §2.
 Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.
 Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pp. 6402–6413. Cited by: §1, §1, §2.
 Time-series extreme event forecasting with neural networks at uber. In International conference on machine learning, Vol. 34, pp. 1–5. Cited by: §1, §4.1.
 Quantifying the uncertainty of deep learning-based computer-aided diagnosis for patient safety. Current Directions in Biomedical Engineering 5 (1), pp. 223–226. Cited by: §1, §2.
 Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: 3rd item, §3.2.
 Ensemble distribution distillation. arXiv preprint arXiv:1905.00076. Cited by: §1.
 Distilling ensembles improves uncertainty estimates. In Third Symposium on Advances in Approximate Bayesian Inference, Cited by: §2.
 Dropconnect is effective in modeling uncertainty of bayesian deep networks. Scientific reports 11 (1), pp. 1–14. Cited by: §2.
 Evaluating bayesian deep learning methods for semantic segmentation. arXiv preprint arXiv:1811.12709. Cited by: §2.
 Rectified linear units improve restricted boltzmann machines. In ICML, Cited by: §4.3.
 Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13991–14002. Cited by: §1, §5.1.
 Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430. Cited by: §2.
 Challenges in markov chain monte carlo for bayesian neural networks. arXiv preprint arXiv:1910.06539. Cited by: §2.
 Can you trust this prediction? auditing pointwise reliability after learning. arXiv preprint arXiv:1901.00403. Cited by: §1.
 Synthesizing irreproducibility in deep networks. arXiv preprint arXiv:2102.10696. Cited by: §5.4.
 Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §5.4.
 Nondeterminism and instability in neural network optimization. arXiv preprint arXiv:2103.04514. Cited by: §1, §5.4.
 Deep sub-ensembles for fast uncertainty estimation in image classification. arXiv preprint arXiv:1910.08168. Cited by: §1, §2.
 Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §3.2.
 Uncertainty quantification in molecular simulations with dropout neural network potentials. npj Computational Materials 6 (1), pp. 1–10. Cited by: §1, §2.
 BatchEnsemble: an alternative approach to efficient ensemble and lifelong learning. Eighth International Conference on Learning Representations (ICLR 2020). Cited by: §1, §2.
 Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715. Cited by: §1.
 Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791. Cited by: §2.
 Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113 (2), pp. 113–127. Cited by: §2.

 Self-supervised learning for large-scale item recommendations. CIKM. Cited by: §4.1.
Appendix A Appendix
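Table 4. Target task performance. "With" and "Without" denote inference with and without dropout, for models trained with the given dropout rate.

| Datasets | Ensemble | With 0.1 | With 0.2 | With 0.3 | With 0.4 | With 0.5 | Without 0.1 | Without 0.2 | Without 0.3 | Without 0.4 | Without 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EMNIST | 0.992 | 0.990 | 0.986 | 0.982 | 0.975 | 0.962 | 0.993 | 0.992 | 0.992 | 0.992 | 0.991 |
| Criteo | 0.799 | 0.797 | 0.796 | 0.793 | 0.790 | 0.787 | 0.799 | 0.798 | 0.798 | 0.797 | 0.795 |
| MovieLensC | 0.471 | 0.469 | 0.468 | 0.467 | 0.464 | 0.459 | 0.472 | 0.473 | 0.473 | 0.472 | 0.470 |
| MovieLensR | 0.777 | 0.799 | 0.799 | 0.802 | 0.811 | 0.824 | 0.775 | 0.774 | 0.776 | 0.777 | 0.780 |
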
A.1. Target Task Performance
In this section, we show the target task performance under different settings. Table 4 presents the target task accuracy for models trained with dropout rates from 0.1 to 0.5. In this table, MovieLensC and EMNIST are evaluated by accuracy, Criteo by AUC, and MovieLensR by mean squared error (MSE). Therefore, higher numbers for MovieLensC, EMNIST, and Criteo correspond to better performance, while lower numbers for MovieLensR correspond to better performance. To get robust model accuracy, for each dropout rate we train 20 models independently with dropout and report the average target task accuracy. When dropout is enabled at inference time with the same dropout rate, we run inference 100 times and report the average accuracy in the column group “Inference with Dropout”. We also record the accuracy without dropout at inference time in the column group “Inference without Dropout”. For this setting, because there is no additional randomness at inference time, only one inference is needed. The first column of Table 4 shows the ensemble model quality for comparison.
From the table, we can see that the accuracy of “Inference without Dropout” is very close to that of “Ensemble” for all target tasks and dropout rates. The accuracy of “Inference with Dropout” also holds up relatively well.
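Table 5. Prediction variation mean and std for the ensemble and for dropout rates from 0.1 to 0.5.

| Datasets | Ensemble mean | Ensemble std | 0.1 mean | 0.1 std | 0.2 mean | 0.2 std | 0.3 mean | 0.3 std | 0.4 mean | 0.4 std | 0.5 mean | 0.5 std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EMNIST | 1.897 | 8.932 | 1.693 | 5.555 | 3.456 | 7.546 | 5.702 | 8.950 | 9.008 | 10.195 | 14.747 | 11.734 |
| Criteo | 0.027 | 0.015 | 0.020 | 0.011 | 0.027 | 0.013 | 0.032 | 0.014 | 0.038 | 0.016 | 0.044 | 0.017 |
| MovieLensC | 4.488 | 3.096 | 1.265 | 0.758 | 1.694 | 0.925 | 2.113 | 1.111 | 2.599 | 1.346 | 3.235 | 1.653 |
| MovieLensR | 0.179 | 0.058 | 0.146 | 0.034 | 0.148 | 0.042 | 0.161 | 0.055 | 0.177 | 0.068 | 0.199 | 0.079 |
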
A.2. Dropout Prediction Variation Distributions
In this section, we provide supplemental details on the dropout prediction variation distributions for each target task. In addition to Figure 3, Table 5 shows the prediction variation mean and std for the different target tasks over all training examples. The first column group lists the prediction variation mean and std for random-init and random-shuffled ensembles, i.e., the most common ensemble configuration. The remaining columns show results for dropout rates from 0.1 to 0.5.
From the table, we observe that the prediction variation ranges differ considerably across target tasks. We believe these distribution differences reflect that different datasets have very different prediction variation properties. For example, most of the EMNIST data is easy to recognize and has very low prediction variation, but a small fraction has high prediction variation, so the std is very high compared to the mean. The MovieLensC PV mean and std do not show this pattern. Another factor contributing to the distribution differences is the task-specific definition of prediction variation. For example, MovieLensR uses the std of predictions as the prediction variation according to Equation 1, while EMNIST uses the KL-divergence-based measure according to Equation 2.
Further, when we increase the dropout rate, the mean and std of the prediction variation consistently increase. This result aligns with our intuition that a higher dropout rate introduces more diversity into the model and allows dropout predictions to differ more. In Figure 6, we show the prediction variation distributions of two different digits in EMNIST. Due to the page limit, we only show digits 0 and 4 under dropout rates 0.1 and 0.3. Digit 0 is one of the most easily recognized digits, while digit 4 sometimes gets misclassified as digit 9 and vice versa. Digit 0’s PV distribution does not change much from dropout rate 0.1 to 0.3, whereas digit 4’s PV distribution at dropout rate 0.3 is significantly less skewed than at dropout rate 0.1. Although the PV distribution of digit 4 looks less skewed at dropout rate 0.3, the overall distribution over all digits is still very skewed and similar to Figure 3.
A.3. Prediction Variation Correlation
In this section, we show the PV correlations between different dropout rates. Given the prediction variations estimated with two dropout rates, we calculate the correlation between them. We repeat this process 20 times with different model parameter initializations and report the averaged correlation, as shown in Figure 7. For correlations at the same dropout rate, we calculate the correlation between a model and its retrained model, and again report the average over 20 model pairs. As before, the dropout prediction variation is collected by running dropout inference 100 times.
Figure 7 shows the dropout prediction variation correlation for different dropout rate pairs on the four tasks. From the figure, we can see that dropout prediction variations collected using different dropout rates, or even the same dropout rate, can be quite different, and that the behavior differs across datasets. Given two sets of prediction variations collected using dropout, the larger the difference between the dropout rates, the worse the corresponding correlation generally is. For example, the correlations on the diagonal are usually higher than those in off-diagonal cells.
We also calculated the correlation of the dropout prediction variation when running dropout inference 50 and 10 times. Table 6 shows the dropout prediction variation correlation on MovieLensR when running dropout inference 100, 50, and 10 times. We observe that, in general, the prediction variation correlation tends to increase as the number of inferences increases.
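Table 6. Dropout prediction variation correlation on MovieLensR by number of dropout inferences. Columns are dropout rates.

| Number of Inferences | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|
| 10 | 0.373 | 0.450 | 0.552 | 0.600 | 0.620 |
| 50 | 0.617 | 0.684 | 0.769 | 0.801 | 0.810 |
| 100 | 0.677 | 0.730 | 0.807 | 0.834 | 0.842 |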
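
One contributing factor to the trend in Table 6 is sampling noise: a DPV computed from fewer inference runs is a noisier estimate. The toy simulation below illustrates only this effect and does not reproduce the experiment. It assumes synthetic Gaussian predictions with a hypothetical per-example spread, uses the Pearson correlation as the metric (an assumption), and compares two independent DPV estimates of the same examples; all names and constants are illustrative.

```python
import numpy as np


def pv_estimate(sigma, n_runs, rng):
    """Per-example std of n_runs synthetic predictions with true spread sigma."""
    preds = rng.normal(0.0, sigma, size=(n_runs, sigma.size))
    return preds.std(axis=0)


def repeat_correlation(sigma, n_runs, seed):
    """Pearson correlation between two independent PV estimates of the same examples."""
    rng = np.random.default_rng(seed)
    first = pv_estimate(sigma, n_runs, rng)
    second = pv_estimate(sigma, n_runs, rng)
    return float(np.corrcoef(first, second)[0, 1])


# Hypothetical true per-example prediction spread for 2000 synthetic examples.
sigma = np.random.default_rng(0).uniform(0.05, 0.2, size=2000)

corr_by_runs = {n: repeat_correlation(sigma, n, seed=1) for n in (10, 50, 100)}
for n, corr in corr_by_runs.items():
    print(n, round(corr, 3))
```

In this sketch the correlation rises with the number of runs because the noise in each std estimate shrinks, which mirrors the direction of the trend reported above.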