1 Introduction
Neural networks (NNs) have enjoyed much success in recent years, achieving state-of-the-art performance on a large number of tasks within domains such as natural language processing [transformers], speech recognition [speechrec] and computer vision [cvision]. Unfortunately, despite their predictive performance, NNs are known to yield poor estimates of the uncertainty in their predictions; they struggle to know what they do not know [randomseedensemble, calbirationmnn]. With the increasing application of neural network based systems to safety-critical tasks such as biometric identification [biometric], medical diagnosis [medical] or fully autonomous driving [kendalldriving], it becomes increasingly important to robustly estimate the uncertainty in a model's prediction. With access to accurate measures of predictive uncertainty, a system can act in a safer and more informed manner.
Ensemble methods, and related schemes, have become the standard approach for uncertainty estimation. randomseedensemble proposed generating a deep (random-seed) ensemble by training each member model with a different initialisation and stochastic gradient descent (SGD). Not only does this ensemble perform significantly better than a single NN trained in the standard fashion, it also yields better predictive uncertainty estimates. Although simple to implement, training and deploying an ensemble incurs a linear increase in computational cost. Alternatively, mcdropout introduced the Monte Carlo (dropout) ensemble (MC ensemble), which at test time estimates predictive uncertainty by sampling members of an ensemble using dropout. Though this approach generally does not perform as well as a deep ensemble (given the same computational power and neglecting memory) [randomseedensemble], it is significantly cheaper to train as it integrates ensemble generation into training.
Despite their computational expense, ensemble methods have an important ability: they can decompose predictive (total) uncertainty into data and knowledge uncertainty [decomposition, mcdropout]. Knowledge or epistemic uncertainty refers to a lack of knowledge, or ignorance, about the optimal choice of model (parameters) [epistemic]. As additional data is collected, the uncertainty in the model parameters should decrease. This form of uncertainty becomes important whenever the model is tasked with making predictions for out-of-distribution datapoints; for in-distribution inputs, the trained model is expected to return reliable predictions. Data or aleatoric uncertainty, on the other hand, represents inherent noise in the data being modelled, for example from overlapping classes. Even if more data is collected, this type of noise is inherent to the underlying process and cannot be avoided or reduced [priornetworks, mcdropout, trustuncertainty]. The ability to decompose and distinguish between these sources of uncertainty is important as it reveals the cause of the uncertainty in a prediction, which in turn advises the user how the prediction should be used in downstream tasks [bald, batchbald].
Summary of contributions: In this work we make two important contributions to NN classifier training and uncertainty prediction. First, we introduce self-distribution distillation (S2D), a new general training approach that, in an integrated and simultaneous fashion, trains a teacher ensemble and distribution distils its knowledge into a student. This integrated training allows the user to bypass training a separate, expensive teacher ensemble, while distribution distillation [en2d] allows the student to capture the diversity of, and model a distribution over, ensemble member predictions. Distribution distillation also gives the student the ability to estimate both data and knowledge uncertainty in a single forward pass, unlike standard NNs, which inherently cannot decompose predictive uncertainty, and unlike ensemble methods, which cannot perform the decomposition in a single pass. Second, we train an ensemble of these newly introduced models and investigate different distribution distillation techniques, giving rise to hierarchical distributions over predictions for uncertainty estimation. This approach is useful when there are no, or few, computational constraints in the training phase but robust uncertainties and efficiency are still required at deployment.
2 Background
This section describes two techniques for uncertainty estimation. First, ensemble methods for predictive uncertainty estimation will be viewed from a Bayesian viewpoint. Second, a specific form of distillation for efficient uncertainty estimation will be discussed.
2.1 Ensemble Methods
From a Bayesian perspective the parameters, θ, of a neural network are treated as random variables with some prior distribution p(θ). Together with the training data D, this allows the posterior distribution p(θ|D) to be derived. To obtain the predictive distribution over all classes (for some input x*), marginalisation over θ is required:

P(y|x*, D) = E_{p(θ|D)}[ P(y|x*, θ) ]

Since finding the true posterior is intractable, a variational approximation q(θ) ≈ p(θ|D) is made [jvi, approx1, approx2, swag]. Furthermore, marginalising over all weight values remains intractable, leading to a sampled-ensemble approximation [mcdropout, randomseedensemble]:

P(y|x*, D) ≈ (1/M) Σ_{m=1}^{M} P(y|x*, θ^(m)),   θ^(m) ~ q(θ)
Here, an ensemble generation method is required to obtain the predictive distribution and uncertainty. Two previously mentioned approaches to generate an ensemble are the deep (naive) random-seed ensemble and the MC-dropout ensemble.¹ Deep ensembles are based on training models on the same data but with different initialisations, leading to functionally different solutions. A MC-dropout ensemble, on the other hand, explicitly defines a variational approximation through the hyperparameters of dropout [dropout] (used during training), allowing for straightforward sampling of model parameters. Another approach, SWA-Gaussian (SWAG) [swag], finds a Gaussian approximation based on the first two moments of stochastic gradient descent iterates. Unlike the deep ensemble approach, and similar to MC-dropout, this method allows for simple and efficient sampling but suffers from higher memory consumption; even a diagonal Gaussian approximation requires twice the memory of a standard network.

¹ In-depth comparisons of ensemble methods were conducted in trustuncertainty, pitfalls.
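The SWAG-Diag idea above can be sketched in a few lines of numpy. This is a minimal illustration, not the full SWAG procedure; the "SGD iterates" are synthetic stand-ins for checkpointed weight vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SGD iterates of a flattened weight vector (T checkpoints, D weights).
sgd_iterates = rng.normal(loc=0.5, scale=0.1, size=(20, 4))

# SWAG-Diag: the first two moments of the SGD iterates define a diagonal Gaussian.
mean = sgd_iterates.mean(axis=0)
second_moment = (sgd_iterates ** 2).mean(axis=0)
var = np.maximum(second_moment - mean ** 2, 1e-12)  # clamp for numerical safety

# Sampling an ensemble member is then a single Gaussian draw per weight.
def sample_weights(n_samples):
    return mean + np.sqrt(var) * rng.standard_normal((n_samples, mean.shape[0]))

ensemble_weights = sample_weights(5)  # 5 sampled "ensemble members"
```

Note that only the running mean and second moment need to be stored during training, which is where the factor-of-two memory overhead mentioned above comes from.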
There also exist alternative memory- and/or compute-efficient ensemble approaches such as BatchEnsemble [batchens] and MIMO [mimo]. While the former is parameter efficient, it requires multiple forward passes at test time, similar to MC ensembles. The latter avoids this issue by generating independent subnetworks within a single deep model through the simultaneous "mixing" of multiple inputs and generation of multiple outputs. Although the training cost of such a system can be comparable to that of a deep ensemble [mimo], the inference cost is significantly lower. However, MIMO suffers from several drawbacks, one being the requirement of several input and output layers, which in large-scale classification could consist of many millions of parameters. Finally, while many of the mentioned ensemble methods can straightforwardly be generalised to sequence tasks such as neural machine translation, MIMO presents a further challenge: it is a non-trivial task to "mix" input sequences of different lengths and to decide how this should be handled by sequence models such as transformers.
2.1.1 Predictive Uncertainty Estimation
Given an ensemble, the goal is to estimate and decompose the predictive uncertainty. First, the entropy of the predictive distribution can be seen as a measure of total uncertainty. Second, this can be decomposed [decomposition, whatunc] as:

I[y, θ | x*, D] = H[ E_{p(θ|D)}[P(y|x*, θ)] ] − E_{p(θ|D)}[ H[P(y|x*, θ)] ]    (1)

where I[·, ·] is mutual information and H[·] represents entropy. This decomposition splits total uncertainty (the first term) into separate estimates of knowledge uncertainty (the mutual information) and data uncertainty (the expected entropy). Furthermore, the conditional mutual information can be rephrased as the expected divergence of individual member predictions from the predictive distribution:

I[y, θ | x*, D] = E_{p(θ|D)}[ KL( P(y|x*, θ) || P(y|x*, D) ) ]

For an in-domain sample the mutual information should be low, as the predictions of appropriately trained models should be close to the predictive distribution. High predictive uncertainty will then only occur if the input lies in a region of high data uncertainty, for example when an input has significant class overlap. When the input is out-of-distribution with respect to the training data, one should expect inconsistent predictions, leading to a much higher knowledge uncertainty estimate.
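The decomposition in eq. (1) is straightforward to compute from a set of member predictions. A minimal numpy sketch (the member predictions below are hypothetical):

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -(p * np.log(p + eps)).sum(axis=axis)

def decompose(member_probs):
    """member_probs: (M, K) categorical predictions from M ensemble members."""
    mean_pred = member_probs.mean(axis=0)
    total = entropy(mean_pred)                     # entropy of predictive distribution
    data = entropy(member_probs, axis=-1).mean()   # expected entropy of members
    knowledge = total - data                       # mutual information
    return total, data, knowledge

# In-domain-like input: members agree, so knowledge uncertainty is low.
agree = np.array([[0.90, 0.05, 0.05], [0.88, 0.07, 0.05]])
# OOD-like input: members disagree, so knowledge uncertainty is high.
disagree = np.array([[0.90, 0.05, 0.05], [0.05, 0.90, 0.05]])

_, _, ku_agree = decompose(agree)
_, _, ku_disagree = decompose(disagree)
assert ku_disagree > ku_agree
```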
2.2 Ensemble Distillation Methods
Ensemble methods have generally shown superior performance on a range of tasks but suffer from being computationally expensive. To tackle this issue, a technique called knowledge distillation (KD) and its variants were developed for transferring the knowledge of an ensemble (teacher) into a single (student) model while maintaining good performance [kd, sequencekd, onlinekd]. This is generally achieved by minimising the KL-divergence between the student prediction and the predictive distribution of the teacher ensemble. In essence, KD trains a new student model to predict the average prediction of its teacher. However, from the perspective of uncertainty estimation, the student model no longer has any information about the diversity of the ensemble member predictions; it was only trained to model the average prediction. Hence, it is no longer possible to decompose the total uncertainty into different sources; only the total uncertainty can be obtained from the student. To tackle this issue ensemble distribution distillation (En2D) was developed [en2d].
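To make the information loss concrete, here is a minimal numpy sketch of the standard KD objective; the teacher and student predictions are hypothetical numbers for illustration.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two categorical distributions."""
    return (p * (np.log(p + eps) - np.log(q + eps))).sum()

# Teacher ensemble predictions for one input (M members, K classes).
teacher = np.array([[0.8, 0.15, 0.05],
                    [0.6, 0.30, 0.10]])
teacher_mean = teacher.mean(axis=0)

# Hypothetical student prediction for the same input.
student = np.array([0.7, 0.22, 0.08])

# Standard KD: minimise KL(teacher_mean || student). Only the average survives,
# so information about member diversity is discarded.
loss = kl(teacher_mean, student)
assert loss >= 0.0
```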
Let π signify a categorical distribution, that is, π = [P(y = ω_1), …, P(y = ω_K)]ᵀ. The goal is to directly model the space of categorical predictions made by the ensemble. In work developed by en2d this is done by letting a student model (with weights φ) predict the parameters of a Dirichlet, a continuous distribution over categorical distributions:

p(π | x*; φ) = Dir(π; α̂),   α̂ = exp(z(x*; φ))    (2)

where z(x*; φ) are the logits of the student. The key idea in this concept is that we are not directly interested in the posterior p(θ|D) itself, but in how predictions for particular inputs behave when induced by this posterior. Therefore, it is possible to replace the ensemble with a trained distribution p(π | x*; φ). It is now necessary to train the student given the information from the teacher, which is straightforwardly done using the negative log-likelihood:

L(φ) = − (1/M) Σ_{m=1}^{M} ln p(π^(m) | x*; φ),   π^(m) = P(y | x*, θ^(m))    (3)

A decomposable estimate of total uncertainty is then possible by using the conditional mutual information between the class y and the prediction π [priornetworks]:

I[y, π | x*; φ] = H[ E_{p(π|x*; φ)}[π] ] − E_{p(π|x*; φ)}[ H[π] ]    (4)
This decomposition has a similar interpretation to eq. (1). Using a Dirichlet model, these uncertainties can be found using a single forward pass, achieving a much higher level of efficiency compared to an ensemble. Assuming this distillation technique is successful, the distribution distilled student should be able to closely emulate the ensemble and be able to estimate similar high quality uncertainties on both ID and OOD data.
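The single-forward-pass uncertainties of eq. (4) have closed forms for a Dirichlet. A minimal sketch, using a scalar digamma approximation to stay self-contained (the α values are hypothetical):

```python
import numpy as np

def digamma(x):
    # Scalar digamma: recurrence to push x above 6, then an asymptotic series.
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    return r + np.log(x) - 1.0/(2*x) - 1.0/(12*x**2) + 1.0/(120*x**4)

def dirichlet_uncertainties(alpha):
    """Total, data and knowledge uncertainty of Dir(alpha), one forward pass."""
    alpha = [float(a) for a in alpha]
    a0 = sum(alpha)
    mean = [a / a0 for a in alpha]
    total = -sum(m * np.log(m) for m in mean)  # entropy of expected categorical
    data = -sum(m * (digamma(a + 1.0) - digamma(a0 + 1.0))
                for m, a in zip(mean, alpha))  # expected entropy under Dir(alpha)
    return total, data, total - data           # knowledge = mutual information

# A sharp Dirichlet (confident, consistent members) vs a flat one (OOD-like).
sharp = dirichlet_uncertainties([50.0, 1.0, 1.0])
flat = dirichlet_uncertainties([1.0, 1.0, 1.0])
assert flat[2] > sharp[2]  # flatter Dirichlet -> higher knowledge uncertainty
```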
However, ensemble distribution distillation is only applicable and useful when the ensemble members are not overconfident and display diversity in their predictions; there is no need to capture diversity when there is none. For example, many state-of-the-art convolutional neural networks are overparameterised, display severe overconfidence and can essentially achieve perfect training accuracy, which restricts the effectiveness of distribution distillation in terms of capturing the diversity in the teacher ensemble [calbirationmnn, singleshot, largescaleunc]. Furthermore, this method can only be used when an ensemble is available, leading to a high training cost.
3 Self-Distribution Distillation
In this section we propose self-distribution distillation (S2D) for efficient training and uncertainty estimation, bypassing the need for a separate teacher ensemble. This combines:

parameter sharing: the teacher and student share a common feature extraction base, accelerating training significantly; each then branches off into its own head;

stochastic regularisation: the teacher can generate multiple predictions efficiently by forward propagating an input through its head (with a stochastic regulariser) several times, emulating the behaviour of an ensemble;

distribution distillation: while the teacher branch is trained on cross-entropy, the student is taught to predict a distribution over teacher predictions, capturing the diversity compactly.
This process is summarised in Fig. 2.
The proposed approach can take many specific forms with regard to the type of feature extraction module, stochastic regulariser, teacher branch and student modelling choice. For example, the teacher could entail a much larger branch capturing complex patterns in the data, while the student could consist of a smaller branch used for compressing teacher knowledge into a more efficient form at test time. At the other extreme, training efficiency can be achieved by forcing the teacher and student to share the same branch parameters.
In this work, we choose a highly efficient model configuration, shown in Fig. 1. The main functional difference between the teacher and the student branches is the use of the logit values z: the teacher branch predicts a probability from the logits, whereas the student uses the logits to parameterise a Dirichlet distribution. Furthermore, the teacher uses stochastic regularisation techniques (SRTs) to generate multiple teacher predictions, analogous to an ensemble. In this work multiplicative Gaussian noise (Gaussian dropout) with unit mean and uniformly random standard deviation is used. This form was chosen due to the simplicity of sampling and the possible ensemble diversity. There is a wide range of other SRT choices, from Bernoulli dropout to additive Gaussian noise, as well as the decision of which teacher branch layers the noise should be introduced at. Furthermore, since the Dirichlet distribution has a bounded ability to represent diverse ensemble predictions [en2d], simply generating multiple teacher predictions by propagating through the last layer will not limit the ability of the model. To further improve memory efficiency, a single final linear layer shared by both student and teacher branches is used. This parameter sharing makes the S2D model efficient even when the number of classes is large, and does not use any more parameters than a standard model. Note that any NN classifier can be cast into a self-distribution distillation format by inserting stochasticity prior to the final linear layer, and S2D can easily be combined with many other approaches such as MIMO [mimo] and SWAG [swag].

This choice of integrating ensemble teacher training and distribution distillation into a single entity utilising parameter tying also serves as a regulariser (two objectives are optimised using the same set of weights) and allows for inexpensive training. The only added training cost comes from the multiple forward passes through the final linear layer, a process which can easily be parallelised. Additionally, the restricted form of Fig. 1 brings some numerical stability. As noted by en2d, optimising a student to predict a Dirichlet distribution can be unstable when there is a lack of common support between the prediction and extremely sharp teacher outputs. However, note that the teacher predictions are closely related to the expected student prediction: since both derive from the same logits, the mean of the student Dirichlet, E[π] = α / Σ_c α_c, matches the noise-free teacher softmax, leading to increased common support. Additionally, the stochasticity in the teacher forces the outputs to have some diversity, mildly limiting overconfidence.
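The S2D forward pass can be sketched in numpy. This is a minimal sketch under simplifying assumptions: the feature vector, weight matrix and noise level are illustrative stand-ins for the shared feature extractor and final linear layer.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shared components: feature vector h and final linear layer (W, b).
h = rng.normal(size=8)                      # output of the shared feature extractor
W, b = rng.normal(size=(3, 8)), np.zeros(3)

# Teacher branch: multiplicative Gaussian noise (unit mean) on the features,
# repeated M times through the shared final layer to emulate an ensemble.
M, sigma = 5, 0.5
teacher_preds = np.stack([
    softmax(W @ (h * rng.normal(1.0, sigma, size=h.shape)) + b) for _ in range(M)
])

# Student branch: the clean logits parameterise a Dirichlet via exponentiation.
alpha = np.exp(W @ h + b)

# The expected student prediction alpha / alpha.sum() equals the softmax of the
# same logits, so teacher and student share common support by construction.
assert np.allclose(alpha / alpha.sum(), softmax(W @ h + b))
```

The only per-input overhead relative to a standard model is the M noisy passes through the final linear layer, consistent with the cost analysis above.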
3.1 Training Criteria and Temperature
We now train the teacher branch using cross-entropy and, simultaneously, use the teacher predictions to train the student branch. Let the weights of this model be denoted by θ and consider an input-target pair (x, y). The teacher loss (for a single sample) is then the standard cross-entropy:

L_T(θ) = − Σ_c δ(y = ω_c) ln P(y = ω_c | x; θ)

where δ(·) is the indicator function. The student branch could be trained using the log-likelihood as in eq. (3), but this approach has been found to be unstable [gecdistillation, largescaleunc]. Instead, we use the teacher categorical predictions to estimate a proxy teacher Dirichlet by maximum log-likelihood. The resulting student loss is KL-divergence based:

L_S(θ) = KL( Dir(π; α̂_proxy) || p(π | x; θ) )

The proxy Dirichlet is estimated using the numerical approach developed by minka. The overall training loss becomes L(θ) = L_T(θ) + λ L_S(θ) with a small constant λ.
Deep learning models often overfit the training data, leading to less informative outputs. To alleviate this we integrate temperature scaling into the student branch loss. While training the teacher branch predictions with cross-entropy, we temperature scale those same predictions and use the scaled versions to estimate the proxy teacher Dirichlet. The student branch is thereby repeatedly taught to predict a smoother, wider Dirichlet distribution, while the teacher branch's objective remains to maximise the probability of the correct class, resulting in a middle ground.
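Temperature scaling of a categorical prediction is a one-liner; a minimal sketch (the prediction and temperature are hypothetical):

```python
import numpy as np

def temperature_scale(probs, T):
    """Smooth a categorical prediction by dividing its log-probabilities by T."""
    z = np.log(probs) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(q):
    return -(q * np.log(q)).sum()

p = np.array([0.90, 0.08, 0.02])
p_smooth = temperature_scale(p, T=2.0)

# A temperature above 1 widens the distribution, i.e. increases its entropy,
# which is exactly the smoothing effect used for the proxy Dirichlet targets.
assert entropy(p_smooth) > entropy(p)
```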
4 Self-Distribution Distilled Ensemble Approaches
If computational resources during the training phase are not constrained, this opens up the possibility of self-distribution distilled ensembles and various hierarchical distillation approaches for such models. First, note that the ensemble generation methods mentioned in previous sections can easily be used with the S2D models of the previous section. The predictive distribution of such an ensemble takes the following form:

P(y | x*, D) ≈ (1/M) Σ_{m=1}^{M} E_{p(π|x*; θ^(m))}[π]

Furthermore, an ensemble of Dirichlet models can be used to estimate uncertainty measures similar to those previously described:

I[y, π | x*, D] ≈ H[ P(y | x*, D) ] − (1/M) Σ_{m=1}^{M} E_{p(π|x*; θ^(m))}[ H[π] ]

This is a generalisation of eq. (4), since conditioning on specific weights has been replaced with conditioning on the dataset D. Computing these uncertainties requires only a few modifications compared to the standard ensemble decomposition in eq. (1).
4.1 Hierarchical Distribution Distillation
Next, the most natural step is to transfer the knowledge of an S2D (Dirichlet) ensemble into a single model. A choice needs to be made regarding the hierarchy of student modelling: should the student predict a categorical,² a Dirichlet, or a distribution over Dirichlets? These approaches are hereby given the family name hierarchical distribution distillation (H2D). Initially we start by training a student model to predict a single Dirichlet, identical to eq. (2). However, since the S2D ensemble provides, for an input x, a set of Dirichlets {Dir(π; α^(m))}_{m=1}^{M}, a modified distillation criterion is needed:

L(φ) = (1/M) Σ_{m=1}^{M} KL( Dir(π; α^(m)) || p(π | x; φ) )

where α^(m) denotes the Dirichlet parameters predicted by the m-th ensemble member. This KL-divergence based loss also allows the reverse KL criterion [revKL] to be used if desired. One criticism of this form of model, Dirichlet H2D (H2D-Dir), is that the diversity across ensemble members is lost, similar to the drawback of standard distillation. Therefore, we seek a distribution over Dirichlets to capture this higher level of diversity.

² Since transferring knowledge from a Dirichlet ensemble into a student predicting a categorical critically loses information about diversity, this method will not be investigated.
To model the space of Dirichlets we need to define a distribution over the parameters. Here we are faced with a choice: (1) model the parameters directly (restricted to the non-negative real space) or (2) apply a transformation to simplify the modelling. We apply a logarithmic transformation, allowing a simple distribution over the log Dirichlet parameters, a diagonal Gaussian, to be used (see Appendix C for a justification of this modelling choice). With these building blocks, the goal of H2D is to train a student model with weights φ to predict the parameters of a diagonal Gaussian (H2D-Gauss):

p(ln α | x; φ) = N(ln α; μ(x; φ), diag(σ²(x; φ)))

where μ(x; φ) and σ²(x; φ) are the predicted mean and variance. By sampling from this Gaussian, one can obtain multiple Dirichlet distributions similarly to, but more cheaply than, an S2D ensemble. Clearly, the flexibility of such a model could easily be extended by allowing the model to predict a fully specified covariance; however, for computational tractability only diagonal covariance models are used in this work. Note that a secondary head is required for such a model. In a similar fashion to previous approaches, this model can be trained using the negative log-likelihood or by estimating a proxy teacher Gaussian and using the KL-divergence. In this work we have adopted the proxy approach; see Appendix A.1 for details.
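Sampling Dirichlets from an H2D-Gauss prediction is cheap; a minimal numpy sketch, where the predicted mean and standard deviation for a single input are hypothetical head outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical H2D-Gauss head outputs for one input: mean and std of ln(alpha).
mu = np.array([2.0, 0.5, 0.5])       # log Dirichlet parameters, K = 3 classes
sigma = np.array([0.3, 0.3, 0.3])    # diagonal standard deviations

# Each sampled Dirichlet is one Gaussian draw plus an exponentiation,
# far cheaper than forward-passing an entire S2D ensemble.
S = 50
log_alpha = mu + sigma * rng.standard_normal((S, mu.shape[0]))
alphas = np.exp(log_alpha)           # (S, K) sampled Dirichlet parameters

# Predictive distribution: average of the sampled Dirichlet means.
means = alphas / alphas.sum(axis=1, keepdims=True)
predictive = means.mean(axis=0)

assert (alphas > 0).all()            # log-space modelling guarantees positivity
assert np.isclose(predictive.sum(), 1.0)
```

The log transformation is what makes the diagonal Gaussian admissible here: any sampled ln α maps to strictly positive Dirichlet parameters.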
5 Experimental Evaluation
This section investigates the self-distribution distillation approach on classifying image data. First, this approach is compared to standard trained models and established ensemble-based methods (deep ensembles and MC-dropout) as well as the diagonal version of SWAG (SWAG-Diag) and MIMO. Second, self-distribution distillation is combined with all of the above-mentioned approaches. Finally, knowledge distillation is compared to hierarchical distribution distillation of Dirichlet ensembles.
This comparison is based on two sets of experiments. The first set compares the performance of all baselines and proposed models in terms of image classification performance and calibration on CIFAR-100 [cifar] without (C100) and with (C100+) a data augmentation scheme. The second set of experiments compares out-of-distribution/domain (OOD) detection performance using various unseen datasets: LSUN [lsun], Tiny ImageNet [tim] and SVHN [svhn].
All experiments are based on training DenseNet-BC (k = 12) models with a depth of 100 [densenet]. For ensemble generation methods, models were either sampled (in the case of MC-dropout ensembles and SWAG) or trained (in the case of deep ensembles). For MIMO we use two output heads due to the limited capacity of the chosen model [mimo]. Note that for this choice of model it was not possible to use ensemble distribution distillation, since DenseNet-BC models display high confidence on the CIFAR-100 training data, causing instability in distillation. All single model training runs were repeated 5 times; mean ± 2 standard deviations are reported. The experimental setup and additional experiments are described in Appendices A-D.
5.1 CIFAR100 Classification Performance Experiments
The first batch of experiments shows the classification performance using a range of metrics: accuracy, negative log-likelihood (NLL) and expected calibration error (ECE); see Table 1. Perhaps the most noteworthy result is the improvement, on all metrics and datasets, of a self-distribution distilled model over its standard counterpart; the improvement is more than 2 standard deviations. A similar picture can be observed for the S2D versions of SWAG-Diag and MC-dropout which, without any notable increase in training or inference cost, improve upon their standard counterparts in all metrics. For MIMO a small gain can still be observed when switching to the self-distribution distillation framework, but this boost is smaller. Finally, for the deep ensemble approach, the S2D version shows only a marginal improvement in accuracy and NLL, and a notable increase in ECE. In fact, it is observed that ensembling standard and S2D models reduces and increases ECE, respectively. This trend is associated with the level of ensemble calibration: unlike a standard deep ensemble, the members of the S2D counterpart are close to being calibrated, displaying little to no overconfidence. Ensembling these calibrated models leads to underconfident average predictions and, hence, an increased calibration error. Note that calibration error and negative log-likelihood can easily be reduced post-training by temperature scaling predictions.
The next set of comparisons concerns the distilled models in the final block of Table 1. As expected, they all perform between an individual model and the deep ensemble. While standard ensemble distillation (knowledge distillation) was found to achieve better accuracy than the other distillation methods, this success was highly dependent on the temperature scaling value used; a suboptimal choice of temperature can drastically reduce performance. When distilling an S2D ensemble, on the other hand, no additional hyperparameters are needed. We observe that while both H2D-Dir and H2D-Gauss obtained a higher NLL, they also achieved better calibration than their S2D ensemble teacher. Lastly, both H2D-Dir and H2D-Gauss outperform the standard SWAG-Diag and MC-dropout ensembles using only a single forward pass. Although these distilled models involve an expensive training phase (a teacher ensemble is required), at test time they achieve much higher computational efficiency while still being able to estimate and decompose uncertainty.
5.2 Outofdistribution Detection Experiments
The second batch of experiments investigates the out-of-distribution detection performance of the models. The goal is to differentiate between two types of data: negative in-distribution (ID, sampled from the same source as the training data) and positive out-of-distribution (OOD) data.
In all experiments the models were trained on C100. The ID data was always the test set of C100 and the OOD data was the test set of LSUN/TIM/SVHN. Both LSUN and TIM examples had to be resized or randomly cropped as a preprocessing step before being fed to the model. Detection was performed using four uncertainty estimates: confidence, total uncertainty (TU), data or aleatoric uncertainty (DU) and knowledge or epistemic uncertainty (KU). Performance was measured using the threshold-independent AUROC [auroc] and AUPR [aupr] metrics. Due to limited space, some LSUN and TIM experiments have been moved to Appendix B.1.
First, there is not a single case in Tables 2 and 3 where an individual model, MIMO, SWAG-Diag or the MC-dropout ensemble is able to outperform the detection performance of a single S2D model. This statement holds for all analysed uncertainties apart from confidence, where both MIMO and SWAG-Diag are better by an insignificant margin. When compared to a deep ensemble, the S2D model is outperformed in many cases. The general trend is that the ensemble outputs marginally higher quality confidence and total uncertainty estimates on most datasets, but that S2D sometimes outperforms the ensemble when using data uncertainty (as in Table 3).
Interestingly, the MC ensemble seems to degrade the quality of confidence and total uncertainty compared to its standard individual counterpart. However, since an MC-dropout ensemble can estimate data uncertainty, it is able to outperform the standard model overall. Similarly, the S2D MC ensemble generally has inferior detection performance compared to its single deterministic equivalent; the only exception is in detecting SVHN, where the ensemble has marginally better data uncertainty estimates. Both SWAG-Diag and MIMO gain from being cast into the self-distribution distillation framework, drastically increasing their detection performance without additional cost at inference.
Although the S2D deep ensemble was not able to show any noticeable accuracy boost over its vanilla counterpart (on CIFAR-100), it does outperform it in this detection task. The only case where the S2D ensemble does not beat the vanilla ensemble is when both use knowledge uncertainty to detect SVHN examples under the AUPR metric. Generally, S2D based systems outperform their standard counterparts.
Regarding distillation based approaches, knowledge ensemble distillation (EnD) outperforms the standard model in all cases except SVHN detection, but in no case reaches the performance of the deep ensemble from which it was distilled. On the other hand, both the H2D-Dir and H2D-Gauss models outperform the distilled model and are able to decompose predictive uncertainty. Specifically, we find that H2D-Dir generates the highest quality knowledge uncertainty estimates in almost all cases and is able to outperform its S2D ensemble teacher using this uncertainty. The H2D-Gauss model, however, could not boast similarly high quality knowledge uncertainty; instead, it displayed the generally best performing data uncertainty estimates, outperforming the vanilla deep ensemble in all cases, and the S2D equivalent in all but SVHN detection.
6 Conclusion
Uncertainty estimation within deep learning is becoming increasingly important, with deep ensembles being the standard for estimating various sources of uncertainty. However, ensembles suffer from significantly higher computational requirements. This work proposes self-distribution distillation (S2D), a novel collection of approaches for directly training models that can estimate and decompose predictive uncertainty without explicitly training an ensemble, and that can seamlessly be combined with other approaches. Additionally, if one is not resource restricted during the training phase, a novel approach, hierarchical distribution distillation (H2D), is described for transferring the knowledge of S2D-style ensembles into a single flexible and robust student model. It is shown that S2D models outperform standard models and rival MC ensembles on the CIFAR-100 test set. Additionally, S2D produces higher quality uncertainty estimates than standard models and MC ensembles and, in most cases, better detects out-of-distribution images from the LSUN, SVHN and TIM datasets. Combining S2D with other promising approaches such as MIMO and SWAG also shows additional gains in accuracy and detection performance. S2D is even able to rival the deep ensemble in certain cases, despite requiring only a single forward pass. Furthermore, S2D deep ensembles and H2D-derived student models are shown to notably outperform the deep ensemble in almost all detection problems. These promising results show that the efficient self-distribution distillation and novel hierarchical distribution distillation approaches have the potential to train robust uncertainty estimating models able to outperform deep ensembles. Future work should investigate self-distribution distillation in other domains such as natural language processing and speech recognition; the need for more efficient uncertainty estimation is especially pressing in these areas as they often utilise large-scale models. One could also analyse variations of S2D, such as utilising less weight sharing, generating more diverse teacher predictions or changing the student modelling choices.
References
Appendix A Experimental Configuration
Dataset  Train  Test  Classes
CIFAR-100  50000  10000  100
LSUN  –  10000  10
SVHN  –  26032  10
Tiny ImageNet  –  10000  200
All models were trained on the CIFAR-100 dataset, with and without data augmentation. The augmentation scheme involves randomly mirroring and shifting images, following daug1, daug2. The remaining datasets were used as out-of-distribution samples in the detection task.
All individual models and ensemble members were based on the DenseNet-BC (k = 12, 100 layers) architecture and trained according to densenet. SWAG-Diag was obtained by checkpointing the weights of the last 20 epochs with a reduced learning rate. MIMO with two output heads was trained using the same setup as the standard model; to keep training costs comparable to (S2D) individual models, no batch or input repetition was used [mimo]. Similarly, all self-distribution distilled equivalents were trained with identical training recipes, with the addition of a student loss.

Regarding distillation-based models, the EnD baseline was trained using the negative log-likelihood of the average temperature-scaled prediction of the teacher ensemble. For the hierarchical distribution distillation approaches the students were first initialised with the weights of an S2D model trained for 150 epochs, for increased stability. Thereafter, each student was trained using the appropriate H2D criterion with a significantly reduced learning rate. H2D-Dir was trained for an additional 150 epochs. H2D-Gauss required an initial learning rate which was reduced by a factor of 2 after 75 and 150 epochs; it was trained for 170 epochs. Additionally, uncertainties were computed by generating 50 samples from each Gaussian prediction, since this modelling choice does not result in closed-form expressions.
A.1 Proxy Target Training
Since the use of negative loglikelihood can be unstable in training S2D and distilling H2D models we utilise proxy targets and KLdivergence. It has already been mentioned that the proxy target in S2D follows:
(5) 
Each categorical prediction is temperature-scaled to mitigate overconfident predictions. While H2D-Dir does not require any proxy targets, the Gaussian equivalent does. The proxy diagonal Gaussian, estimated according to maximum log-likelihood, has a closed-form expression:
(6) 
where $\odot$ represents element-wise multiplication. This is then used in a KL-divergence based loss, training the student prediction according to:
(7) 
Note, however, that the proxy targets are detached from any gradient back-propagation calculations. This simulates typical teacher-student knowledge transfer, where teacher weights are kept fixed during student training.
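The proxy-target computations above can be sketched as follows; a minimal numpy illustration assuming the M logit samples are stacked as an [M, K] array. The function names are hypothetical, and in an actual training framework the proxy statistics would be computed under a stop-gradient/detach, as described above:

```python
import numpy as np

def proxy_gaussian(logit_samples):
    """Closed-form maximum-likelihood diagonal Gaussian over M logit samples.
    logit_samples: [M, K]. Returns per-dimension mean and variance."""
    mu = logit_samples.mean(axis=0)
    # E[z * z] - mu * mu, with * acting element-wise
    var = (logit_samples * logit_samples).mean(axis=0) - mu * mu
    return mu, var

def diag_gaussian_kl(mu_t, var_t, mu_s, var_s):
    """KL(teacher proxy || student) between diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(
        np.log(var_s / var_t) + (var_t + (mu_t - mu_s) ** 2) / var_s - 1.0
    )
```

Treating `(mu_t, var_t)` as constants when minimising the KL term mirrors keeping the teacher fixed during student training.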
Appendix B Out-of-distribution Detection
This section covers the remaining out-of-distribution detection experiments. First, we cover the LSUN and Tiny ImageNet detection problems for all models considered in section 5.2. Thereafter, additional experiments are run on ensembles of various sizes, to investigate whether the low quality of knowledge uncertainty estimates is caused by a limited number of ensemble members.
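For reference, the decomposition of total uncertainty into data and knowledge components used throughout these detection experiments can be sketched with the standard entropy-based formulation (function and variable names are illustrative):

```python
import numpy as np

def decompose_uncertainty(probs):
    """probs: [M, K] categorical predictions of M ensemble members for one input.
    Returns (total, data, knowledge) uncertainty in nats."""
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged prediction
    total = -np.sum(mean_p * np.log(mean_p + eps))
    # Data uncertainty: average entropy of the member predictions
    data = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Knowledge uncertainty: mutual information (member disagreement)
    knowledge = total - data
    return total, data, knowledge
```

When all members agree, knowledge uncertainty vanishes; disagreement between members, as expected for out-of-distribution inputs, drives it up.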
B.1 Tiny ImageNet Experiments
Similar to the results in section 5.2, the S2D Deep ensemble and H2D-Gauss outperformed all other models; see Tables 6 and 7. The only exception is the use of confidence on resized TIM with the AUROC metric, where the Deep ensemble marginally outperforms its S2D equivalent. However, unlike previous results, knowledge uncertainty seems to perform on par with, or outperform, confidence. The one exception is the MC ensemble.
B.2 Ensemble Size Experiments
Knowledge uncertainty was found to have underwhelming performance (especially for MC and Deep ensembles) and did not show trends similar to prior work [priornetworks, structured, en2d]. To possibly mitigate this, the ensemble size was increased, as a smaller number of models could lead to inaccurate measures of diversity and knowledge uncertainty. Results are compiled in Tables 8–13.
Ensemble Type  Ensemble Size (M)  Acc.  NLL  %ECE
MC  5  75.6 ± 0.9  0.94 ± 0.04  6.67 ± 1.18
MC  10  75.8 ± 0.9  0.92 ± 0.04  6.11 ± 1.11
MC  20  76.0 ± 1.0  0.91 ± 0.04  5.81 ± 1.12
Deep  5  79.3  0.76  1.44
Deep  10  80.1  0.71  1.91
Deep  20  80.3  0.68  2.19
Performance on the CIFAR100 test set is shown in Table 8. Increasing the ensemble size leads to improved accuracy and lower negative log-likelihood, as would be expected. The MC ensemble also becomes better calibrated. The Deep ensemble, on the other hand, shows increasing calibration error with the number of members, because the ensemble prediction becomes underconfident when averaging over a large number of members.
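The %ECE figures reported above follow the usual binned formulation, which can be sketched as follows (the bin count is an assumption; names are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE sketch: average |accuracy - confidence| gap per bin,
    weighted by the fraction of samples falling in that bin.
    confidences: [N] max predicted probabilities; correct: [N] 0/1 floats."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    n = len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return 100.0 * ece  # as a percentage, matching the %ECE columns
```

An underconfident ensemble (accuracy above confidence within bins) raises this value just as overconfidence does, which is consistent with the Deep ensemble's calibration error growing with ensemble size.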
Out-of-distribution detection performance on LSUN, SVHN and TIM is compiled in Tables 9–13. Although the MC ensemble enjoys improved accuracy when increased in size, its OOD detection performance remains relatively unaffected for every uncertainty metric. In detecting LSUN using random crops, the performance of knowledge uncertainty in fact deteriorates notably. Overall, this points to the MC ensemble's inability to utilise the information from additional ensemble member draws/samples for better uncertainty estimation. The Deep ensemble, on the other hand, generally improves with increasing size on every metric, albeit with diminishing returns. In this case all uncertainties improve with ensemble size, not only knowledge uncertainty. It therefore seems that confidence, total and data uncertainty outperforming knowledge uncertainty is not caused by the ensemble size being limited to five members.
Appendix C Behaviour of Uncertainties
This section investigates how the uncertainties produced by a vanilla Deep ensemble differ from those of self-distribution distilled systems, and how well hierarchical distribution distillation captures the behaviour of its teacher. The comparison is made between the in-domain CIFAR100 test set and, for simplicity, only the out-of-domain SVHN test set.
Figure 3 contrasts the various uncertainties on the CIFAR100 (ID) and SVHN (OOD) test sets. Clearly, the S2D systems output ID uncertainties in a consistent manner, even matching the conceptually different Deep ensemble. Observe that S2D integrates temperature scaling (smoothing predictions) into the training of models; the total and data uncertainties estimated by these models will therefore naturally have larger entropy than those of Deep ensembles (knowledge uncertainty, however, does not necessarily increase with temperature). While it is expected that the Deep ensemble would behave differently on the SVHN OOD set, it is surprising how well H2D-Gauss aligns with its S2D Deep ensemble teacher. An individual S2D model was also able to generate closely related total and data uncertainty estimates, but suffers significantly in producing consistent knowledge uncertainties. These results raise the question of whether a Gaussian student could capture the diversity of a vanilla Deep ensemble by modelling its logits, similarly to how H2D-Gauss models its teacher; a possible avenue for future work.
Appendix D Additional Experiments: WideResNet
Following the DenseNet-BC experiments in section 5, we repeated them with a different architecture. In this section we focus on a significantly larger WideResNet [wideresnet] model with a depth of 28 and a widening factor of 10. The standard and S2D models were both trained as described in [wideresnet], with the S2D-specific parameters being the same as previously described. The only difference is that teacher predictions were generated using multiplicative Gaussian noise with a fixed standard deviation.
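A minimal sketch of generating teacher prediction samples with multiplicative Gaussian noise; the standard deviation and the point of application (here, the logits) are assumed placeholders, since the paper's exact choices are not given in this text:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_teacher_logits(logits, sigma=0.1, n_samples=5):
    """Draw n_samples teacher predictions by scaling the logits with
    multiplicative Gaussian noise ~ N(1, sigma^2); sigma is an assumed value."""
    noise = rng.normal(loc=1.0, scale=sigma, size=(n_samples,) + logits.shape)
    return logits * noise
```

Each draw perturbs the prediction multiplicatively, so the spread of the resulting samples scales with the magnitude of the logits.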
The H2D-Gauss model was also trained in a different manner. First, it was initialised from an S2D model trained for 150 epochs. Thereafter it was trained for an additional 80 epochs with a starting learning rate that was reduced by a factor of 4 after 60 epochs. For this section, EnD and H2D-Dir were not investigated.
Table 14 shows test set performance. Unlike in previous experiments, S2D was not able to outperform an individual model by more than two standard deviations, achieving around a one standard deviation improvement in accuracy. Interestingly, the MC approach has worse accuracy in both the standard and S2D cases; however, this could be due to the small number of drawn samples.
Dataset  C100  C100+  
Model  Acc.  NLL  %ECE  Acc.  NLL  %ECE 
Individual  73.9 ± 0.5  1.05 ± 0.02  5.26 ± 0.78  81.1 ± 0.3  0.76 ± 0.01  5.21 ± 0.44
S2D Individual  74.2 ± 0.5  1.06 ± 0.05  5.48 ± 2.25  81.3 ± 0.3  0.74 ± 0.01  4.24 ± 0.74
MC ensemble  73.6 ± 0.5  1.05 ± 0.03  4.70 ± 0.88  81.0 ± 0.5  0.74 ± 0.01  3.29 ± 0.36
S2D MC ensemble  73.8 ± 0.4  1.03 ± 0.04  2.95 ± 1.01  81.0 ± 0.3  0.73 ± 0.01  1.99 ± 0.35
Deep ensemble  77.1  0.88  5.08  83.4  0.63  2.27 
S2D Deep ensemble  77.9  0.86  4.52  83.6  0.63  1.84 
H2DGauss  77.4  0.95  5.19  82.8  0.71  2.45 
Furthermore, both Deep ensembles significantly outperform their individual equivalents, with the S2D version being slightly better on all measured performance metrics. The notable result in this table is the high performance of H2D-Gauss, which outperforms the Deep ensemble on C100 and achieves near-ensemble performance on C100+.
In the OOD detection task we observe that both versions of the MC ensemble struggle to outperform their individual counterparts. There also seems to be a disparity in performance between resized and randomly cropped LSUN and TIM. With random crops, all S2D systems notably outperform their standard counterparts; in this case both the S2D Individual model and H2D-Gauss were able to outperform the Deep ensemble using any uncertainty metric. With resized LSUN and TIM images, and on SVHN, the performance differences are smaller, but the S2D Deep ensemble remains the best model, with H2D-Gauss and the Deep ensemble performing similarly.