Ensemble Distribution Distillation

04/30/2019 ∙ by Andrey Malinin, et al. ∙ University of Cambridge

Ensembles of Neural Network (NN) models are known to yield improvements in accuracy. Furthermore, they have been empirically shown to yield robust measures of uncertainty, though without theoretical guarantees. However, ensembles come at a high computational and memory cost, which may be prohibitive for certain applications. There has been significant work on distilling an ensemble into a single model. Such approaches decrease computational cost and allow a single model to achieve accuracy comparable to that of an ensemble. However, information about the diversity of the ensemble, which can yield estimates of knowledge uncertainty, is lost. Recently, a new class of models, called Prior Networks, has been proposed, which allows a single neural network to explicitly model a distribution over output distributions, effectively emulating an ensemble. In this work, ensembles and Prior Networks are combined to yield a novel approach called Ensemble Distribution Distillation (EnD^2), which allows an ensemble to be distilled into a single Prior Network. This allows a single model to retain both the improved classification performance and the measures of diversity of the ensemble. In this initial investigation, the properties of EnD^2 have been investigated and confirmed on an artificial dataset.



1 Introduction

Neural Networks (NNs) have emerged as the state-of-the-art approach to a variety of machine learning tasks in domains such as computer vision [Girshick2015; vgg; videoprediction], natural language processing [embedding1; embedding2; mikolov-rnn], speech recognition [dnnspeech; DeepSpeech] and bio-informatics [Caruana2015; dnarna]. Despite impressive supervised learning performance, NNs tend to make over-confident predictions [deepensemble2017] and, until recently, have been unable to provide measures of uncertainty in their predictions. As NNs are increasingly applied to safety-critical tasks such as medical diagnosis [de2018clinically], biometric identification [schroff2015facenet] and self-driving cars, estimating uncertainty in a model's predictions is crucial, as it enables the safety of an AI system [aisafety] to be improved by acting on the predictions in an informed manner. Ensembles of NNs are known to yield increased accuracy over a single model [murphy], allow useful measures of uncertainty to be derived [deepensemble2017], and also provide a defense against adversarial attacks [gal-adversarial]. Ensemble approaches have also been successfully applied in the area of speech recognition [ensemble-asr; ensemble-asr2]. There is a range of Bayesian Monte-Carlo approaches [Gal2016Dropout; langevin; fast-ensembling-vetrov], as well as non-Bayesian approaches, such as random initialization and bagging, to generating ensembles. Crucially, ensemble approaches allow total uncertainty in predictions to be decomposed into knowledge uncertainty and data uncertainty. Data uncertainty is the irreducible uncertainty in predictions which arises due to the complexity, multi-modality and noise in the data. Knowledge uncertainty, also known as epistemic uncertainty [galthesis] or distributional uncertainty [malinin-pn-2018], is uncertainty due to a lack of understanding or knowledge on the part of the model regarding the current input for which the model is making a prediction. In other words, this form of uncertainty arises when the test input comes from a distribution different from the one that generated the training data, which is why the name distributional uncertainty is also used. Mismatch between the test and training distributions is also known as dataset shift [Datasetshift], a situation which often arises in real-world problems.
Distinguishing between sources of uncertainty is important, as in certain machine learning applications it may be necessary to know not only whether the model is uncertain, but also why. For instance, in active learning, additional training data should be collected from regions with high knowledge uncertainty, but not high data uncertainty. A fundamental limitation of ensembles is that the computational cost of training and inference can be many times greater than that of a single model. One solution is to distill an ensemble of models into a single network that yields the mean predictions of the ensemble [dark-knowledge; bayesian-dark-knowledge]. However, this collapses an ensemble of conditional distributions over classes into a single point-estimate conditional distribution over classes. As a result, knowledge about the diversity of the ensemble is lost. This prevents measures of knowledge uncertainty, such as mutual information [malinin-pn-2018; mutual-information], from being estimated. A new class of models was recently introduced, known as Prior Networks [malinin-pn-2018], which explicitly model a conditional distribution over categorical distributions by parameterizing a Dirichlet distribution. This effectively allows these models to emulate an ensemble. Prior Networks were shown to achieve excellent results for out-of-distribution input detection and misclassification detection. In this work, we investigate the distillation of an ensemble of models into a Prior Network, referred to as Ensemble Distribution Distillation (EnD^2), as a way to both preserve the distributional information of an ensemble and improve the performance of a Prior Network. The contributions of this work are as follows. Firstly, the distillation of an ensemble of neural network models into a single Prior Network is investigated on artificial data, which allows the behaviour of the ensemble to be visualized. It is shown that distribution-distilled Prior Networks are able to distinguish between data uncertainty and knowledge uncertainty. Secondly, EnD^2 is evaluated on CIFAR-10 on the task of identifying out-of-distribution (OOD) samples, where it outperforms standard DNNs and regular Ensemble Distillation (EnD), approaching the performance of the original ensemble.

2 Ensembles

In this work, a Bayesian viewpoint on ensembles is adopted, as it provides a particularly elegant probabilistic framework, which allows knowledge uncertainty to be linked to Bayesian model uncertainty. However, it is also possible to construct ensembles using a range of non-Bayesian approaches. For example, it is possible to explicitly construct an ensemble of models by training on the same data with different random seeds [deepensemble2017] and/or different model architectures. Alternatively, it is possible to generate ensembles via Bootstrap methods [murphy; bootstrap-dqn] in which each model is trained on a re-sampled version of the training data. The essence of Bayesian methods is to treat the model parameters $\theta$ as random variables and place a prior distribution $p(\theta)$ over them to compute a posterior distribution over models via Bayes' rule:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})} \qquad (1)$$

Here, model uncertainty is captured in the posterior distribution $p(\theta \mid \mathcal{D})$. Consider an ensemble of models sampled from the posterior:

$$\big\{P(y \mid x^{*}, \theta^{(m)})\big\}_{m=1}^{M}, \quad \theta^{(m)} \sim p(\theta \mid \mathcal{D}) \qquad (2)$$

where $\pi^{(m)} = f(x^{*}; \theta^{(m)})$ are the parameters of a categorical distribution $P(y \mid \pi^{(m)})$. The expected predictive distribution for a test input $x^{*}$ is obtained by taking the expectation with respect to the model posterior:

$$P(y \mid x^{*}, \mathcal{D}) = \mathbb{E}_{p(\theta \mid \mathcal{D})}\big[P(y \mid x^{*}, \theta)\big] \approx \frac{1}{M}\sum_{m=1}^{M} P(y \mid x^{*}, \theta^{(m)}) \qquad (3)$$

Each of the models yields a different estimate of data uncertainty. Uncertainty in predictions due to model uncertainty is expressed as the level of spread, or 'disagreement', of an ensemble sampled from the posterior. The aim is to craft a posterior $p(\theta \mid \mathcal{D})$, via an appropriate choice of prior $p(\theta)$, which yields an ensemble that exhibits the desired set of behaviours described in figure 1.
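As a minimal sketch of the expected predictive distribution described above (function and variable names are illustrative, not from the paper), the Monte-Carlo approximation is simply an average of the members' categorical outputs:

```python
import numpy as np

def ensemble_predictive(member_probs):
    """Monte-Carlo estimate of the expected predictive distribution:
    the mean of the M sampled models' categorical distributions."""
    member_probs = np.asarray(member_probs, float)  # shape (M, K)
    return member_probs.mean(axis=0)

# Three hypothetical models, three classes.
p_bar = ensemble_predictive([[0.7, 0.2, 0.1],
                             [0.6, 0.3, 0.1],
                             [0.8, 0.1, 0.1]])
```

Here `member_probs` stacks each member's softmax output; the mean is itself a valid categorical distribution.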

(a) Confident Prediction
(b) Data Uncertainty
(c) Knowledge Uncertainty
Figure 1: Desired Behaviours of an Ensemble.

Specifically, for an in-domain test input $x^{*}$, the ensemble should produce a consistent set of predictions with little spread, as depicted in figures 1(a) and 1(b). In other words, the models should agree in their estimates of data uncertainty. On the other hand, for inputs which are different from the training data, the models in the ensemble should 'disagree' and produce a diverse set of predictions, as shown in figure 1(c). Ideally, the models should yield increasingly diverse predictions as the input moves further away from the training data. If an input is completely unlike the training data, then the level of disagreement should be significant. Hence, measures of model uncertainty will capture knowledge uncertainty given an appropriate choice of prior. Given an ensemble which exhibits the desired set of behaviours, the entropy of the expected distribution can be used as a measure of total uncertainty in the prediction. Uncertainty in predictions due to model uncertainty can be assessed via measures of the spread, or 'disagreement', of the ensemble, such as the mutual information:

$$\mathcal{I}[y, \theta \mid x^{*}, \mathcal{D}] = \underbrace{\mathcal{H}\big[\mathbb{E}_{p(\theta \mid \mathcal{D})}[P(y \mid x^{*}, \theta)]\big]}_{\text{Total Uncertainty}} \;-\; \underbrace{\mathbb{E}_{p(\theta \mid \mathcal{D})}\big[\mathcal{H}[P(y \mid x^{*}, \theta)]\big]}_{\text{Expected Data Uncertainty}} \qquad (4)$$

This formulation of mutual information allows the total uncertainty to be decomposed into model uncertainty and expected data uncertainty. The entropy of the expected distribution, or total uncertainty, will be high whenever the model is uncertain, both in regions of severe class overlap and out-of-domain. However, the difference between the entropy of the expected posterior and the expected entropy of the posterior will be non-zero only if the models disagree. For example, in regions of class overlap, each member of the ensemble will yield a high-entropy posterior (figure 1(b)); the entropy of the expected distribution and the expected entropy will be similar, and the mutual information will be low. In this situation, total uncertainty is dominated by data uncertainty. On the other hand, for out-of-domain inputs the ensemble yields diverse posterior distributions over classes, such that the expected posterior over classes is near uniform (figure 1(c)), while the expected entropy may be much lower. In this region of input space, the models' understanding of the data is low, and the estimates of expected data uncertainty are poor.
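The decomposition above can be sketched numerically (a toy illustration with invented numbers, not from the paper): total uncertainty is the entropy of the ensemble mean, expected data uncertainty is the mean of the members' entropies, and their difference is the mutual information.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log(np.where(p > 0, p, 1.0)), axis=axis)

def uncertainty_decomposition(member_probs):
    """Return (total, expected data, knowledge) uncertainty for an
    ensemble of categorical distributions of shape (M, K)."""
    member_probs = np.asarray(member_probs, float)
    total = entropy(member_probs.mean(axis=0))
    expected_data = entropy(member_probs).mean()
    return total, expected_data, total - expected_data

# Members agree but are uncertain: data uncertainty dominates.
_, _, mi_agree = uncertainty_decomposition([[0.5, 0.5], [0.5, 0.5]])
# Members are confident but disagree: knowledge uncertainty dominates.
_, _, mi_disagree = uncertainty_decomposition([[0.99, 0.01], [0.01, 0.99]])
```

Both toy ensembles have the same (maximal) total uncertainty, but only the disagreeing one carries knowledge uncertainty.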

3 Ensemble Distribution Distillation

Previous work [dark-knowledge; bayesian-dark-knowledge] has investigated distilling a single large network into a smaller one, and an ensemble of networks into a single neural network. In general, this is done by minimizing the KL-divergence between the model and the expected predictive distribution of the ensemble:

$$\hat{\theta} = \arg\min_{\theta}\; \mathbb{E}_{\hat{p}(x)}\Big[\mathcal{KL}\big[P(y \mid x, \mathcal{D}) \,\big\|\, P(y \mid x; \theta)\big]\Big] \qquad (5)$$

This approach essentially aims to train a single model that captures the mean of an ensemble, allowing the model to achieve higher classification performance at a far lower computational cost. However, the limitation of this approach with regard to uncertainty estimation is that the information about the diversity of the ensemble is lost. As a result, it is no longer possible to decompose total uncertainty into knowledge uncertainty and data uncertainty via the mutual information, as in equation 4. In this section we propose an approach called Ensemble Distribution Distillation (EnD^2) that allows a single model to capture not only the mean of an ensemble, but also its diversity. An ensemble of models can be viewed as a set of samples from an implicit distribution over output distributions:

$$\big\{\pi^{(m)}\big\}_{m=1}^{M} \sim p(\pi \mid x^{*}, \mathcal{D}), \quad \pi^{(m)} = f(x^{*}; \theta^{(m)}), \;\; \theta^{(m)} \sim p(\theta \mid \mathcal{D}) \qquad (6)$$

Recently, a new class of models was proposed, called Prior Networks [malinin-pn-2018], which explicitly parameterize a conditional distribution over output distributions using a single neural network. Thus, a Prior Network is able to effectively emulate an ensemble and therefore yield the same measures of uncertainty. A Prior Network models a distribution over categorical output distributions by parameterizing a Dirichlet distribution:

$$p(\pi \mid x; \theta) = \mathrm{Dir}(\pi \mid \alpha), \quad \alpha_c > 0, \;\; \alpha_0 = \sum_{c=1}^{K}\alpha_c \qquad (7)$$

The distribution is parameterized by its concentration parameters $\alpha$, which can be obtained by placing an exponential function at the output of a Prior Network: $\alpha_c = e^{z_c(x)}$. In this work we consider how an ensemble, which is a set of samples from an implicit distribution over distributions, can be distribution-distilled into an explicit distribution over distributions modelled using a single Prior Network. This is accomplished in several steps. A transfer dataset $\mathcal{D}_{\text{transfer}}$ is composed of the inputs from the original training set and the categorical distributions derived from the ensemble for each input. Given this transfer set, the Prior Network is trained by minimizing the negative log-likelihood of each categorical distribution $\pi^{(m)}$:

$$\mathcal{L}(\theta, \mathcal{D}_{\text{transfer}}) = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{M}\sum_{m=1}^{M} \ln p\big(\pi_i^{(m)} \mid x_i; \theta\big) \qquad (8)$$
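As an illustrative sketch of this training criterion for a single input (names are our own; `scipy` supplies the log-gamma function), the negative log-likelihood of the ensemble's categorical distributions under the predicted Dirichlet is:

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_nll(alpha, ensemble_pis, eps=1e-12):
    """Average negative log-density of the ensemble's categoricals pi^(m)
    under Dir(pi | alpha) -- the distribution-distillation loss for one input."""
    a = np.asarray(alpha, float)            # (K,) concentration parameters
    pis = np.asarray(ensemble_pis, float)   # (M, K), one categorical per member
    log_norm = gammaln(a.sum()) - gammaln(a).sum()
    log_dens = log_norm + ((a - 1.0) * np.log(pis + eps)).sum(axis=1)
    return float(-log_dens.mean())
```

A Dirichlet concentrated where the ensemble samples actually lie yields a lower loss than one concentrated elsewhere, which is what pushes the Prior Network to match both the mean and the spread of the ensemble.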
Thus, Ensemble Distribution Distillation is a straightforward application of maximum-likelihood estimation to Prior Network models. Given a distribution-distilled Prior Network, the predictive distribution is given by the expected categorical distribution under the Dirichlet prior:

$$P(y = \omega_c \mid x^{*}; \hat{\theta}) = \mathbb{E}_{p(\pi \mid x^{*}; \hat{\theta})}\big[P(y = \omega_c \mid \pi)\big] = \frac{\alpha_c}{\alpha_0} \qquad (9)$$

Separable measures of uncertainty can be obtained by considering the mutual information between the prediction $y$ and the parameters $\pi$ of the categorical:

$$\mathcal{I}[y, \pi \mid x^{*}; \hat{\theta}] = \mathcal{H}\big[\mathbb{E}_{p(\pi \mid x^{*}; \hat{\theta})}[P(y \mid \pi)]\big] - \mathbb{E}_{p(\pi \mid x^{*}; \hat{\theta})}\big[\mathcal{H}[P(y \mid \pi)]\big] \qquad (10)$$

Similar to equation 4, this expression allows total uncertainty, given by the entropy of the expected distribution, to be decomposed into data uncertainty and knowledge uncertainty. If Ensemble Distribution Distillation is completely successful, then the measures of uncertainty derivable from a distribution-distilled Prior Network should be identical to those derived from the original ensemble.
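For a Dirichlet, both terms of this decomposition are available in closed form; a sketch (our own helper, using the standard digamma identity for the expected entropy of a categorical under a Dirichlet):

```python
import numpy as np
from scipy.special import digamma

def dirichlet_uncertainties(alpha):
    """Return (total, expected data, knowledge) uncertainty for Dir(pi | alpha).

    Total is the entropy of the expected categorical alpha / alpha_0, and
    E[H[P(y|pi)]] = -sum_c (alpha_c/alpha_0) * (digamma(alpha_c + 1) - digamma(alpha_0 + 1))."""
    a = np.asarray(alpha, float)
    a0 = a.sum()
    p = a / a0                                   # expected categorical
    total = -np.sum(p * np.log(p))
    expected_data = -np.sum(p * (digamma(a + 1.0) - digamma(a0 + 1.0)))
    return total, expected_data, total - expected_data
```

A sharp Dirichlet around the uniform categorical (large alpha_0) gives high total uncertainty but near-zero knowledge uncertainty (pure data uncertainty); a flat Dirichlet (all alpha_c = 1) keeps the same total but attributes part of it to knowledge uncertainty.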

3.1 Temperature Annealing

Minimization of the negative log-likelihood of the model on the transfer dataset is equivalent to minimization of the KL-divergence between the model and the empirical distribution. As previously discussed, this distribution is often 'sharp' on the training data, which limits the common support between the output distribution of the model and the target empirical distribution. Optimization of the KL-divergence between distributions with limited common support is particularly difficult. To alleviate this issue, the proposed solution is to use temperature to 'heat up' both distributions and increase their common support. The empirical distribution is heated up by raising the temperature $T$ of the softmax of each model in the ensemble, in the same way as in the original paper on ensemble distillation [dark-knowledge]. The output Dirichlet distribution of the EnD^2 model is heated up by raising the temperature of the output exponential function which yields the concentration parameters: $\alpha_c = e^{z_c(x)/T}$. As training progresses, it is necessary to re-emphasize the difference between the empirical distribution and the model in order to provide a tighter fit. To do this, a temperature annealing schedule is used, where training begins with a high initial temperature which is lowered back down to $T = 1$ as training progresses.
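A sketch of how this might look (the linear schedule and the initial temperature of 10 are our own assumptions; the paper does not specify either):

```python
import numpy as np

def tempered_probs(logits, T):
    """Softmax of an ensemble member's logits at temperature T; T > 1
    'heats up' the empirical distribution, increasing common support."""
    z = np.asarray(logits, float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def annealed_temperature(step, total_steps, T_init=10.0, T_final=1.0):
    """Hypothetical linear annealing from T_init back down to T_final = 1."""
    frac = min(step / max(total_steps, 1), 1.0)
    return T_init + frac * (T_final - T_init)
```

During distillation, the same `annealed_temperature` value would be applied both to the ensemble's softmax targets and to the Prior Network's output exponential.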

4 Experiments on Artificial Data

(a) Spiral Dataset
(b) Spiral Dataset with OOD data
Figure 2: 3-spiral dataset with 1000 examples per class.

The current section investigates Ensemble Distribution Distillation (EnD^2) on the artificial dataset shown in figure 2(a). This dataset consists of 3 spiral arms extending from the center, with both increasing noise and increasing distance between the arms. Each arm corresponds to a single class. This dataset is chosen such that it is not linearly separable and requires a powerful model to correctly model the decision boundaries, and such that there are definite regions of class overlap. In the following set of experiments, an ensemble of 100 neural networks is constructed by training networks from 100 different random initializations. A smaller (sub-)ensemble of only 10 neural networks is also considered. The models are trained on 3000 data points sampled from the spiral dataset, with 1000 examples per class. The classification performance of EnD^2 is compared to the performance of the individual neural networks, the overall ensemble and Ensemble Distillation (EnD). The results are presented in table 1.
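A generator in the spirit of this dataset might look as follows (the exact radii, twist and noise model used in the paper are not given, so every constant here is an assumption):

```python
import numpy as np

def make_spirals(n_per_class=1000, n_classes=3, noise=0.2, seed=0):
    """Sample a spiral-arms dataset: one arm per class, with noise that
    grows with the distance from the centre (illustrative constants)."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for c in range(n_classes):
        t = np.sqrt(rng.uniform(0.0, 1.0, n_per_class))       # radius in [0, 1]
        angle = 2.0 * np.pi * (c / n_classes + 1.5 * t)       # arm offset + twist
        arm = np.stack([t * np.cos(angle), t * np.sin(angle)], axis=1)
        arm += noise * t[:, None] * rng.standard_normal((n_per_class, 2))
        xs.append(arm)
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)
```

Scaling the noise by the radius `t` reproduces the key property of the dataset: the arms overlap more, and the labels become more ambiguous, further from the centre.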

Num. models  Individual  Ensemble  EnD    EnD^2
10           86.79       87.37     87.43  87.48
100          86.79       87.63     87.61  87.53
Table 1: Classification Performance (% Accuracy) on a test set of size 1000, trained on a training set of size 1000, with 3 spiral classes. Dataset sizes are given as number of examples per class.

The results show that an ensemble of 10 models has a clear performance gain over the mean performance of the individual models. An ensemble of 100 models has a smaller additional performance gain over an ensemble of only 10 models. Ensemble Distillation (EnD) is able to recover the classification performance of both an ensemble of 10 and 100 models with only very minor degradation. Finally, Ensemble Distribution Distillation (EnD^2) is also able to recover most of the performance gain of an ensemble, but with a slightly larger degradation. This is likely due to forcing a single model to learn not only the mean, but also the distribution around it, which requires more capacity from the network.

(a) Ensm. Total Uncertainty
(b) Ensm. Data Uncertainty
(c) Ensm. Knowledge Uncertainty
(d) EnD^2 Total Uncertainty
(e) EnD^2 Data Uncertainty
(f) EnD^2 Knowledge Uncertainty
Figure 3: Comparison of measures of uncertainty derived from an Ensemble and EnD^2.

The measures of uncertainty derived from an ensemble of 100 models and from Ensemble Distribution Distillation are presented in figure 3. The results show that EnD^2 successfully captures data uncertainty and correctly decomposes total uncertainty into knowledge uncertainty and data uncertainty. However, it fails to appropriately capture knowledge uncertainty further away from the training region; there are obvious dark 'holes' in figure 3(f), where the model yields low knowledge uncertainty far from the region of training data.

Distillation data  Individual  Ensemble  EnD    EnD^2
train              86.79       87.63     87.61  87.53
train + OOD        -           -         87.59  87.50
Table 2: Classification Performance (% Accuracy) on the test set, trained on either the training data alone or the training data plus OOD data, with 3 classes. All datasets are of size 1000 examples per class. Results are for an ensemble of 100 models.

In order to overcome these issues, a thick ring of inputs far from the training data was sampled, as depicted in figure 2(b). The predictions of the ensemble were obtained for these input points and used as additional out-of-distribution training data. Table 2 shows how using this additional training data affects the performance of Ensemble Distillation and Ensemble Distribution Distillation. There is a minor drop in the performance of both distillation approaches. However, the overall level of performance is not compromised and is still higher than the average performance of the individual DNN models. The behaviour of measures of uncertainty derived from Ensemble Distribution Distillation with additional out-of-distribution training data is shown in figure 4.
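The OOD 'ring' can be sampled, for instance, as an area-uniform annulus around the spirals (the radii here are assumptions; the paper does not specify them):

```python
import numpy as np

def sample_ring(n, r_min=2.0, r_max=3.0, seed=0):
    """Sample n points uniformly (by area) from the annulus r_min <= r <= r_max,
    to serve as out-of-distribution inputs around the training region."""
    rng = np.random.default_rng(seed)
    r = np.sqrt(rng.uniform(r_min ** 2, r_max ** 2, n))  # area-uniform radius
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
```

Taking the square root of a uniform draw over squared radii avoids over-sampling the inner edge of the ring.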

(a) EnD^2 Total Uncertainty
(b) EnD^2 Data Uncertainty
(c) EnD^2 Knowledge Uncertainty
Figure 4: Measures of uncertainty derived from EnD^2 using training and OOD data.

These results show that, when the out-of-distribution behaviour of an ensemble is explicitly distribution-distilled into a Prior Network, EnD^2 is able to successfully capture that behaviour. However, Ensemble Distribution Distillation will not necessarily capture out-of-distribution behaviour based purely on the in-domain behaviour of an ensemble. This is likely compounded by the fact that the diversity of an ensemble on training data that the models have seen is smaller than their diversity on a held-out test set.

5 Experiments on Image Data

Having confirmed the properties of EnD^2 on an artificial dataset, we now investigate EnD^2 on real image datasets - CIFAR-10 and CIFAR-100 [cifar] - to assess its ability to distill uncertainty metrics from an ensemble on a more practical task, while retaining the desirable classification performance advantage of Ensemble Distillation. Similarly to section 4, an ensemble of 100 models is constructed by training NNs on CIFAR-10/100 data from different random initializations. The transfer dataset is constructed from CIFAR-10/100 inputs and ensemble logits, to allow for temperature annealing during training. In addition, we also consider Ensemble Distillation and Ensemble Distribution Distillation on a transfer set that contains both the original CIFAR-10/100 training data and out-of-domain data taken from the other dataset (in other words, the OOD training data for CIFAR-10 is CIFAR-100, and vice versa), termed EnD+AUX and EnD^2+AUX respectively. As Ensemble Distribution Distillation is a more complex task than standard Ensemble Distillation, both the EnD and EnD^2 models have been trained for 90 epochs on CIFAR-10, in contrast to the ensemble models, each of which has been trained for 45 epochs (see table 6). The training configurations of EnD and EnD^2 on CIFAR-100 are identical to the configuration for the CIFAR-10 models, which may be sub-optimal. It is important to note that the OOD data is treated in the same way as the ID data during construction of the transfer set and distillation. This offers an advantage over traditional Prior Network training, where knowledge of which examples are ID and which are OOD is required a priori. In Ensemble Distribution Distillation, the models can be distilled using any (potentially unlabeled) data on which ensemble predictions can be obtained.

Dataset    Individual  Ensemble  EnD   EnD+AUX  EnD^2  EnD^2+AUX
CIFAR-10   92.0        93.8      93.3  93.3     92.7   93.1
CIFAR-100  69.6        73.7      71.4  71.7     71.3   71.8
Table 3: Classification Performance (% Accuracy) on CIFAR-10 and CIFAR-100. For EnD, EnD+AUX, EnD^2 and EnD^2+AUX an average over three models is presented.

As seen in table 3, Ensemble Distillation (EnD) is again able to recover the classification performance of the ensemble with only a minor degradation. Ensemble Distribution Distillation is also able to recover most of the performance gain of the ensemble, but with a slightly larger degradation. When trained on a mix of ID and OOD data, the classification performance of EnD^2 improves marginally. This is likely due to the regularizing effect of the OOD data, on which the ensemble predictions are more spread out. However, both the EnD and EnD^2 models recover less of the ensemble's classification performance advantage on CIFAR-100 than on CIFAR-10. This is likely because the same training configuration was used on a more complex task; a better training configuration could yield improvements in classification performance on both datasets.

Test OOD  Model       CIFAR-10                  CIFAR-100
Dataset               Total Unc.  Knowl. Unc.   Total Unc.  Knowl. Unc.
LSUN      Individual  91.3        -             75.6        -
          EnD         88.8        -             73.1        -
          EnD+AUX     89.0        -             76.5        -
          EnD^2       92.6        91.5          81.0        83.7
          EnD^2+AUX   94.4        95.3          83.5        86.9
          Ensemble    94.5        94.4          82.4        88.4
TIM       Individual  88.9        -             70.5        -
          EnD         86.6        -             66.8        -
          EnD+AUX     86.9        -             70.0        -
          EnD^2       88.7        87.4          73.6        76.0
          EnD^2+AUX   91.3        91.8          76.4        79.3
          Ensemble    91.8        91.4          76.6        81.7

Table 4: OOD detection performance (% AUC-ROC) for CIFAR-10 and CIFAR-100 models using different measures of uncertainty.

Ensemble Distribution Distillation was also evaluated on an out-of-domain detection task, in which uncertainty metrics from the models and the ensemble are used to classify inputs as either OOD or ID. The ID examples are the test sets of CIFAR-10/100, and the test OOD examples are the test sets of LSUN and TinyImagenet (TIM). Table 4 shows the results for the out-of-domain detection task. As expected, the measures of uncertainty derived from the ensemble outperform those from a single neural network. EnD and EnD+AUX clearly fail to capture those gains. On the other hand, EnD^2 is able to reproduce the advantage of using the ensemble. When OOD data is used for training, Ensemble Distribution Distillation is able to perform on par with the ensemble, indicating that it has successfully learned how the distribution of the ensemble behaves on unfamiliar data. The difference in OOD detection performance between EnD^2 and the ensemble is greater on CIFAR-100 than on CIFAR-10, which suggests that as the task becomes more complex, EnD^2 yields worse uncertainty estimates, highlighting the advantage of EnD^2+AUX. Throughout this section, a Prior Network that parameterizes a Dirichlet was used for distribution-distilling the ensemble. However, the output distributions of an ensemble for the same input are not necessarily Dirichlet-distributed, especially in regions where the ensemble is diverse. To check how well EnD^2 captures the ensemble distribution, a histogram showing the example count at a given uncertainty, for both ID and test OOD data (a concatenation of LSUN and TIM), is shown in figure 5. On in-domain data, EnD^2 is seemingly able to emulate the uncertainty metrics of the ensemble well. Despite EnD^2 performing very well on the out-of-domain detection task, however, there is a noticeable mismatch between the ensemble and EnD^2 in the uncertainty metrics they provide on OOD data. This is expected, as on in-domain examples the samples from the ensemble will be highly concentrated around the mean, behaviour which can be adequately modelled by a Dirichlet. On out-of-domain data, however, the samples from the ensemble may be diverse in a way that is different from a Dirichlet; for instance, the distribution could be multimodal. It is possible that the ensemble could be better modelled by a different output distribution, such as a mixture of Dirichlets, which is an interesting direction for future work.
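The AUC-ROC used in this evaluation can be computed directly from uncertainty scores as the probability that a random OOD input scores higher than a random ID input (a small helper of our own; the exact pairwise form, fine at test-set scale):

```python
import numpy as np

def ood_auc(id_scores, ood_scores):
    """AUC-ROC for OOD detection with an uncertainty score where higher
    should mean 'more OOD': P(score_ood > score_id), ties counting half."""
    id_s = np.asarray(id_scores, float)
    ood_s = np.asarray(ood_scores, float)
    diff = ood_s[:, None] - id_s[None, :]          # all OOD/ID score pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

Here `id_scores` and `ood_scores` would be either total uncertainty (entropy of the expected distribution) or knowledge uncertainty (mutual information) evaluated on the ID and OOD test sets.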

(a) Mutual Information - ID
(b) Mutual Information - OOD
Figure 5: Histograms of the mutual information of the CIFAR-10 ensemble and EnD^2 on in-domain (ID) and test out-of-domain (OOD) data.

6 Conclusion

Ensemble Distillation approaches have become popular, as they allow a single model to achieve classification performance comparable to that of an ensemble at a lower computational cost. This work proposes distilling an ensemble into a single Prior Network model, such that it exhibits the improved classification performance of the ensemble and retains information about its diversity. This approach is referred to as Ensemble Distribution Distillation (EnD^2). The experiments described in sections 4 and 5 show that, on both artificial and CIFAR-10 data, it is possible to distill an ensemble into a single Prior Network such that it retains the classification performance of the ensemble. Furthermore, it is shown that the measures of uncertainty provided by EnD^2 match the behaviour of an ensemble of models on artificial data, and that the model is able to differentiate between different types of uncertainty. However, this required obtaining out-of-distribution inputs on which the ensemble is highly diverse, in order to allow the distribution-distilled model to learn the appropriate out-of-distribution behaviour. On CIFAR-10, the uncertainty metrics derived from EnD^2 allow the model to outperform both single NNs and EnD models, and when out-of-domain data is used for training, EnD^2 even matches the performance of the ensemble. These results are promising, and show that Ensemble Distribution Distillation enables a single model to capture the useful properties of an ensemble at a significantly reduced computational cost.


  • (1) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • (2) Ross Girshick, “Fast R-CNN,” in Proc. 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
  • (3) Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. International Conference on Learning Representations (ICLR), 2015.
  • (4) Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee, “Learning to Generate Long-term Future via Hierarchical Prediction,” in Proc. International Conference on Machine Learning (ICML), 2017.
  • (5) Tomas Mikolov et al., “Linguistic Regularities in Continuous Space Word Representations,” in Proc. NAACL-HLT, 2013.
  • (6) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient Estimation of Word Representations in Vector Space,” 2013, arXiv:1301.3781.
  • (7) Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Cernocký, and Sanjeev Khudanpur, “Recurrent Neural Network Based Language Model,” in Proc. INTERSPEECH, 2010.
  • (8) Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, and Brian Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,” Signal Processing Magazine, 2012.
  • (9) Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng, “Deep speech: Scaling up end-to-end speech recognition,” 2014, arXiv:1412.5567.
  • (10) Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in Proc. 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2015, KDD ’15, pp. 1721–1730, ACM.
  • (11) Babak Alipanahi, Andrew Delong, Matthew T. Weirauch, and Brendan J. Frey, “Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, July 2015.
  • (12) B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles,” in Proc. Conference on Neural Information Processing Systems (NIPS), 2017.
  • (13) Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al., “Clinically applicable deep learning for diagnosis and referral in retinal disease,” Nature medicine, vol. 24, no. 9, pp. 1342, 2018.
  • (14) Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
  • (15) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané, “Concrete problems in AI safety,” http://arxiv.org/abs/1606.06565, 2016, arXiv: 1606.06565.
  • (16) Kevin P. Murphy, Machine Learning, The MIT Press, 2012.
  • (17) L. Smith and Y. Gal, “Understanding Measures of Uncertainty for Adversarial Example Detection,” in UAI, 2018.
  • (18) J. H. M. Wong and M. J. F. Gales, “Multi-task ensembles with teacher-student training,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 84–90.
  • (19) Y. Wang, J. H. M. Wong, M. J. F. Gales, K. M. Knill, and A. Ragni, “Sequence teacher-student training of acoustic models for automatic free speaking language assessment,” in 2018 IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 994–1000.
  • (20) Yarin Gal and Zoubin Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning,” in Proc. 33rd International Conference on Machine Learning (ICML-16), 2016.
  • (21) Max Welling and Yee Whye Teh, “Bayesian Learning via Stochastic Gradient Langevin Dynamics,” in Proc. International Conference on Machine Learning (ICML), 2011.
  • (22) Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. 2018, pp. 8803–8812, Curran Associates, Inc.
  • (23) Yarin Gal, Uncertainty in Deep Learning, Ph.D. thesis, University of Cambridge, 2016.
  • (24) Andrey Malinin and Mark Gales, “Predictive uncertainty estimation via prior networks,” in Advances in Neural Information Processing Systems, 2018, pp. 7047–7058.
  • (25) Joaquin Quiñonero-Candela, Dataset Shift in Machine Learning, The MIT Press, 2009.
  • (26) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.
  • (27) Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling, “Bayesian dark knowledge,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. 2015, pp. 3438–3446, Curran Associates, Inc.
  • (28) Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft, “Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems,” stat, vol. 1050, pp. 11, 2017.
  • (29) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy, “Deep exploration via bootstrapped dqn,” in Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds., pp. 4026–4034. Curran Associates, Inc., 2016.
  • (30) Alex Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
  • (31) Martín Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems,” 2015, software available from tensorflow.org.
  • (32) Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” in Proc. 3rd International Conference on Learning Representations (ICLR), 2015.

Appendix A Datasets

Domain                        Dataset       Train   Valid   Test    Classes
In-Domain                     CIFAR-10      50000   -       10000   10
Out-of-Domain - Distillation  CIFAR-100     50000   -       10000   100
Out-of-Domain - Test          LSUN          -       -       10000   10
                              TinyImagenet  -       -       10000   200

Table 5: Description of in-domain and out-of-domain datasets used for distillation and testing in terms of number of images and classes.

Appendix B Model architecture and training

Dataset     Model   Epochs   Cycle len.   Dropout   Annealing   OOD data
CIFAR-10    DNN     45       30           0.5       -           -
            EnD     90       60           0.7       No          -
            EnD     90       60           0.7       No          CIFAR-100
            EnD^2   90       60           0.7       Yes         -
            EnD^2   90       60           0.7       Yes         CIFAR-100
CIFAR-100   DNN     100      70           0.5       -           -
            EnD     90       60           0.7       No          -
            EnD     90       60           0.7       No          CIFAR-10
            EnD^2   90       60           0.7       Yes         -
            EnD^2   90       60           0.7       Yes         CIFAR-10

Table 6: Training configurations, where ’Annealing’ refers to whether a temperature annealing schedule was used. The batch size for all models was 128. The dropout rate is quoted as the probability of retaining a unit, not of dropping it out.
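The ’Cycle len.’ column in Table 6 refers to the 1-cycle learning rate policy used during training. As a rough illustration only, the following sketch shows one common parameterisation of such a schedule; the warm-up/cool-down shape and the `lr_min`/`lr_max` values are placeholder assumptions, not the settings used in the paper.

```python
def one_cycle_lr(epoch, total_epochs, cycle_len, lr_min=1e-4, lr_max=1e-3):
    """One-cycle policy sketch: the learning rate rises linearly from
    lr_min to lr_max over the first half of the cycle, falls back to
    lr_min over the second half, and then decays linearly towards zero
    for the remaining (total_epochs - cycle_len) fine-tuning epochs."""
    half = cycle_len / 2.0
    if epoch < half:                       # warm-up phase
        return lr_min + (lr_max - lr_min) * epoch / half
    if epoch < cycle_len:                  # cool-down phase
        return lr_max - (lr_max - lr_min) * (epoch - half) / half
    # final decay phase after the cycle ends
    frac = (epoch - cycle_len) / max(total_epochs - cycle_len, 1)
    return lr_min * (1.0 - frac)
```

For the 90-epoch, cycle-length-60 configurations in Table 6, this schedule would peak at epoch 30 and decay to zero over the final 30 epochs.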

All models considered in this work were implemented in TensorFlow tensorflow using a variant of the VGG vgg architecture for image classification. DNN and EnD models were trained using the negative log-likelihood of the labels and of the mean ensemble predictions, respectively, while EnD^2 models were trained using the negative log-likelihood of the ensemble’s output categorical distributions. All models were trained using the Adam adam optimizer, with a 1-cycle learning rate policy and dropout regularization. For all ensembles, individual models were trained starting from different random seeds, with different random seeds also used for shuffling the data. In addition, data augmentation was applied when training models on the CIFAR-10 and CIFAR-100 datasets via random left-right flips, random shifts of up to 4 pixels, and random rotations of up to 15 degrees. The training configurations for all models are detailed in Table 6.

A fixed softmax temperature was used for Ensemble Distillation, as recommended in dark-knowledge; of the values considered, we found that it yielded the best classification performance. Temperature annealing resulted in worse classification performance for Ensemble Distillation, and hence was not used in those experiments. Similarly, for Ensemble Distribution Distillation, we selected the initial temperature that performed best among the values considered. Furthermore, batch normalisation was used for both Ensemble Distillation and Ensemble Distribution Distillation. To create the transfer set, the ensembles were evaluated on the unaugmented CIFAR-10 and CIFAR-100 training examples. During distillation (both EnD and EnD^2), models were trained on the augmented examples, but with the ensemble predictions obtained from the corresponding unaugmented inputs.
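The two distillation objectives above can be sketched as follows. This is an illustrative, dependency-free Python sketch rather than the paper’s TensorFlow implementation; the function names are ours, and temperature scaling is omitted for brevity. For EnD, the student is trained towards the mean ensemble prediction (a cross-entropy with soft targets); for EnD^2, the student outputs Dirichlet concentration parameters and is trained on the likelihood of each ensemble member’s categorical distribution.

```python
import math

def end_loss(student_probs, ensemble_probs):
    """EnD: cross-entropy between the mean ensemble prediction (soft
    targets) and the student's predictive distribution. ensemble_probs
    is a list of M length-K probability vectors."""
    M, K = len(ensemble_probs), len(student_probs)
    soft = [sum(p[k] for p in ensemble_probs) / M for k in range(K)]
    return -sum(soft[k] * math.log(student_probs[k]) for k in range(K))

def end2_loss(alphas, ensemble_probs):
    """EnD^2: average negative log-likelihood of the ensemble members'
    categorical outputs under the student's Dirichlet(alphas)."""
    alpha0 = sum(alphas)
    # log of the inverse normalising constant, log 1/B(alphas)
    log_norm = math.lgamma(alpha0) - sum(math.lgamma(a) for a in alphas)
    nll = 0.0
    for p in ensemble_probs:
        log_lik = log_norm + sum((a - 1.0) * math.log(pk)
                                 for a, pk in zip(alphas, p))
        nll -= log_lik
    return nll / len(ensemble_probs)
```

Note that `end_loss` depends only on the ensemble mean, whereas `end2_loss` also penalises a mismatch in the spread of the members’ predictions, which is what allows EnD^2 to retain the ensemble’s diversity information.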