1 Introduction
With deep learning models increasingly being deployed in safety-critical applications, estimating a model's uncertainty in its predictive distribution has become an increasingly pressing problem
[gal2016uncertainty]. The uncertainty of a statistical model can be divided into aleatoric and epistemic uncertainty [kendall2017uncertainties]:
Aleatoric uncertainty
captures stochasticity in the labels for supervised learning tasks. This uncertainty could be the result of noisy measurements, mislabelled samples, unobserved factors that influence the labels, and so on.

Epistemic uncertainty
captures uncertainty about the model that generated our data. For neural networks, this is often thought of as the uncertainty in the chosen architecture’s parameters but could also include uncertainty in the architecture itself.
The predictive uncertainty of a model is the combination of its aleatoric and epistemic uncertainty. In this paper we address the modelling of aleatoric uncertainty for classification tasks. Our approach can easily be combined with many approaches in the suite of Bayesian neural networks that estimate the epistemic uncertainty of a model [gal2016dropout, gal2017concrete, blundell2015weight],
resulting in an estimate of the full predictive uncertainty; however, this is not the focus of this work. Aleatoric uncertainty can be characterized as either homoscedastic or heteroscedastic:

Homoscedastic: the uncertainty in the labels is constant across the input space, i.e. all labels are equally noisy.

Heteroscedastic: the aleatoric uncertainty varies across the input space, e.g. certain examples may be more difficult to label manually than others.
If a dataset has heteroscedastic noise, then modelling heteroscedasticity is crucial for calibrated uncertainty quantification. In addition, if the data's "true" model is heteroscedastic, then maximum likelihood estimation of the parameters of a nonlinear homoscedastic model is biased and inconsistent [greene2012econometric]. Thus, for datasets with such uncertainty, improved modelling of heteroscedasticity promises not only better-calibrated uncertainty quantification but also less biased parameter estimation and consequently improved predictive performance as measured by some metric of interest, e.g. accuracy. The main contributions of this paper are:

Provide a theoretical framework for the state-of-the-art neural network heteroscedastic classification method by establishing a connection to a well-studied method in the econometrics literature.

Via this framework, provide a means of understanding the importance of the softmax temperature parameter in controlling an approximation to the objective under the true model.

Provide empirical evidence demonstrating significantly improved predictive performance on a range of tasks by tuning this temperature parameter vs. implicitly setting it to 1.0.
2 Background
In order to motivate our development of heteroscedastic classification models, we first review one formulation of a heteroscedastic regression model, as it is particularly amenable to interpretation [kendall2017uncertainties].
2.1 Heteroscedastic Regression Models
We have a dataset of $N$ examples $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ where $y_i$ is real valued. We assume that the $y_i$ are i.i.d. such that $y_i \sim \mathcal{N}(\mu_\theta(\mathbf{x}_i), \sigma^2_\theta(\mathbf{x}_i))$, where $\mu_\theta(\cdot)$ and $\sigma^2_\theta(\cdot)$
are any parametric models parameterized by $\theta$. The log-likelihood of our data is:

$$\log p(\mathcal{D} \mid \theta) = \sum_{i=1}^{N} \left[ -\frac{(y_i - \mu_\theta(\mathbf{x}_i))^2}{2\sigma^2_\theta(\mathbf{x}_i)} - \frac{1}{2}\log \sigma^2_\theta(\mathbf{x}_i) \right] + \text{const.} \quad (1)$$

If we set $\sigma^2_\theta(\mathbf{x}) = \sigma^2$, a constant, this reduces to a standard homoscedastic regression model. However, for a non-constant function $\sigma^2_\theta(\cdot)$, this differs from a standard regression model in that the squared error loss for each example is weighted by $1/(2\sigma^2_\theta(\mathbf{x}_i))$. Those examples with higher predicted aleatoric uncertainty will have less weight in the log-likelihood being maximized, and consequently the model will focus less on trying to explain them well. It is clear how such a log-likelihood can make a model more robust to "outliers" and noisy labels.
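This weighting can be made concrete with a short numerical sketch. The function below implements the per-example negative log-likelihood implied by Eq. (1); the input values are illustrative and not taken from any experiment in this paper.

```python
import numpy as np

def hetero_gaussian_nll(y, mu, log_var):
    """Per-example negative log-likelihood of y_i ~ N(mu_i, sigma_i^2).

    Each squared error is weighted by 1 / sigma_i^2, so examples with
    higher predicted aleatoric uncertainty contribute less; the
    0.5 * log_var term penalises inflating all variances.
    """
    inv_var = np.exp(-log_var)  # 1 / sigma_i^2
    return 0.5 * (inv_var * (y - mu) ** 2 + log_var)

# Two examples with identical squared error but different predicted noise:
y = np.array([1.0, 1.0])
mu = np.array([0.0, 0.0])
log_var = np.array([0.0, 2.0])  # sigma^2 = 1 vs sigma^2 = e^2
nll = hetero_gaussian_nll(y, mu, log_var)
# The second example's squared-error term is down-weighted by exp(-2).
```

Minimising this loss jointly over the mean and variance heads is what lets the model "explain away" noisy examples via a larger predicted variance rather than a distorted mean.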
2.2 Heteroscedastic Classification Models
Following on from their development of a neural network heteroscedastic regression model, [kendall2017uncertainties]
develop a similarly structured heteroscedastic classification model, which is the state-of-the-art approach for neural network heteroscedastic classification. They place a Gaussian distribution on the logits of a standard softmax classification model, making the logits latent variables:

$$\mathbf{u}_i \sim \mathcal{N}\left(\boldsymbol{\mu}(\mathbf{x}_i), \operatorname{diag}(\boldsymbol{\sigma}^2(\mathbf{x}_i))\right), \qquad p_{ic} = \mathbb{E}\left[\operatorname{softmax}(\mathbf{u}_i)_c\right] \quad (2)$$

where $p_{ic}$
is the probability that the label of example $i$ is class $c$
and $C$ is the number of classes. The model's log-likelihood and a Monte Carlo method for estimating it are also developed; see [kendall2017uncertainties] for details.

3 Latent Variable Heteroscedastic Classification Models
Heteroscedastic classification models have been widely studied in the econometric literature as a branch of discrete choice models [train2009discrete, mcfadden1973conditional, bhat1995heteroscedastic, munizaga2000representation]. It is typically assumed that there is some utility $u_{ic}$ associated with each class $c$. Utility is decomposed into a reference utility $\mu_c(\mathbf{x}_i)$, which is a function of observed variables $\mathbf{x}_i$, and an unobserved stochastic component $\epsilon_{ic}$: $u_{ic} = \mu_c(\mathbf{x}_i) + \epsilon_{ic}$. The probability that class $c$ is selected is the probability that the latent utility $u_{ic}$ is greater than the corresponding value for all other classes:

$$p_{ic} = P(u_{ic} > u_{ik},\ \forall k \neq c) \quad (3)$$
In the econometric literature, the reference utility model $\mu_c(\cdot)$ is often chosen to be a linear function. Restricting ourselves briefly to homoscedastic models, if we assume $\epsilon_{ic}$ is i.i.d. Gumbel distributed, then this latent variable formulation leads to a multinomial logistic model, which is exactly equivalent to the softmax classification models with cross-entropy loss widely used with neural networks [train2009discrete]. Similarly, if we choose i.i.d. standard Normal $\epsilon_{ic}$, this leads to a multinomial probit model with identity covariance matrix. Heteroscedastic models, however, break the assumption that the additive noise terms are identically distributed, though they are still assumed to be independent. Computing $p_{ic}$ or its gradient involves an integral over the noise terms for which no analytic solutions exist in the heteroscedastic case [train2009discrete, bhat1995heteroscedastic]. Similar latent variable formulations of heteroscedastic classification models have also been developed in the Gaussian Process literature, where naturally it is assumed the latent noise is distributed Normally [hernandez2014mind, williams2006gaussian] and a GP prior is placed on the location and scale functions. Again exact inference on the likelihood is intractable and approximate inference methods are used [hernandez2014mind]. Having seen this latent variable formulation of heteroscedastic classification models, we can now revisit the neural network model introduced in Eq. (2) and attempt to cast it similarly as a latent variable model:
$$u_{ic} = \mu_c(\mathbf{x}_i) + \sigma_c(\mathbf{x}_i)\epsilon_{ic}, \qquad \epsilon_{ic} \sim \mathcal{N}(0, 1) \quad (4)$$

But if we now assume that the probability of the label being class $c$ is the probability of the latent $u_{ic}$ for class $c$ being maximum, then in general we do not get the same $p_{ic}$ as assumed in [kendall2017uncertainties]:

$$p_{ic} = P(u_{ic} > u_{ik},\ \forall k \neq c) \neq \mathbb{E}\left[\operatorname{softmax}(\mathbf{u}_i)_c\right] \quad (5)$$

Thus when we interpret the heteroscedastic classification model from [kendall2017uncertainties] as a latent variable model, which it implicitly is, we see that the computation of the predictive probabilities is not theoretically justified.
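This mismatch is easy to verify by simulation. The sketch below uses two classes and illustrative locations and scales of our own choosing (not from any experiment here): it compares the latent-variable probability $P(u_0 > u_1)$ with the softmax-expectation probability used in Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0])     # class locations
sigma = np.array([0.1, 2.0])  # heteroscedastic scales: class 1 is noisier

S = 200_000
u = mu + sigma * rng.standard_normal((S, 2))  # latent utilities

# Latent-variable model: probability that class 0 has the maximum utility.
p_argmax = (u.argmax(axis=1) == 0).mean()

# Softmax-logits model (Eq. (2)): expectation of the softmax over the noise.
e = np.exp(u - u.max(axis=1, keepdims=True))  # numerically stable softmax
p_softmax = (e / e.sum(axis=1, keepdims=True)).mean(axis=0)[0]

# p_argmax is approximately Phi(2 / sqrt(0.01 + 4)) ~= 0.84, while
# p_softmax is noticeably smaller: the two formulations disagree whenever
# the noise scales are unequal.
```

With equal (homoscedastic) scales the gap shrinks; the disagreement is specific to the heteroscedastic setting discussed above.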
3.1 Location-Scale Family Latent Variable Heteroscedastic Classification
We can generalize latent variable heteroscedastic models by allowing the distribution on $\epsilon_{ic}$ to be any distribution, including but not limited to the Normal. As will be seen later, in the deep learning context it will be beneficial to restrict the distribution on $\epsilon_{ic}$ to be a location-scale family of distributions, e.g. Normal, Gumbel, Logistic. Each member of such a family is parameterized by a location parameter and a scale parameter. In terms of our latent variable heteroscedastic model above, we allow $\epsilon_{ic} \sim \mathcal{F}(0, 1)$ where $\mathcal{F}$ is a location-scale family:

$$u_{ic} = \mu_c(\mathbf{x}_i) + \sigma_c(\mathbf{x}_i)\epsilon_{ic} \sim \mathcal{F}(\mu_c(\mathbf{x}_i), \sigma_c(\mathbf{x}_i)) \quad (6)$$

$u_{ic}$ is also a member of $\mathcal{F}$ by the properties of a location-scale family, and once again the probability that a particular class $c$ is chosen is simply the probability that the latent variable $u_{ic}$ is the maximum of all latent variables:

$$p_{ic} = \mathbb{E}\left[\mathbb{1}\{u_{ic} = \max_k u_{ik}\}\right] \quad (7)$$
4 Deep Heteroscedastic Classification as a Special Case of the Smoothed Accept-Reject Simulator
We have seen that the existing state-of-the-art neural network heteroscedastic model is inexact when interpreted as a latent variable model. We will show that this model is a special case of a well-studied approximation in the econometric literature.
4.1 Softmax Approximation
Computing the value of Eq. (7) requires an integral over the noise terms which in general we cannot compute analytically [train2009discrete]. Eq. (7) is an expectation over the unobserved component of utility which can be estimated with Monte Carlo methods. However, direct Monte Carlo estimation still has the issue that the argmax function's derivatives are either zero or undefined everywhere in its domain. Therefore we seek a smooth approximation to Eq. (7), similar to the development of the Gumbel-Softmax/Concrete distribution [gumbelsoftmax2017, concrete2017]. We note that in the zero-temperature limit the softmax function is equivalent to the argmax, hence:

$$p_{ic} = \mathbb{E}\left[\mathbb{1}\{u_{ic} = \max_k u_{ik}\}\right] = \lim_{\tau \to 0} \mathbb{E}\left[\operatorname{softmax}(\mathbf{u}_i/\tau)_c\right] \quad (8)$$

where the expectation above is over $\boldsymbol{\epsilon}_i \sim \mathcal{F}(0, 1)$, and $\mathbb{1}\{\cdot\}$ is a 0-1 indicator function. A similar result for the binary classification case with a sigmoid smoothing function is derived in appendix A. In the econometric literature this approximation is known as the logit-smoothed accept-reject simulator [train2009discrete, mcfadden1989method]. When $\mathcal{F}$ is chosen to be Normal it is known as the "logit-kernel probit" [bolduc1996multinomial]. In the context of heteroscedastic models, the smoothed accept-reject simulator can be viewed as a smooth approximation to the true model, where the approximation is exact in the zero-temperature limit. While on the surface this approximation seems similar to the Gumbel-Softmax trick, the smoothed output is only distributed according to the Concrete distribution if the noise is standard Gumbel with constant scale, which would be a homoscedastic model. In our case, we are free to choose any location-scale family $\mathcal{F}$. And as we are particularly interested in heteroscedastic models, we do not restrict $\sigma_c(\cdot)$ to be a constant function, typically choosing to parameterize the location and scale functions as deep neural networks. Despite not motivating it as such, the neural network heteroscedastic classification model with Gaussian noise on the softmax logits, Eq. (4), is in fact a special case of the smoothed accept-reject simulator with $\mathcal{F} = \mathcal{N}$ and $\tau = 1$. However, the authors do not motivate their model as an approximation to a true model or recognise the importance of the temperature parameter in controlling that approximation. As both $u_{ic}$ and $\epsilon_{ic}$ are in a location-scale family, $u_{ic}$ can be defined in terms of a deterministic location-scale transformation, $u_{ic} = \mu_c(\mathbf{x}_i) + \sigma_c(\mathbf{x}_i)\epsilon_{ic}$, of $\epsilon_{ic} \sim \mathcal{F}(0, 1)$. We can thus apply the reparametrisation trick [diederik2014auto], which implies:
$$p_{ic} \approx \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{F}(0,1)}\left[\operatorname{softmax}\left(\frac{\boldsymbol{\mu}(\mathbf{x}_i) + \boldsymbol{\sigma}(\mathbf{x}_i) \odot \boldsymbol{\epsilon}}{\tau}\right)_c\right] \quad (9)$$
A Monte Carlo estimate of the approximate predictive probabilities can be obtained:
$$\hat{p}_{ic} = \frac{1}{S}\sum_{s=1}^{S} \operatorname{softmax}\left(\frac{\boldsymbol{\mu}(\mathbf{x}_i) + \boldsymbol{\sigma}(\mathbf{x}_i) \odot \boldsymbol{\epsilon}^{(s)}}{\tau}\right)_c, \qquad \boldsymbol{\epsilon}^{(s)} \sim \mathcal{F}(0, 1) \quad (10)$$
Note that, due to the reparametrization, gradients of Eq. (10) can be taken with respect to the parameters of the location and scale networks. Once we have computed $\boldsymbol{\mu}(\mathbf{x}_i)$ and $\boldsymbol{\sigma}(\mathbf{x}_i)$, computing the Monte Carlo estimate, i.e. Eq. (10) with $S$ samples, has computational cost $O(SC)$. This computational cost is typically trivial relative to the computational complexity of computing $\boldsymbol{\mu}(\mathbf{x}_i)$ and $\boldsymbol{\sigma}(\mathbf{x}_i)$ themselves. For example, if the location and scale models are neural networks with $L$ fully-connected layers of dimensionality $D$ (assuming the input dimensionality is also $D$), then the computational complexity of computing $\boldsymbol{\mu}(\mathbf{x}_i)$ and $\boldsymbol{\sigma}(\mathbf{x}_i)$ is $O(LD^2)$, which for large networks will likely dwarf $O(SC)$. This enables us to reduce the variance of Eq. (10) by taking many Monte Carlo samples with little impact on the training or inference computation time.
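As a runnable illustration of the estimator in Eq. (10), the sketch below uses the Gaussian noise family and arbitrary location, scale, and sample-count values of our own choosing:

```python
import numpy as np

def smoothed_probs(mu, sigma, tau, n_samples, rng):
    """Monte Carlo estimate of the logit-smoothed accept-reject simulator.

    Latent utilities are reparametrised as u = mu + sigma * eps with
    eps ~ N(0, 1); the softmax is applied at temperature tau and averaged
    over the samples.  As tau -> 0 the softmax approaches the argmax
    indicator, so the estimate approaches the exact latent-variable
    class probabilities of Eq. (7).
    """
    eps = rng.standard_normal((n_samples, len(mu)))
    z = (mu + sigma * eps) / tau
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable softmax
    return (e / e.sum(axis=1, keepdims=True)).mean(axis=0)

rng = np.random.default_rng(0)
mu = np.array([2.0, 0.0, -1.0])
sigma = np.array([0.5, 2.0, 1.0])
p_tau1 = smoothed_probs(mu, sigma, tau=1.0, n_samples=100_000, rng=rng)
p_cold = smoothed_probs(mu, sigma, tau=0.01, n_samples=100_000, rng=rng)
# p_cold is close to the exact P(u_c = max_k u_k), while p_tau1 is a
# biased but lower-variance approximation to it.
```

Because the softmax is applied per sample before averaging, the whole estimate is differentiable in the location and scale parameters, which is what the reparametrisation argument above requires.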
4.2 Temperature Parameter Importance
In our softmax approximation, as the temperature $\tau$ gets closer to zero the bias in the approximation to the true objective goes down, but the variance of the gradients of our approximate objective increases [gumbelsoftmax2017, concrete2017]. Thus the temperature parameter controls a bias-variance trade-off. Tuning this temperature, for example on a validation set, may enable us to find a better point along this trade-off than just setting $\tau = 1$ for all datasets and models. By increasing the number of Monte Carlo samples $S$ we take when estimating the approximate predictive distribution, Eq. (9), we reduce the variance of the log-likelihood gradients. We then have two ways to control the variance of our gradients, $\tau$ and $S$. However, varying $S$ does not impact the bias of our gradients. As a result, we can predict that with more Monte Carlo samples during training, the optimal temperature for maximizing predictive performance is lower and the corresponding predictive performance is higher. This insight demonstrates the importance of our theoretical developments above in understanding the role of the temperature parameter, implicitly set to 1 in previous work [kendall2017uncertainties]. We test this empirically below.
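The trade-off can be observed directly by repeatedly evaluating a Monte Carlo loss estimate for a single example at different $(\tau, S)$ settings. The sketch below uses a toy two-class example with made-up parameters; it illustrates the claimed variance behaviour and is not one of the paper's experiments.

```python
import numpy as np

def mc_loss(tau, n_samples, rng):
    """One Monte Carlo estimate of -log p_hat(y = 0) for a fixed toy example."""
    mu, sigma = np.array([0.5, 0.0]), np.array([1.0, 2.0])
    z = (mu + sigma * rng.standard_normal((n_samples, 2))) / tau
    e = np.exp(z - z.max(axis=1, keepdims=True))  # stable softmax
    p_hat = (e / e.sum(axis=1, keepdims=True)).mean(axis=0)[0]
    return -np.log(p_hat)

rng = np.random.default_rng(0)
reps = 500
var_cold = np.var([mc_loss(0.01, 25, rng) for _ in range(reps)])
var_warm = np.var([mc_loss(2.0, 25, rng) for _ in range(reps)])
var_cold_big_s = np.var([mc_loss(0.01, 1000, rng) for _ in range(reps)])
# Lowering tau reduces the bias of the approximation but inflates the
# estimator's variance; raising the number of samples S brings the
# variance back down without affecting the bias.
```

This is the same mechanism at the level of gradients: colder temperatures push the per-sample softmax outputs towards hard 0/1 indicators, so fewer samples means a noisier average.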
5 Related Work
5.1 Aleatoric Uncertainty in Deep Learning
Estimating uncertainty in deep learning has predominantly focused on estimating epistemic uncertainty [gal2016dropout, blundell2015weight]. Nevertheless, seminal work on estimating heteroscedastic aleatoric uncertainty in regression problems with neural networks was done in [bishop1997regression], which allows the variance term in the Gaussian likelihood to vary as a learned function of the inputs. The authors show that maximum likelihood estimation of the model parameters is biased and develop a Bayesian treatment for parameter estimation. Follow-up work [kendall2017uncertainties] revisits the aforementioned regression model and also introduces the heteroscedastic classification model discussed earlier in this paper. The authors show that these heteroscedastic models can also be combined with MC dropout [gal2016dropout]
approximate Bayesian inference for epistemic uncertainty estimation. The combined heteroscedastic Bayesian model yields improved performance on semantic segmentation and depth regression tasks. An ensembling approach to uncertainty estimation in deep learning has been proposed in
[deepensembles.2018], which suggests using multiple models to estimate both aleatoric and epistemic uncertainty. Along the same lines, [liu2019accurate] introduces a Bayesian nonparametric ensemble to estimate both sources of uncertainty. Estimating heteroscedastic aleatoric uncertainty with deep models specifically for images has been studied in [ayhan2018testtimeDA]. The authors suggest using traditional data augmentation methods, such as geometric and color transformations, at test time in order to estimate how much the network output varies in the vicinity of the input examples. Finally, some recent research efforts aim to estimate specific uncertainty metrics. For example, the authors of [pearce2018predictionintervals] introduce a novel loss function which allows them to use ensemble networks to estimate prediction intervals without making any assumptions on the output distribution. Another work
[natasa2019uncertainties] introduces a loss function to learn all the conditional quantiles, which are used to compute well-calibrated prediction intervals. For estimating epistemic uncertainty the authors propose a method based on orthonormal certificates.
5.2 Noisy Labels
A large literature exists which seeks to tackle the problem of noisy labels in training deep neural networks. Most of the methods try to identify samples with incorrect labels and explicitly reduce their influence on the learning process by removing or down-weighting the samples in the loss function. We tackle the problem of aleatoric uncertainty more generally, which has applications to datasets with incorrectly labelled samples, but also to uncertainty estimation and problems such as image segmentation, where it is unclear how to apply these noisy-label methods. Nevertheless, for the sake of completeness, we briefly review some of the key papers tackling noisy labels directly. The MentorNet method [MentorNet.2018] employs a second model (called the MentorNet) that estimates the sample weights (aka curriculum) used to supervise the training of a StudentNet, which is the main network. The MentorNet can be learned to approximate an existing predefined curriculum or to discover new curricula from data. In the latter case the curriculum is learned using a small dataset with clean labels. The Co-teaching method [CoTeaching.2018] also jointly trains two neural networks, with each network teaching the other. At each training step, each network computes predictions on a minibatch of samples and identifies small-loss samples, which are then fed to the other network for learning. The justification behind the use of small-loss samples is that they are more likely to have clean labels. Although Co-teaching helps with learning under noisy labels, the two networks tend to converge to a consensus over the training iterations, and as a result Co-teaching reduces to self-training MentorNet after a while. In order to address this issue, follow-up work [CoTeaching++.2019] introduces disagreement between the predictions produced by the two networks.
Note finally that, in contrast to our method, none of the noisy-label approaches outlined above provide an estimate of the aleatoric uncertainty, which is a very powerful tool enabling a wide range of learning tasks, including image segmentation.
6 Experiments
Public classification datasets are typically collected in such a manner as to avoid noisy labels and heteroscedasticity. In the experiments below we demonstrate our model on a number of datasets where we generate heteroscedasticity synthetically. However, there are some public datasets for important machine learning problems that do exhibit heteroscedasticity; image segmentation is one such example
[kendall2017uncertainties]. So in addition to our synthetic experiments we also evaluate our model on two image segmentation benchmarks, PASCAL VOC [everingham2014pascal] and Cityscapes [cordts2016cityscapes]. In general it is difficult to know a priori whether a real-world dataset will exhibit heteroscedasticity. We address this question in appendix B, where we show that common problems with real-world datasets, such as noisy human labellers and missing not at random (MNAR) data, satisfy the necessary conditions for heteroscedasticity.

6.1 Controlled Label Noise
Dataset | $\tau^*$ | Loss ($\tau^*$) | Loss ($\tau = 1$) | $p$-value | Homoscedastic loss | $p$-value
CIFAR-10 | 2.2 | 1.295 | 1.30 | 9 | 1.318 | 5
MNIST | 100 | 0.767 | 0.808 | 2 | 0.807 | 6
SVHN | 20 | 0.794 | 0.804 | 4 | 0.823 | 1

Table 1: $p$-values are from a paired sample two-tailed t-test where replicates are from corresponding random seeds; 50 replicates are used. The t-test is conducted in reference to the heteroscedastic loss at the optimal temperature. $\tau^*$ is the temperature which minimizes cross-entropy loss on the validation set.

We generate heteroscedastic uncertainty synthetically in three standard image classification datasets: CIFAR-10 [krizhevsky2009learning], MNIST [lecun1998mnist] and SVHN [netzer2011reading]. We corrupt the labels of some examples in a data-conditional manner as follows: examples with labels 0-4 are left uncorrupted, while for examples with labels 5-9 we randomly assign a new label. For examples with label 5, 20% of training examples were assigned a new randomly selected label, with 30%, 40%, 50% and 60% of labels 6, 7, 8 and 9 respectively receiving the same treatment. A simple convolutional network architecture is used in each experiment; see appendix D for details. Each plot in fig. 1 shows an average over 50 training runs. The noise model used in all heteroscedastic models is Gaussian. The number of Monte Carlo samples during training is varied across methods; however, during predictions on the test set and the validation set (for early stopping) we always use 1,000 samples for all heteroscedastic methods. Results are reported on the test set with similarly corrupted labels.
6.1.1 Do heteroscedastic models outperform homoscedastic models when there exists heteroscedastic noise?
First we wish to verify whether heteroscedastic models in fact outperform the standard homoscedastic model when we know there exist heteroscedastic noisy labels. Looking at fig. 1 it is clear that there are large ranges of temperatures for which the heteroscedastic test set accuracy is higher than the homoscedastic model's and, similarly, that the test set loss is lower than the homoscedastic model's. This is true for all numbers of training set Monte Carlo samples. See fig. 6 in appendix C for similar results on CIFAR-10. We also conduct a more formal test. We select the optimal temperature for each dataset based on the validation set loss. Then we conduct a paired sample t-test between the homoscedastic model's loss and the heteroscedastic model's loss at the optimal temperature on the test set. Replicates in the t-test are paired by having corresponding random seeds. In Table 1 we see that for each dataset the best heteroscedastic model does in fact outperform the homoscedastic model and that the difference in test set loss is statistically significant.
6.1.2 Is 1.0 always the optimal temperature?
In addition to comparing our method to the homoscedastic baseline, we wish to compare to [kendall2017uncertainties], who implicitly set the softmax temperature to 1 and use a Gaussian noise model on the logits. We first test whether in practice the temperature plays the important role we claim in controlling a bias-variance trade-off, or whether the standard choice of setting the softmax temperature to 1 is optimal. For a fair comparison we also restrict the noise model to be Gaussian. We again select the optimal temperature on the validation set and then perform a paired t-test between the heteroscedastic model at the optimal temperature and at $\tau = 1$ on the test set. In Table 1 we see the results. On all datasets the optimal temperature is not 1 and the difference in performance between the optimal temperature and $\tau = 1$ is statistically significant. Thus the optimal temperature is not always 1 and the performance of heteroscedastic models can be improved by tuning the softmax temperature. Interestingly, note that for MNIST the heteroscedastic model with $\tau = 1$ has higher loss than the homoscedastic model, but the heteroscedastic model at the optimal temperature is statistically significantly better than both.
6.1.3 Do lower variance gradients improve predictive performance?
Figures 1 and 6 show the predictive performance as we vary the temperature for a number of different numbers of Monte Carlo samples during training. During evaluation 1,000 Monte Carlo samples are always used. If we take a vertical slice at any temperature along these plots then we are holding the temperature, and hence the bias, constant. Along this vertical slice the number of Monte Carlo samples varies; more Monte Carlo samples reduce the variance of the corresponding gradients. It is clear from these plots that, as expected, increasing the number of Monte Carlo samples, and hence reducing the variance of the gradients during training, improves the final predictive performance. In particular, if we compare the predictive performance with the smallest vs. the largest number of Monte Carlo samples during training, on all datasets there is a substantial gap in performance.
6.2 Image Segmentation
Dataset | $\tau^*$ | mIoU ($\tau^*$) | mIoU ($\tau = 1$) | $p$-value | Homoscedastic mIoU | $p$-value
Cityscapes | 0.05 | 76.95% | 76.20% | 2.0 | 76.29% | 2.4
PASCAL VOC | 0.01 | 86.03% | 84.92% | 4.9 | 85.02% | 1.6
Heteroscedastic models are well suited to image segmentation due to the naturally occurring data-dependent uncertainty. It is excessively time-consuming for a human annotator to individually label the pixels in an image, as this would require a separate labelling operation for every pixel. In practice human annotators typically label collections of pixels at a time. As a result, annotations tend to be noisy at the boundaries of objects, where multiple pixels may be labelled together as either the foreground object or the background class. We apply our heteroscedastic models with normally distributed latent variables to real-world image segmentation datasets, namely PASCAL VOC 2012 [everingham2014pascal] and Cityscapes [cordts2016cityscapes]. Details about the model architecture and the training setup can be found in Appendix F. At a high level we follow the same end-to-end image segmentation setup as in [chen2018deeplabv3+], with the only difference being the application of our heteroscedastic sampling process to the model output. Performance is measured by mean Intersection over Union (mIoU), as is standard for image segmentation. Figs 1(a) and 1(b) show the effect of the temperature on the mean IoU on the two image segmentation datasets, using the Monte Carlo estimate of Eq. (10). We note that we report validation set results, as the test set is not public and only a limited number of submissions to the test server are allowed. The temperature is plotted on a log scale. Heteroscedastic models outperform the homoscedastic model for a wide range of temperatures. Similar to the results on the controlled label noise experiments, the optimal temperature is not found at $\tau = 1$. In fact, averaging over 20 runs, for both datasets the $\tau = 1$ model is outperformed by the homoscedastic model. Table 2 shows that the differences in performance between the heteroscedastic model at the optimal temperature, the heteroscedastic model at $\tau = 1$ and the homoscedastic model are statistically significant.
The difference in the models also leads to qualitatively different segmentations and uncertainty heatmaps. Fig. 3 shows an example segmentation using the best homoscedastic model and the heteroscedastic models at the optimal temperature and at $\tau = 1$, trained on PASCAL VOC. Reflecting the improvement in mean IoU, the heteroscedastic segmentation at the optimal temperature is qualitatively superior. Further examples are shown in Appendix G, where we have selected both successful segmentations and failure cases. In addition to the improved segmentation performance, the key benefit of the heteroscedastic approach is the ability to estimate aleatoric uncertainty. Fig. 4 shows heat maps of the per-class variance of the predictive distribution. As expected, the regions of highest aleatoric uncertainty are at object boundaries. Interestingly, the heteroscedastic uncertainty heatmaps at the optimal temperature are more fine-grained and precise than the $\tau = 1$ heatmaps.
7 Conclusion and Future Work
The [kendall2017uncertainties] deep heteroscedastic classification model has been shown to perform well on tasks such as semantic segmentation, but the use of the softmax link function lacks a theoretical justification. We have argued that the true generative model of the data involves an argmax over latent variables and that the softmax should be viewed as an approximation to this argmax. This approximation is equivalent in the zero-temperature limit, but in practice the temperature must be tuned to balance the bias in the approximation against the variance of the gradients. We have found that this view of heteroscedastic classification models already exists in the econometrics literature, but with linear function approximators. By developing this connection to the econometrics literature we place deep heteroscedastic classification models on firmer theoretical footing, and we use this theory to improve the predictive performance of these models. By tuning the temperature, and the corresponding bias-variance trade-off, we have shown improved performance on a range of image classification tasks where we add noise to the labels. Likewise, on two standard image segmentation benchmarks which naturally have noisy labels, tuning the temperature results in significantly improved mean IoU. In fact, on three of the five datasets presented above the [kendall2017uncertainties] model is outperformed by a homoscedastic model, while our heteroscedastic model at the optimal temperature outperforms both the homoscedastic model and the heteroscedastic model at $\tau = 1$. The above heteroscedastic models relax the "identically distributed" part of the i.i.d. assumption on the additive noise in homoscedastic models, but still assume the noise terms are independent. For some problems this assumption may be unrealistic, e.g. for image segmentation the noise on the background class for a pixel at the edge of an object should be anti-correlated with the noise on the object class.
In future work, building on our latent variable formulation of deep classification models, we plan to relax this independence assumption in addition to the identical-distribution assumption.
References
Appendix A Heteroscedastic Binary Classification
For multiclass classification we use the softmax as a smoothing function for the argmax, with the guarantee of equivalence in the zero-temperature limit. For binary classification it is more convenient to avoid the use of the vector-valued argmax and softmax functions and simply have the model output the probability of one class being chosen,
$p_i$, in which case the probability of the other class is simply $1 - p_i$:

$$p_i = P(u_{i1} > u_{i0}) = \mathbb{E}\left[\mathbb{1}\{d_i > 0\}\right] = \lim_{\tau \to 0} \mathbb{E}\left[\operatorname{sigmoid}(d_i/\tau)\right], \qquad d_i = u_{i1} - u_{i0} \quad (11)$$

The key step is to replace the difference of the two latent variables with a single latent variable $d_i$, which is valid as all latent variables are members of the location-scale family $\mathcal{F}$. This sigmoid smoothing function has also been used in the econometrics literature [train2009discrete].
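A minimal numerical sketch of this binary approximation, assuming Gaussian noise and illustrative parameter values of our choosing: the two latent utilities are collapsed into a single latent difference, which is smoothed with a sigmoid at temperature tau.

```python
import numpy as np

def stable_sigmoid(x):
    """Numerically stable logistic sigmoid."""
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def binary_smoothed_prob(mu, sigma, tau, n_samples, rng):
    """Sigmoid-smoothed estimate of P(class 1) = P(u_1 > u_0).

    For independent Gaussian noise, the latent difference d = u_1 - u_0
    is again Gaussian with location mu_1 - mu_0 and scale
    sqrt(sigma_0^2 + sigma_1^2), so a single latent variable suffices.
    As tau -> 0 this approaches the exact probability.
    """
    scale = np.hypot(sigma[0], sigma[1])
    d = (mu[1] - mu[0]) + scale * rng.standard_normal(n_samples)
    return stable_sigmoid(d / tau).mean()

rng = np.random.default_rng(0)
p1 = binary_smoothed_prob(np.array([0.0, 1.0]), np.array([1.0, 2.0]),
                          tau=0.01, n_samples=400_000, rng=rng)
# At small tau, p1 is close to Phi((mu_1 - mu_0) / sqrt(sigma_0^2 + sigma_1^2)).
```

Note that the Gaussian case is special in that the difference of the two latents stays in the same family; for other location-scale families the difference distribution should be derived accordingly.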
Appendix B Necessary Conditions for Heteroscedasticity
Given a classification task, it can be difficult to tell a priori whether the problem is heteroscedastic or homoscedastic. Here we state the necessary conditions for heteroscedasticity to exist. We also provide some graphical models that satisfy these conditions and correspond to reasonable models of real-world applications. In practice it is always an empirical question whether some particular heteroscedastic model outperforms a homoscedastic model, but we hope these examples will provide a framework for thinking about heteroscedastic modelling and when it is likely to be helpful for a given task. We wish to predict a label $y$ given some observed variables $\mathbf{x}$. In order for us to be uncertain about the value of $y$, there must be some other unobserved variables $\mathbf{z}$ which influence $y$. If $\mathbf{x}$ and $\mathbf{z}$ are independent, e.g. we assume $\mathbf{z}$ is the source of additive noise in a latent variable model, then a heteroscedastic model won't help, as we only observe $\mathbf{x}$, which is independent of $\mathbf{z}$. Hence the necessary conditions for heteroscedasticity are that $\mathbf{z}$ influences $y$ and that $\mathbf{z}$ and $\mathbf{x}$ are dependent. Figure 5 shows some graphical models which satisfy these conditions. Our synthetically generated noisy labels are well modeled by fig. 4(a). Academic datasets typically do not have missing elements in the input features, but when working on applications of machine learning, real-world datasets very often have missing data. Interestingly, datasets with missing not at random (MNAR) missingness [little2014statistical] satisfy the necessary conditions for heteroscedasticity. Fig. 4(b) may be a reasonable graphical model for MNAR missing data, where the missing components may be related in complex bidirectional relationships with the observed variables. Fig. 4(c) models cases such as image segmentation. Here the observed variables are predictive of the imagined "true labels", but we must deal with human-labelled examples.
In the image segmentation example, whether the pixel is at the boundary of an object (observed in $\mathbf{x}$), combined with unobserved features of the human labeller such as labelling method, speed of labelling, attention to detail, etc., interacts to yield the observed labels.
Appendix C Controlled Label Noise: Further Results
Due to lack of space we provide in Fig. 6 the classification results for CIFAR10.
Appendix D Controlled Label Noise Experiments: Architecture and Training Details
For MNIST experiments we use a three-hidden-layer MLP with 1024 units at each layer and the leaky ReLU activation function. For CIFAR-10 and SVHN experiments we use a convolutional neural network with three convolutional layers of 32-64-64 3×3 filters with stride 1. Each convolutional layer is followed by a max-pooling layer with a 2×2 window and stride 1. Two fully-connected layers of 256 and 128 units follow the convolutional layers. All hidden layers use the leaky ReLU activation function. We train with Adam with default parameters: learning rate 0.001,
$\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Networks are trained for a maximum of 1,000 epochs, being stopped early if the validation set loss has not improved in 10 epochs. The best validation set checkpoint is used for test set evaluation.
Appendix E Text Classification: Experiments
We have also tested our method on data of a different modality beyond images. In particular, we have looked into text classification using the publicly available text classification datasets summarized in Table 3. We use a similar setup as in Section D with controlled label noise. For the Global Warming dataset, which has 2 classes, we use the binary classification formulation discussed earlier in Section A. Given that the input data is pure text, we use pre-trained text embeddings (https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) available in TensorFlow Hub. We use simple neural networks that consist of the pre-trained embedding module followed by a single dense layer of size 32 and the final heteroscedastic layer. For the experimental validation we use a controlled label noise setup whereby for half of the classes we randomly flip the label and the other half remains untouched. The flip probability depends on the class index and ramps up linearly from 0.25 to 0.5. Fig. 7 shows the behaviour of the test accuracy as well as the test loss with respect to the temperature. The results on Global Warming at the optimal temperature (selected on the validation set) vs. the homoscedastic model are statistically significant for the test loss, and for Political Message they are statistically significant in both metrics (loss and accuracy). Notice that once more the optimal temperature may not always equal one.

Table 3: Text classification datasets.

Data set           Train Examples  Val. Examples  Test Examples  Classes
Global Warming     3380            422            423            2
Political Message  4000            500            500            9
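The class-dependent label-flipping procedure described above can be sketched as follows. This is a minimal sketch with hypothetical names, and it assumes one plausible reading of the setup: the first half of the classes keep clean labels, while labels in the second half are flipped to a random other class with a probability ramping linearly from 0.25 to 0.5 across the noisy classes.

```python
import numpy as np

def flip_labels(y, num_classes, seed=0):
    """Controlled label noise: clean first half of the classes; the
    second half is flipped with probability ramping from 0.25 to 0.5."""
    rng = np.random.default_rng(seed)
    noisy_classes = np.arange(num_classes // 2, num_classes)
    # Linear ramp of flip probabilities over the noisy classes.
    flip_prob = np.zeros(num_classes)
    flip_prob[noisy_classes] = np.linspace(0.25, 0.5, num=len(noisy_classes))
    y = np.asarray(y)
    flips = rng.uniform(size=y.shape) < flip_prob[y]
    # Flip to a uniformly random *different* class.
    new_labels = (y + rng.integers(1, num_classes, size=y.shape)) % num_classes
    return np.where(flips, new_labels, y)

y_clean = np.repeat(np.arange(10), 1000)
y_noisy = flip_labels(y_clean, num_classes=10)
```

Because the flip probability is a deterministic function of the class index, the induced label noise is heteroscedastic in exactly the sense of Appendix B: the noise level varies across the input space.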
Appendix F Image Segmentation: Architecture and Training Setup
We replicate the DeepLabv3+ [chen2018deeplabv3+] architecture and training setup, which achieves state-of-the-art image segmentation results. DeepLabv3+ uses an Xception [chollet2017xception] based architecture with an added decoder module. In particular we use the Xception-65 architecture with an output stride of [chen2018deeplabv3+]. Following DeepLabv3+, all our methods are warm-started from the same checkpoint, a network pre-trained on JFT and MS-COCO. We train for 150K steps using the SGD optimizer with learning rate of and otherwise default parameters. We use a batch size of and keep fixed the batch normalisation statistics from the pre-trained checkpoint. Training is performed on the same augmented training data set used for DeepLabv3+.
In the homoscedastic model, a single convolution is applied to the output of the decoder, followed by bilinear upsampling to the size of the image, in order to compute logits for each pixel. Strictly speaking, either the final features should be upsampled to the original image size and used to compute correctly sized scale and location parameters, or the scale and location parameters should be computed at a lower resolution and upsampled to full size. However, upsampling first increases the memory required for the Monte Carlo samples (by a factor of 16 in this case), which makes it difficult to fit in memory on a single GPU. We therefore sample the "logits" at a lower resolution and upsample each sample to full image dimensions by bilinear interpolation.
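The low-resolution sampling scheme can be sketched in numpy (hypothetical function names; a Gaussian location/scale parameterization of the latent "logits" is assumed, following the description above):

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear interpolation of an (H, W, C) array to (out_h, out_w, C)."""
    h, w, _ = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def mc_softmax_probs(loc, scale, out_h, out_w, num_samples=16, seed=0):
    """Sample low-resolution logits = loc + scale * eps, upsample each
    sample bilinearly to full resolution, then average the softmaxes."""
    rng = np.random.default_rng(seed)
    probs = 0.0
    for _ in range(num_samples):
        sample = loc + scale * rng.standard_normal(loc.shape)
        logits = bilinear_resize(sample, out_h, out_w)
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs += e / e.sum(axis=-1, keepdims=True)
    return probs / num_samples

# Usage: 8x8 location/scale maps, 3 classes, upsampled to 32x32.
loc = np.zeros((8, 8, 3))
scale = 0.1 * np.ones((8, 8, 3))
p = mc_softmax_probs(loc, scale, 32, 32)
```

Sampling at the lower resolution keeps the per-sample memory cost at the decoder's feature-map size rather than the full image size, at the price of smoothing the sampled noise during upsampling.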
Appendix G Image segmentation examples
[Figure: qualitative image segmentation examples. Each row shows, from left to right: the input image, the ground truth segmentation, the homoscedastic model's prediction, and the heteroscedastic model's prediction.]