Analysis of Softmax Approximation for Deep Classifiers under Input-Dependent Label Noise

03/15/2020, by Mark Collier et al., Google

Modelling uncertainty arising from input-dependent label noise is an increasingly important problem. A state-of-the-art approach for classification [Kendall and Gal, 2017] places a normal distribution over the softmax logits, where the mean and variance of this distribution are learned functions of the inputs. This approach achieves impressive empirical performance but lacks theoretical justification. We show that this model is a special case of a well known and theoretically understood model studied in econometrics. Under this view the softmax over the logit distribution is a smooth approximation to an argmax, where the approximation is exact in the zero temperature limit. We further illustrate that the softmax temperature controls a bias-variance trade-off and the optimal point on this trade-off is not always found at 1.0. By tuning the softmax temperature, we achieve improved performance on well known image classification benchmarks with controlled label noise. For image segmentation, where input-dependent label noise naturally arises, we show that tuning the temperature increases the mean IoU on the PASCAL VOC and Cityscapes datasets by more than 1% relative to a model that does not model this noise source.


1 Introduction

With deep learning models increasingly being deployed in safety critical applications, estimating a model’s uncertainty in its predictive distribution has become a more pressing problem [gal2016uncertainty]. The uncertainty of a statistical model can be divided into aleatoric and epistemic uncertainty [kendall2017uncertainties]:

  • Aleatoric uncertainty captures stochasticity in the labels for supervised learning tasks. This uncertainty could be the result of noisy measurements, mis-labelled samples, unobserved factors that influence the labels, and so on.

  • Epistemic uncertainty captures uncertainty about the model that generated our data. For neural networks, this is often thought of as the uncertainty in the chosen architecture’s parameters but could also include uncertainty in the architecture itself.

The predictive uncertainty of a model is the combination of its aleatoric and epistemic uncertainty. In this paper we address the modelling of aleatoric uncertainty for classification tasks. Our approach can be easily combined with many approaches in the suite of Bayesian neural networks that estimate the epistemic uncertainty of a model [gal2016dropout, gal2017concrete, blundell2015weight], resulting in an estimate of the full predictive uncertainty. However, this is not the focus of this work. Aleatoric uncertainty can be either characterized as homoscedastic or heteroscedastic uncertainty:

  • Homoscedastic: the uncertainty in the labels is constant across the input space i.e. all labels are equally noisy.

  • Heteroscedastic: the aleatoric uncertainty varies across the input space, e.g. certain examples may be more difficult to label manually than others.

If a dataset has heteroscedastic noise, then modelling heteroscedasticity is crucial for calibrated uncertainty quantification. In addition, if the data’s “true” model is heteroscedastic, then maximum likelihood estimation of the parameters of a non-linear homoscedastic model is biased and inconsistent [greene2012econometric]. Thus, for datasets with such uncertainty, improved modelling of heteroscedasticity promises not only better calibrated uncertainty quantification but also less biased parameter estimation and consequently improved predictive performance as measured by some metric of interest e.g. accuracy. The main contributions of this paper are:

  1. Provide a theoretical framework for the state-of-the-art neural network heteroscedastic classification method by establishing a connection to a well studied method in the econometrics literature.

  2. Via this framework, provide a means of understanding the importance of the softmax temperature parameter in controlling an approximation to the objective under the true model.

  3. Provide empirical evidence demonstrating significantly improved predictive performance on a range of tasks by tuning this temperature parameter vs. implicitly setting it to 1.0.

2 Background

In order to motivate our development of heteroscedastic classification models we shall first review one formulation of a heteroscedastic regression model as it is particularly amenable to interpretation [kendall2017uncertainties].

2.1 Heteroscedastic Regression Models

We have a dataset of examples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ where $y_i$ is real valued. We assume the $y_i$ are i.i.d. such that $y_i \mid x_i \sim \mathcal{N}(\mu_\theta(x_i), \sigma^2_\theta(x_i))$, where $\mu_\theta$ and $\sigma_\theta$ are any parametric models parameterized by $\theta$. The log-likelihood of our data is:

$$\log p(\mathbf{y} \mid \mathbf{x}; \theta) = \sum_{i=1}^{N} \Big( -\frac{(y_i - \mu_\theta(x_i))^2}{2\sigma^2_\theta(x_i)} - \frac{1}{2}\log \sigma^2_\theta(x_i) - \frac{1}{2}\log 2\pi \Big) \qquad (1)$$

If we set $\sigma_\theta(x_i)$ to a constant, this reduces to a standard homoscedastic regression model. However, for a non-constant function $\sigma_\theta(x_i)$, this differs from a standard regression model in that the squared error loss for each example is weighted by $1 / (2\sigma^2_\theta(x_i))$. Those examples with higher predicted aleatoric uncertainty will have less weight in the log-likelihood being maximized and consequently the model will focus less on trying to explain them well. It is clear how such a log-likelihood can make a model more robust to “outliers” and noisy labels.
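To make the weighting explicit, here is a minimal NumPy sketch of the per-example negative log-likelihood in Eq. (1); the function name and the choice to have the network predict the log-variance rather than the variance (for positivity) are our own illustrative assumptions, not part of the original model.

import numpy as np

def heteroscedastic_gaussian_nll(y, mu, log_var):
    """Per-example negative log-likelihood corresponding to Eq. (1).

    y, mu, log_var: arrays of shape [N]; mu and log_var would be the two
    heads of a regression network. Predicting log-variance keeps the
    variance positive without an explicit constraint (our assumption).
    """
    inv_var = np.exp(-log_var)
    # The squared error is down-weighted where the predicted variance is
    # large, and the log-variance term penalises claiming high noise everywhere.
    return 0.5 * (inv_var * (y - mu) ** 2 + log_var + np.log(2 * np.pi))

# Toy usage: a high-variance prediction pays less for the same error.
print(heteroscedastic_gaussian_nll(np.array([0.0]), np.array([2.0]),
                                   np.array([0.0])))  # low predicted noise
print(heteroscedastic_gaussian_nll(np.array([0.0]), np.array([2.0]),
                                   np.array([2.0])))  # high predicted noise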

2.2 Heteroscedastic Classification Models

Following on from their development of a neural network heteroscedastic regression model, [kendall2017uncertainties] develop a similarly structured heteroscedastic classification model, which is the state-of-the-art approach for neural network heteroscedastic classification. They place a Gaussian distribution on the logits of a standard softmax classification model, making the logits latent variables:

$$u_c \mid x \sim \mathcal{N}\big(\mu_c(x), \sigma^2_c(x)\big), \qquad p_c = \mathbb{E}_{u}\big[\mathrm{softmax}(u)_c\big] \qquad (2)$$

where $p_c$ is the probability the label is class $c \in \{1, \dots, K\}$ and $K$ is the number of classes. The model’s log-likelihood and a Monte Carlo method for estimating it are also developed; see [kendall2017uncertainties] for details.

3 Latent Variable Heteroscedastic Classification Models

Heteroscedastic classification models have been widely studied in the econometric literature as a branch of discrete choice models [train2009discrete, mcfadden1973conditional, bhat1995heteroscedastic, munizaga2000representation]. It is typically assumed that there is some utility $u_c$ associated with each class $c$. Utility is decomposed into a reference utility $\mu_c(x)$, which is a function of the observed variables $x$, and an unobserved stochastic component $\varepsilon_c$. The probability that class $c$ is selected is the probability that the latent utility $u_c = \mu_c(x) + \varepsilon_c$ is greater than the corresponding value for all other classes:

$$P(y = c \mid x) = P\big(\mu_c(x) + \varepsilon_c > \mu_k(x) + \varepsilon_k, \ \forall k \neq c\big) \qquad (3)$$

In the econometric literature, the reference utility model $\mu_c(x)$ is often chosen to be a linear function. Restricting ourselves briefly to homoscedastic models, if we assume $\varepsilon_c$ is i.i.d. Gumbel distributed then this latent variable formulation leads to a multinomial logistic model, which is exactly equivalent to the softmax classification models with cross-entropy loss widely used with neural networks [train2009discrete]. Similarly, if we choose $\varepsilon_c$ i.i.d. standard Normal, this leads to a multinomial probit model with identity covariance matrix. Heteroscedastic models however break the assumption that the additive noise terms are identically distributed, though they are still assumed to be independent. Computing $P(y = c \mid x)$ or its gradient involves an integral over $\varepsilon$ for which no analytic solutions exist in the heteroscedastic case [train2009discrete, bhat1995heteroscedastic]. Similar latent variable formulations of heteroscedastic classification models have also been developed in the Gaussian Process literature, where naturally it is assumed the latent noise is distributed Normally [hernandez2014mind, williams2006gaussian] and a GP prior is placed on $\mu$ and $\sigma$. Again exact inference on the likelihood is intractable and approximate inference methods are used [hernandez2014mind]. Having seen this latent variable formulation of heteroscedastic classification models we can now revisit the neural network model introduced in Eq. (2), and attempt to cast it similarly as a latent variable model:

$$u_c = \mu_c(x) + \sigma_c(x)\,\varepsilon_c, \qquad \varepsilon_c \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1) \qquad (4)$$

But if we now assume that the probability of the label being class $c$ is the probability of the latent $u_c$ for class $c$ being the maximum, then in general we do not get the same $p_c$ as assumed in [kendall2017uncertainties]:

$$P(y = c \mid x) = P\big(u_c \geq u_k, \ \forall k \neq c\big) \neq \mathbb{E}_{u}\big[\mathrm{softmax}(u)_c\big] \qquad (5)$$

Thus when we interpret the heteroscedastic classification model from [kendall2017uncertainties] as a latent variable model, which it implicitly is, we see that the computation of the predictive probabilities is not theoretically justified.

3.1 Location-Scale Family Latent Variable Heteroscedastic Classification

We can generalize latent variable heteroscedastic models by allowing the distribution on $\varepsilon_c$ to be any distribution, including but not limited to the Normal. As will be seen later, in the deep learning context it will be beneficial to restrict the distribution on $\varepsilon_c$ to be any location-scale family of distributions, e.g. Normal, Gumbel, Logistic. Each member of this family is parameterized by a location parameter and a scale parameter. In terms of our latent variable heteroscedastic model above, we allow $\varepsilon_c \sim F(0, 1)$ where $F$ is a location-scale family:

$$u_c = \mu_c(x) + \sigma_c(x)\,\varepsilon_c, \qquad \varepsilon_c \sim F(0, 1) \qquad (6)$$

$u_c$ is also a member of $F$ by the properties of a location-scale family, and once again the probability that a particular class $c$ is chosen is simply the probability that the latent variable $u_c$ is the maximum of all latent variables:

$$P(y = c \mid x) = \mathbb{E}_{\varepsilon}\big[\mathbb{1}\{u_c \geq u_k, \ \forall k \neq c\}\big] \qquad (7)$$
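As a concrete illustration of Eq. (7), the following NumPy sketch estimates the class probabilities for a single example by sampling the latent utilities and counting argmax wins. The Gaussian choice for F and all variable names are illustrative assumptions; note this estimator is exactly the non-differentiable quantity that the next section smooths.

import numpy as np

def latent_argmax_probs(mu, sigma, n_samples=100_000, rng=None):
    """MC estimate of P(y = c | x) = P(u_c >= u_k for all k), Eq. (7).

    mu, sigma: arrays of shape [K] giving the location and scale of the
    latent utility for each class (Normal noise assumed here).
    """
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    u = mu + sigma * eps                   # u_c = mu_c(x) + sigma_c(x) * eps_c
    wins = np.argmax(u, axis=-1)           # class with the largest sampled utility
    return np.bincount(wins, minlength=mu.shape[0]) / n_samples

mu = np.array([1.0, 0.5, 0.0])
sigma = np.array([0.1, 2.0, 0.1])          # class 1 has much noisier utility
print(latent_argmax_probs(mu, sigma, rng=0))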

4 Deep Heteroscedastic Classification as a Special Case of the Smoothed Accept-Reject Simulator

We have seen that the existing state-of-the-art neural network heteroscedastic model is inexact when interpreted as a latent variable model. We will show that this model is a special case of a well studied approximation in the econometric literature.

4.1 Softmax Approximation

Computing the value of Eq. (7) requires an integral over $\varepsilon$ which in general we cannot compute analytically [train2009discrete]. Eq. (7) is an expectation over the unobserved component of utility which can be estimated with Monte Carlo methods. However, direct Monte Carlo estimation still has the issue that the argmax function’s derivatives are either zero or undefined everywhere in its domain. Therefore we seek a smooth approximation to Eq. (7), similar to the development of the Gumbel-Softmax/Concrete distribution [gumbelsoftmax2017, concrete2017]. We note that in the zero temperature limit the softmax function is equivalent to the argmax, hence:

$$P(y = c \mid x) = \mathbb{E}_{\varepsilon}\big[\mathbb{1}\{u_c \geq u_k, \ \forall k \neq c\}\big] = \lim_{\tau \to 0^{+}} \mathbb{E}_{\varepsilon}\big[\mathrm{softmax}(u / \tau)_c\big] \qquad (8)$$

where the expectation above is over $\varepsilon$, and $\mathbb{1}\{\cdot\}$ is a 0-1 indicator function. A similar result for the binary classification case with sigmoid smoothing function is derived in appendix A. In the econometric literature this approximation is known as the logit-smoothed accept-reject simulator [train2009discrete, mcfadden1989method]. When $F$ is chosen to be Normal it is known as the “logit-kernel probit” [bolduc1996multinomial]. In the context of heteroscedastic models, the smoothed accept-reject simulator can be viewed as a smooth approximation to the true model, where the approximation is exact in the zero temperature limit. While on the surface this approximation seems similar to the Gumbel-Softmax trick, the smoothed output is only distributed according to the Concrete distribution iff the $\varepsilon_c$ are i.i.d. standard Gumbel, which would be a homoscedastic model. In our case, we are free to choose any location-scale family $F$. And as we are particularly interested in heteroscedastic models, we do not restrict $\sigma_c(x)$ to be a constant function, typically choosing to parameterize it as a deep neural network. Despite not motivating it as such, the neural network heteroscedastic classification model with Gaussian noise on the softmax logits, Eq. (4), is in fact a special case of the smoothed accept-reject simulator with $F$ the Normal distribution and $\tau = 1$. However, the authors do not motivate their model as an approximation to a true model or recognise the importance of the temperature parameter in controlling that approximation. As the latent variables $u_c$ belong to a location-scale family, they can be defined in terms of a deterministic location-scale transformation of a standard variate $\varepsilon_c \sim F(0, 1)$. We can thus apply the re-parametrisation trick [diederik2014auto], allowing $u_c = \mu_c(x) + \sigma_c(x)\,\varepsilon_c$, which implies:

$$P(y = c \mid x) \approx \mathbb{E}_{\varepsilon \sim F(0,1)}\Big[\mathrm{softmax}\Big(\frac{\mu(x) + \sigma(x) \odot \varepsilon}{\tau}\Big)_c\Big] \qquad (9)$$

A Monte Carlo estimate of the approximate predictive probabilities can be obtained:

$$P(y = c \mid x) \approx \frac{1}{S} \sum_{s=1}^{S} \mathrm{softmax}\Big(\frac{\mu(x) + \sigma(x) \odot \varepsilon^{(s)}}{\tau}\Big)_c, \qquad \varepsilon^{(s)} \overset{\text{i.i.d.}}{\sim} F(0, 1) \qquad (10)$$

Note that, due to the reparametrization, gradients for Eq. (10) w.r.t. the model parameters $\theta$ can be taken. Once we have computed $\mu(x)$ and $\sigma(x)$, computing the Monte Carlo estimate, i.e. Eq. (10) with $S$ samples over $K$ classes, has computational cost $O(SK)$. This computational cost is typically trivial relative to the computational complexity of computing $\mu(x)$ and $\sigma(x)$. For example, if $\mu$ and $\sigma$ are neural networks with fully-connected layers of dimensionality $D$ (assuming input dimensionality is also $D$), then the computational complexity of computing $\mu(x)$ and $\sigma(x)$ is of order $D^2$ per layer, which for large networks will likely dwarf $O(SK)$. This enables us to reduce the variance of Eq. (10) by taking many Monte Carlo samples with little impact on the training or inference computation time.
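A minimal TensorFlow sketch of the estimator in Eq. (10) follows. Unlike a hard argmax simulator, every operation is differentiable, so gradients with respect to the location and scale networks flow through the sampled logits via the re-parametrisation trick. The function and variable names, and the Gaussian noise choice, are illustrative assumptions rather than the exact released implementation (see Appendix H for the latter).

import tensorflow as tf

def mc_softmax_probs(locs, scales, temperature=1.0, num_samples=1000):
    """Estimates Eq. (10): the mean of S softmax-smoothed latent samples.

    locs, scales: Tensors of shape [batch_size, num_classes] from the
    location and scale heads of the network (Gaussian noise assumed).
    """
    # Standard-normal noise of shape [num_samples, batch_size, num_classes].
    eps = tf.random.normal(tf.concat([[num_samples], tf.shape(locs)], axis=0))
    # Re-parameterised latent utilities: u = mu(x) + sigma(x) * eps.
    latents = locs[tf.newaxis] + scales[tf.newaxis] * eps
    # Softmax at temperature tau smooths the argmax; tau -> 0 recovers it.
    probs = tf.nn.softmax(latents / temperature, axis=-1)
    return tf.reduce_mean(probs, axis=0)  # [batch_size, num_classes]

# Usage: feed the averaged probabilities into a cross-entropy loss, e.g.
# loss = -tf.reduce_mean(tf.math.log(tf.reduce_sum(probs * y_onehot, axis=-1))).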

4.2 Temperature Parameter Importance

In our softmax approximation, as the temperature $\tau$ gets closer to zero the bias in the approximation to the true objective goes down, but the variance of the gradients of our approximate objective increases [gumbelsoftmax2017, concrete2017]. Thus the temperature parameter controls a bias-variance trade-off. Tuning this temperature, for example on a validation set, may enable us to find a better point along this trade-off than just setting $\tau = 1$ for all datasets and models. By increasing the number of Monte Carlo samples $S$ we take when estimating the approximate predictive distribution, Eq. (9), we reduce the variance of the log-likelihood gradients. We then have two ways to control the variance of our gradients, $\tau$ and $S$. However, varying $S$ does not impact the bias of our gradients. As a result, we can predict that when more Monte Carlo samples are taken during training, the optimal temperature for maximizing predictive performance is lower than when fewer samples are taken, and the corresponding predictive performance is higher. This insight demonstrates the importance of our theoretical developments above in understanding the role of the temperature parameter, implicitly set to 1 in previous work [kendall2017uncertainties]. We test this empirically below.

5 Related Work

5.1 Aleatoric Uncertainty in Deep Learning

Estimating uncertainty in deep learning has predominantly focused on estimating epistemic uncertainty [gal2016dropout, blundell2015weight]. Nevertheless, seminal work on estimating heteroscedastic aleatoric uncertainty in regression problems with neural networks was done in [bishop1997regression], which allows the variance term in the Gaussian likelihood to vary as a learned function of the inputs. The authors show that maximum likelihood estimation of the model parameters is biased and develop a Bayesian treatment for parameter estimation. Follow-up work [kendall2017uncertainties] revisits the aforementioned regression model and also introduces the heteroscedastic classification model discussed earlier in this paper. The authors show that these heteroscedastic models can also be combined with MC dropout [gal2016dropout] approximate Bayesian inference for epistemic uncertainty estimation. The combined heteroscedastic Bayesian model yields improved performance on semantic segmentation and depth regression tasks. An ensembling approach to uncertainty estimation in deep learning has been proposed in [deepensembles.2018], which suggests using multiple models to estimate both aleatoric and epistemic uncertainty. Along the same lines, [liu2019accurate] introduces a Bayesian non-parametric ensemble to estimate both sources of uncertainty. Estimating heteroscedastic aleatoric uncertainty with deep models specifically for images has been studied in [ayhan2018testtimeDA]. The authors suggest using traditional data augmentation methods, such as geometric and color transformations at test time, in order to estimate how much the network output varies in the vicinity of the input examples. Finally, some recent research efforts aim to estimate specific uncertainty metrics. For example, the authors in [pearce2018predictionintervals] introduce a novel loss function, which allows them to use ensemble networks to estimate prediction intervals without making any assumptions on the output distribution. Another work [natasa2019uncertainties] introduces a loss function to learn all the conditional quantiles, which are used to compute well-calibrated prediction intervals. For estimating epistemic uncertainty the authors propose a method based on orthonormal certificates.

5.2 Noisy Labels

A large literature exists that seeks to tackle the problem of noisy labels in training deep neural networks. Most of the methods try to identify samples with incorrect labels and explicitly reduce their influence in the learning process by removing them or under-weighting them in the loss function. We tackle the problem of aleatoric uncertainty more generally, which has applications to datasets with incorrectly labelled samples, but also to uncertainty estimation and to problems such as image segmentation where it is unclear how to apply these noisy label methods. Nevertheless, for the sake of completeness, we briefly review some of the key papers tackling noisy labels directly. The MentorNet method [MentorNet.2018] employs a second model (called the MentorNet) that estimates the sample weights (aka curriculum) used to supervise the training of a StudentNet, which is the main network. The MentorNet can be learned to approximate an existing pre-defined curriculum or to discover new curriculums from data. In the latter case the curriculum is learned using a small dataset with clean labels. The Co-teaching method [CoTeaching.2018] also jointly trains two neural networks, with each network teaching the other. At each training step, each network computes predictions on a mini-batch of samples and identifies small-loss samples, which are then fed to the other network for learning. The justification behind the use of small-loss samples is that they are more likely to have clean labels. Although Co-teaching helps with learning under noisy labels, the two networks tend to converge to a consensus over the training iterations. As a result, Co-teaching reduces to the self-training MentorNet after a while. In order to address this issue, follow-up work [CoTeaching++.2019] introduces disagreement between the predictions produced by the two networks. Note finally that, in contrast to our method, none of the noisy label approaches outlined above provides an estimate of the aleatoric uncertainty, which is a very powerful tool and enables a wide range of learning tasks including image segmentation.

6 Experiments

Figure 1: Effect of temperature on test accuracy (left) and test loss (right) for the experiments with controlled label noise for MNIST and SVHN datasets.

Public classification datasets are typically collected in such a manner as to avoid noisy labels and heteroscedasticity. In the experiments below we demonstrate our model on a number of datasets where we generate heteroscedasticity synthetically. However, there are some public datasets for important machine learning problems that do exhibit heteroscedasticity. Image segmentation is one such example [kendall2017uncertainties]. So in addition to our synthetic experiments we also evaluate our model on two image segmentation benchmarks, PASCAL VOC [everingham2014pascal] and Cityscapes [cordts2016cityscapes]. In general it is difficult to know a priori whether a real world dataset will exhibit heteroscedasticity. We address this question in appendix B, where we show that common problems with real world datasets such as noisy human labellers and missing not at random (MNAR) data satisfy the necessary conditions for heteroscedasticity.

6.1 Controlled Label Noise

Dataset     τ*    Loss(τ*)   Loss(τ=1)   p-value   Homoscedastic loss   p-value
CIFAR-10    2.2   1.295      1.30        9         1.318                5
MNIST       100   0.767      0.808       2         0.807                6
SVHN        20    0.794      0.804       4         0.823                1

Table 1: Heteroscedastic test set cross-entropy loss at the optimal temperature τ* vs. at τ = 1 vs. homoscedastic test set cross-entropy loss. The optimal τ* is determined on the validation set. Each p-value is from a paired sample two-tailed t-test where replicates are from corresponding random seeds; 50 replicates are used. T-tests are conducted in reference to the heteroscedastic loss at the optimal τ*. τ* is the temperature which minimizes cross-entropy loss on the validation set.

We generate heteroscedastic uncertainty synthetically in three standard image classification datasets: CIFAR-10 [krizhevsky2009learning], MNIST [lecun1998mnist] and SVHN [netzer2011reading]. We corrupt the labels of some examples in a data-conditional manner as follows. Examples with labels 0-4 are left uncorrupted; for examples with labels 5-9 we randomly assign a new label. For examples with label 5, 20% of training examples were assigned a new randomly selected label, with 30%, 40%, 50% and 60% of examples with labels 6, 7, 8 and 9 respectively receiving the same treatment. A simple convolutional network architecture is used in each experiment; see appendix D for details. Each plot in fig. 1 shows an average over 50 training runs. The noise model used in all heteroscedastic models is Gaussian. The number of Monte Carlo samples during training is varied across methods; however, during predictions on the test set and the validation set (for early stopping) we always use 1,000 samples for all heteroscedastic methods. Results are reported on the test set with similarly corrupted labels.
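The corruption procedure is straightforward to reproduce. Below is a NumPy sketch of our reading of it; whether the replacement label may coincide with the original label is an assumption on our part.

import numpy as np

# Per-class corruption probabilities: labels 0-4 are clean, labels 5-9 are
# relabelled uniformly at random with increasing probability.
CORRUPT_PROB = {5: 0.2, 6: 0.3, 7: 0.4, 8: 0.5, 9: 0.6}

def corrupt_labels(labels, num_classes=10, rng=None):
    rng = np.random.default_rng(rng)
    labels = labels.copy()
    for cls, p in CORRUPT_PROB.items():
        idx = np.where(labels == cls)[0]
        flip = idx[rng.random(idx.shape[0]) < p]
        # Assumption: the new label is drawn uniformly over all classes.
        labels[flip] = rng.integers(0, num_classes, size=flip.shape[0])
    return labels

noisy_train_labels = corrupt_labels(np.repeat(np.arange(10), 100), rng=0)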

6.1.1 Do heteroscedastic models outperform homoscedastic models when there exists heteroscedastic noise?

First we wish to verify whether heteroscedastic models do in fact outperform the standard homoscedastic model when we know heteroscedastic label noise exists. Looking at fig. 1 it is clear that there are large ranges of temperatures for which the heteroscedastic test set accuracy is higher than the homoscedastic model’s and similarly that the test set loss is lower than the homoscedastic model’s. This is true for all numbers of training set Monte Carlo samples. See fig. 6 in appendix C for similar results on CIFAR-10. We also conduct a more formal test. We select the optimal temperature for each dataset based on the validation set loss. Then we conduct a paired sample t-test between the homoscedastic model’s loss and the heteroscedastic model’s loss at the optimal temperature on the test set. Replicates in the t-test are paired by having corresponding random seeds. In Table 1 we see that for each dataset the best heteroscedastic model does in fact outperform the homoscedastic model and that the difference in test set loss is statistically significant.

6.1.2 Is 1.0 always the optimal temperature?

In addition to comparing our method to the homoscedastic baseline, we wish to compare to [kendall2017uncertainties], who implicitly set the softmax temperature to 1 and use a Gaussian noise model on the logits. We first test whether in practice the temperature plays the important role we claim in controlling a bias-variance trade-off, or whether the standard choice of setting the softmax temperature to 1 is optimal. For a fair comparison we also restrict the noise model to be Gaussian. We again select the optimal temperature on the validation set and then perform a paired t-test between the heteroscedastic model at the optimal temperature and at τ = 1 on the test set. In Table 1 we see the results. On all datasets the optimal temperature is not 1 and the difference in performance between the optimal temperature and τ = 1 is statistically significant. Thus the optimal temperature is not always 1 and the performance of heteroscedastic models can be improved by tuning the softmax temperature. Interestingly, note that for MNIST the heteroscedastic model with τ = 1 has higher loss than the homoscedastic model, but the heteroscedastic model at the optimal temperature is statistically significantly better than both.

6.1.3 Do lower variance gradients improve predictive performance?

Figures 1 and 6 show the predictive performance as we vary the temperature for a number of different training-time Monte Carlo sample counts. During evaluation 1,000 Monte Carlo samples are always used. If we take a vertical slice at any temperature along these plots then we are holding the temperature, and hence the bias, constant. Along this vertical slice the number of Monte Carlo samples varies. More Monte Carlo samples reduce the variance of the corresponding gradients. It is clear from these plots that, as expected, increasing the number of Monte Carlo samples and hence reducing the variance of the gradients during training improves the final predictive performance. In particular, if we compare the predictive performance with the smallest number of training samples vs. the largest, there is a substantial gap in performance on all datasets.

(a) PASCAL VOC
(b) Cityscapes performance
Figure 2: Heteroscedastic vs homoscedastic image segmentation on PASCAL VOC and Cityscapes datasets. Results are averaged over 20 random seeds; the shaded area shows standard deviation. The temperature is plotted in log scale.

6.2 Image Segmentation

Dataset       τ*     mIoU(τ*)   mIoU(τ=1)   p-value   Homoscedastic mIoU   p-value
Cityscapes    0.05   76.95%     76.20%      2.0       76.29%               2.4
PASCAL VOC    0.01   86.03%     84.92%      4.9       85.02%               1.6

Table 2: Image segmentation results for heteroscedastic and homoscedastic models. p-values are from a paired sample two-tailed t-test where replicates are from corresponding random seeds; 20 replicates are used. T-tests are conducted in reference to the heteroscedastic mIoU at the optimal τ*.
Figure 3: Image segmentation results for a selected image from the internet (outside of the PASCAL VOC dataset).
Figure 4: Heatmaps of aleatoric uncertainty (predictive distribution per-class variance) for two heteroscedastic models, corresponding to τ = 1 (top) and the optimal temperature (bottom). Estimated using Monte Carlo samples.

Heteroscedastic models are well suited to image segmentation due to the naturally occurring data-dependent uncertainty. It is excessively time-consuming for a human annotator to individually label every pixel in an image (as this would require one labelling operation per pixel for a single image). In practice human annotators typically label collections of pixels at a time. As a result, annotations tend to be noisy at the boundaries of objects, where multiple pixels may be labelled together as either the foreground object or the background class. We apply our heteroscedastic models with normally distributed latent variables to real-world image segmentation datasets, namely PASCAL VOC 2012 [everingham2014pascal] and Cityscapes [cordts2016cityscapes]. Details about the model architecture and the training setup can be found in Appendix F. At a high level we follow the same end-to-end image segmentation setup as in [chen2018deeplabv3+], with the only difference being the application of our heteroscedastic sampling process to the model output. Performance is measured by mean Intersection over Union (mIoU), as is standard for image segmentation. Figs. 2(a) and 2(b) show the effect of the temperature on the mean IoU on the two image segmentation datasets, using Monte Carlo samples as in Eq. (10). We note that we report validation set results as the test set is not public and only a limited number of submissions to the test server are allowed. The temperature is plotted on a log scale. Heteroscedastic models outperform the homoscedastic model for a wide range of temperatures. Similar to the results on the controlled label noise experiments, the optimal temperature is not found at τ = 1. In fact, averaging over 20 runs, for both datasets the heteroscedastic model at τ = 1 is outperformed by the homoscedastic model. Table 2 shows that the differences in performance between the heteroscedastic model at the optimal temperature and both the heteroscedastic model at τ = 1 and the homoscedastic model are statistically significant. The difference in the models also leads to qualitatively different segmentations and uncertainty heatmaps. Fig. 3 shows an example segmentation, using the best homoscedastic, τ = 1 heteroscedastic and optimal-temperature heteroscedastic models trained on PASCAL VOC. Reflecting the improvement in mean IoU, the heteroscedastic segmentation at the optimal temperature is qualitatively superior. Further examples are shown in Appendix G, where we have selected both successful segmentations and failure cases. In addition to the improved segmentation performance, a key benefit of the heteroscedastic approach is the ability to estimate aleatoric uncertainty. Fig. 4 shows heat maps of the per-class variance of the predictive distribution. As expected, the regions of highest aleatoric uncertainty are at object boundaries. Interestingly, the heteroscedastic uncertainty heatmaps at the optimal temperature are more fine grained and precise than the heatmaps at τ = 1.

7 Conclusion and Future Work

The [kendall2017uncertainties] deep heteroscedastic classification model has been shown to perform well on tasks such as semantic segmentation, but the use of the softmax link function lacks a theoretical justification. We have argued that the true generative model of the data involves an argmax over latent variables and that the softmax should be viewed as an approximation to this argmax. The approximation is exact in the zero temperature limit, but in practice the temperature must be tuned to balance the bias in the approximation against the variance of the gradients. We have found that this view of heteroscedastic classification models already exists in the econometrics literature, but with linear function approximators. By developing this connection to the econometrics literature we place deep heteroscedastic classification models on firmer theoretical footing. We use this theory to improve the predictive performance of these models. By tuning the temperature and the corresponding bias-variance trade-off, we have shown improved performance on a range of image classification tasks where we add noise to the labels. Likewise, on two standard image segmentation benchmarks which naturally have noisy labels, tuning the temperature results in significantly improved mean IoU. In fact on three of the five datasets presented above the [kendall2017uncertainties] model is outperformed by a homoscedastic model, while our heteroscedastic model at the optimal temperature outperforms both the homoscedastic model and the heteroscedastic model at τ = 1. The above heteroscedastic models relax the “identically distributed” part of the i.i.d. assumption on the additive noise in homoscedastic models, but still assume the noise terms are independent. For some problems this assumption may be unrealistic, e.g. for image segmentation the noise on the background class for a pixel at the edge of an object should be anti-correlated with the noise on the object class. In future work, by building on our latent variable formulation of deep classification models, we plan to relax this independence assumption in addition to the identical-distribution assumption.

References

Appendix A Heteroscedastic Binary Classification

For multi-class classification we use the softmax as a smoothing function for the argmax, with the guarantee of equivalence in the zero temperature limit. For binary classification it is more convenient to avoid the use of the vector-valued argmax and softmax functions and simply have the model output the probability $p$ of one class being chosen, in which case the probability of the other class is simply $1 - p$:

$$p = P(u > 0) = \mathbb{E}_{\varepsilon}\big[\mathbb{1}\{\mu(x) + \sigma(x)\,\varepsilon > 0\}\big] = \lim_{\tau \to 0^{+}} \mathbb{E}_{\varepsilon}\Big[\mathrm{sigmoid}\Big(\frac{\mu(x) + \sigma(x)\,\varepsilon}{\tau}\Big)\Big] \qquad (11)$$

The key step is to replace the difference of the two latent variables with a single latent variable $u = \mu(x) + \sigma(x)\,\varepsilon$, which is valid as all latent variables are members of the location-scale family $F$. This sigmoid smoothing function has also been used in the econometrics literature [train2009discrete].
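For concreteness, a NumPy sketch of the Monte Carlo estimator corresponding to Eq. (11); the Gaussian noise choice and all names are illustrative assumptions.

import numpy as np

def mc_sigmoid_prob(loc, scale, temperature=1.0, n_samples=1000, rng=None):
    """MC estimate of p in Eq. (11) for a single example.

    loc, scale: scalars parameterising the single latent variable
    u = loc + scale * eps (Normal noise assumed here).
    """
    rng = np.random.default_rng(rng)
    u = loc + scale * rng.standard_normal(n_samples)
    # The sigmoid at temperature tau smooths the 0-1 indicator 1{u > 0}.
    return np.mean(1.0 / (1.0 + np.exp(-u / temperature)))

print(mc_sigmoid_prob(loc=0.5, scale=2.0, temperature=0.1, rng=0))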

Figure 5: Example graphical models satisfying conditions for heteroscedasticity: (a) noisy labels, (b) MNAR missing data, (c) hard examples.

Appendix B Necessary Conditions for Heteroscedasticity

Given a classification task, it can be difficult to tell a priori whether the problem is heteroscedastic or homoscedastic. Here we state the necessary conditions for heteroscedasticity to exist. We also provide some graphical models that satisfy these conditions and correspond to reasonable models of real world applications. In practice it is always an empirical question as to whether some particular heteroscedastic model outperforms a homoscedastic model, but we hope these examples will provide a framework to think about heteroscedastic modelling and when it is likely to be helpful for a given task. We wish to predict a label $y$ given some observed variables $x$. In order for us to be uncertain about the value of $y$ there must be some other unobserved variables $z$ which influence $y$. If $x$ and $z$ are independent, e.g. we assume $z$ is the source of additive noise in a latent variable model, then a heteroscedastic model won’t help, as we only observe $x$, which is independent of $z$. Hence the necessary conditions for heteroscedasticity are that $z$ influences $y$ and that $z$ is not independent of $x$. Figure 5 shows some graphical models which satisfy these conditions. Our synthetically generated noisy labels are well modeled by fig. 5(a). Academic datasets typically do not have missing elements in the input features, but when working on applications of machine learning, real world datasets very often have missing data. Interestingly, datasets with missing not at random (MNAR) missingness [little2014statistical] satisfy the necessary conditions for heteroscedasticity. Fig. 5(b) may be a reasonable graphical model for MNAR missing data, where the missing components may be related in complex bi-directional relationships with the observed variables. Fig. 5(c) models cases such as image segmentation. Here the observed variables $x$ are predictive of the imagined “true labels”, but we must deal with human-labelled examples. In the image segmentation example, whether the pixel is at the boundary of an object (observed in $x$), combined with unobserved features of the human labeller such as labelling method, speed of labelling, attention to detail, etc., interact to yield the observed labels.
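To make these conditions concrete, here is a toy generative process in the spirit of fig. 5(a): the probability that the observed label is flipped depends on the input, so the unobserved flipping variable z is not independent of x and the label noise is heteroscedastic. The specific functional forms are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(-3, 3, size=n)                 # observed inputs
true_y = (x > 0).astype(int)                   # clean labels depend on x
# Unobserved labelling noise z: inputs near the decision boundary are flipped
# more often, so z depends on x and influences the observed label y.
p_flip = 0.4 * np.exp(-x ** 2)
z = rng.random(n) < p_flip
y = np.where(z, 1 - true_y, true_y)            # observed, noisy labels
print("flip rate overall:", (y != true_y).mean())
print("flip rate for |x| < 0.5:", (y != true_y)[np.abs(x) < 0.5].mean())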

Appendix C Controlled Label Noise: Further Results

Due to lack of space we provide in Fig. 6 the classification results for CIFAR-10.

Figure 6: Effect of temperature on test accuracy (left) and test loss (right) on CIFAR-10.

Appendix D Controlled Label Noise Experiments: Architecture and Training Details

For MNIST experiments we use a three hidden layer MLP with 1024 units at each layer and the leaky-ReLU activation function. For CIFAR-10 and SVHN experiments we use a convolutional neural network with three convolution layers with 32-64-64 3x3 filters and stride=1. Each convolutional layer is followed by a max-pooling layer with 2x2 window and stride=1. Two fully-connected layers of 256-128 units follow the convolutional layers. All hidden layers use the leaky-ReLU activation function. We train with Adam with default parameters and a learning rate of 0.001. Networks are trained for a maximum of 1,000 epochs, being stopped early if the validation set loss has not improved in 10 epochs. The best validation set checkpoint is used for test set evaluation.
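For reference, a Keras sketch of the convolutional backbone described above; the padding choice and the wiring of the final heteroscedastic layer are our assumptions (the MCSoftmaxDense layer of Appendix H would replace the usual dense softmax head).

import tensorflow as tf

def build_backbone(input_shape=(32, 32, 3)):
    """Conv backbone as described in this appendix (padding='same' assumed)."""
    layers = tf.keras.layers
    model = tf.keras.Sequential()
    model.add(layers.Conv2D(32, 3, strides=1, padding='same',
                            input_shape=input_shape))
    model.add(layers.LeakyReLU())
    model.add(layers.MaxPooling2D(pool_size=2, strides=1))
    for filters in (64, 64):
        model.add(layers.Conv2D(filters, 3, strides=1, padding='same'))
        model.add(layers.LeakyReLU())
        model.add(layers.MaxPooling2D(pool_size=2, strides=1))
    model.add(layers.Flatten())
    for units in (256, 128):
        model.add(layers.Dense(units))
        model.add(layers.LeakyReLU())
    return model

backbone = build_backbone()
# A heteroscedastic output layer (e.g. MCSoftmaxDense from Appendix H) would
# then consume backbone(x) to produce the predictive distribution.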

Appendix E Text Classification: Experiments

We have also tested our method on data of a different modality beyond images. In particular, we have looked into text classification using the publicly available text classification datasets summarized in Table 3. We use a similar setup as in Appendix D with controlled label noise. For the Global Warming dataset, which has 2 classes, we use the binary classification formulation discussed in Appendix A. Given that the input data is pure text, we use pre-trained text embeddings (https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1) that are available in TensorFlow Hub. We chose to use simple neural networks that consist of the pre-trained embedding module followed by a single dense layer of size 32 and the final heteroscedastic layer. For the experimental validation we used a controlled label noise setup whereby for half of the classes we randomly flip the label and the other half remains untouched. The flip probability depends on the class index and ramps up linearly from 0.25 to 0.5. Fig. 7 shows the behaviour of the test accuracy as well as the test loss with respect to the temperature. The results on Global Warming at the optimal temperature (selected on the validation set) vs. the homoscedastic model are statistically significant for the test loss, and those on Political Message are statistically significant in both metrics (loss and accuracy). Notice that once more the optimal temperature may not always be equal to one.

Data set Train Examples Val. Examples Test Examples Classes
Global Warming 3380 422 423 2
Political Message 4000 500 500 9
Table 3: Statistics for the NLP classification data sets. Number of examples in the training set, validation set and test set and number of classes. All data sets are publicly available from crowdflower.com.
Figure 7: Effect of temperature on test accuracy (left) and test loss (right) on the Global Warming (top) and Political Message (bottom) text classification datasets.

Appendix F Image Segmentation: Architecture and Training Setup

We replicate the DeepLabv3+ [chen2018deeplabv3+] architecture and training setup, which achieves state-of-the-art image segmentation results. DeepLabv3+ uses an Xception [chollet2017xception] based architecture with an added decoder module. In particular we use the Xception65 architecture with the output stride used in [chen2018deeplabv3+]. Following DeepLabv3+, all our methods are warm-started from the same checkpoint, which is a network pre-trained on JFT and MSCoco. We train for 150K steps using the SGD optimizer with a fixed learning rate and otherwise default parameters. We use a fixed batch size and keep the batch normalisation statistics from the pre-trained checkpoint fixed. Training is performed on the same augmented training data set used for DeepLabV3+. In the homoscedastic model, a single convolution is applied to the output of the decoder, followed by bilinear upsampling to the size of the image, in order to compute logits for each pixel. Strictly speaking, either the final features should be upsampled to the original image size and used to compute correctly sized scale and location parameters, or the scale and location parameters should be computed at a lower dimension and upsampled to full size. However, this increases the number of Monte Carlo samples required (by a factor of 16 in this case), which makes it difficult to fit in memory on a single GPU. We therefore sample the “logits” at a lower dimension and upsample to full image dimensions by bilinear interpolation.

Appendix G Image segmentation examples

[Qualitative segmentation examples. Columns: input image, ground truth, homoscedastic prediction, and heteroscedastic predictions at several temperatures.]

Appendix H Keras Layer Code

"""Libraryofmethodstocomputeheteroscedasticclassificationpredictions."""
import abc
import numpy as np
import tensorflow.compat.v2 as tf
import tensorflow_probability as tfp
import heteroscedastic_lib_utils as het_lib_utils
class MCSoftmaxOutputLayer(tf.keras.layers.Layer):
  """BaseclassforMCheteroscesasticoutputlayers."""
  __metaclass__ = abc.ABCMeta
  def __init__(self, num_classes, logit_noise=het_lib_utils.LogitNoise.NORMAL,
               temperature=1.0, train_mc_samples=1000, test_mc_samples=1000,
               max_mc_samples_in_memory=1000, compute_pred_variance=False,
               name='MCSoftmaxOutputLayer'):
    """Creates an instance of MCSoftmaxOutputLayer.

    Args:
      num_classes: Integer. Number of classes for classification task.
      logit_noise: LogitNoise instance. The noise distribution
        assumed on the softmax logits. Possible values:
        LogitNoise.NORMAL, LogitNoise.LOGISTIC, LogitNoise.GUMBEL.
      temperature: Float or scalar `Tensor` representing the softmax
        temperature.
      train_mc_samples: The number of Monte-Carlo samples used to estimate the
        predictive distribution during training.
      test_mc_samples: The number of Monte-Carlo samples used to estimate the
        predictive distribution during testing/inference.
      max_mc_samples_in_memory: When estimating the predictive distribution a
        `Tensor` of shape [batch_size, max_mc_samples_in_memory, num_classes]
        will be computed. Set max_mc_samples_in_memory as high as possible for
        efficient computation but low enough such that OOM errors are avoided.
      compute_pred_variance: Boolean. Whether to estimate the predictive
        variance. If False the __call__ method will output None for the
        predictive_variance tensor.
      name: String. The name of the layer used for name scoping.

    Returns:
      MCOutputLayer instance.
    """
    super(MCSoftmaxOutputLayer, self).__init__(name=name)
    self._num_classes = num_classes
    self._logit_noise = logit_noise
    self._temperature = temperature
    self._train_mc_samples = train_mc_samples
    self._test_mc_samples = test_mc_samples
    self._max_mc_samples_in_memory = max_mc_samples_in_memory
    self._compute_pred_variance = compute_pred_variance
    self._name = name
  def _compute_noise_samples(self, scale, num_samples, seed):
    """Utilityfunctiontocomputethesamplesofthelogitnoise.
␣␣␣␣Args:
␣␣␣␣␣␣scale:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Scaleparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣num_samples:Integer.NumberofMonte-Carlosamplestotake.
␣␣␣␣␣␣seed:PythonintegerorTensorforseedingtherandomnumbergenerator.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensor.Logitnoisesamplesofshape:[batch_size,num_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].
␣␣␣␣Raises:
␣␣␣␣␣␣ValueError:whenlogit_noiseisGumbelorLogisticandseedisaTensor.
␣␣␣␣"""
    if seed is None or isinstance(seed, int):
      if self._logit_noise == het_lib_utils.LogitNoise.NORMAL:
        dist = tfp.distributions.Normal(loc=tf.zeros_like(scale), scale=scale)
      elif self._logit_noise == het_lib_utils.LogitNoise.LOGISTIC:
        dist = tfp.distributions.Logistic(loc=tf.zeros_like(scale),
                                          scale=scale)
      else:
        dist = tfp.distributions.Gumbel(loc=tf.zeros_like(scale), scale=scale)
      # dist.sample(total_mc_samples) returns Tensor of shape:
      # [total_mc_samples, batch_size, d], here we reshape to:
      # [batch_size, total_mc_samples, d]
      tf.random.set_seed(seed)
      noise_samples = dist.sample(num_samples, seed=seed)
    else:
      seed_delta = (max(self._train_mc_samples, self._test_mc_samples)//
                    self._max_mc_samples_in_memory) + 1
      # avoiding seed collisions over multiple calls to _compute_noise_samples
      if self._logit_noise == het_lib_utils.LogitNoise.NORMAL:
        noise_samples = tf.random.stateless_normal(
            tf.concat([[num_samples], tf.shape(scale)], axis=0),
            [seed, seed + seed_delta],
            mean=tf.zeros_like(tf.expand_dims(scale, axis=0)),
            stddev=tf.expand_dims(scale, axis=0))
      else:
        raise ValueError('Non-integer seeds are only supported for '
                         'LogitNoise.NORMAL')
    return tf.transpose(
        noise_samples,
        tf.concat([[1, 0], tf.range(2, tf.rank(noise_samples))], 0))
  def _compute_mc_samples(self, locs, scale, num_samples, seed, use_argmax):
    """UtilityfunctiontocomputeMonte-Carlosamples.
␣␣␣␣Args:
␣␣␣␣␣␣locs:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Locationparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣scale:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Scaleparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣num_samples:Integer.NumberofMonte-Carlosamplestotake.
␣␣␣␣␣␣seed:PythonintegerorTensorforseedingtherandomnumbergenerator.
␣␣␣␣␣␣use_argmax:Boolean.Whethertousethesoftmaxorargmaxtocompute
␣␣␣␣␣␣␣␣thepredictivedistribution.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensorofshape[batch_size,num_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].AlloftheMCsamples.
␣␣␣␣Raises:
␣␣␣␣␣␣ValueError:whenlogit_noiseisGumbelorLogisticandseedisset
␣␣␣␣"""
    locs = tf.expand_dims(locs, axis=1)
    noise_samples = self._compute_noise_samples(scale, num_samples, seed)
    latents = locs + noise_samples
    if self._num_classes == 2:
      if use_argmax:
        probs = tf.cast(latents > 0.5, latents.dtype)
      else:
        probs = tf.math.sigmoid(latents / self._temperature)
    else:
      if use_argmax:
        probs = tf.cast(
            tf.equal(latents, tf.reduce_max(latents, -1, keepdims=True)),
            latents.dtype)
      else:
        probs = tf.nn.softmax(latents / self._temperature, axis=-1)
    return probs
  def _compute_predictive_dist(self, locs, scale, total_mc_samples, seed,
                               use_argmax):
    """Utilityfunctiontocomputetheestimatedpredictivedistribution.
␣␣␣␣Args:
␣␣␣␣␣␣locs:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Locationparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣scale:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Scaleparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣total_mc_samples:Integer.NumberofMonte-Carlosamplestotake.
␣␣␣␣␣␣seed:PythonintegerorscalarTensorinitialseed,forseedingtherandom
␣␣␣␣␣␣␣␣numbergenerator.
␣␣␣␣␣␣use_argmax:Boolean.Whethertousethesoftmaxorargmaxtocompute
␣␣␣␣␣␣␣␣thepredictivedistribution.
␣␣␣␣Returns:
␣␣␣␣␣␣Tupeof(probs_mean,seeds_list,samples_list).␣␣WhereprobsisaTensor
␣␣␣␣␣␣ofshape[batch_size,...,1ifnum_classes==2elsenum_classes]-the
␣␣␣␣␣␣meanoftheMCsamples.seedsisalistofPythonintegerorscalarTensor
␣␣␣␣␣␣seedsusedtogeneratetherandomsamples.samplesisalistofintegers
␣␣␣␣␣␣containingthenumberofsamplestakenineachbatch,
␣␣␣␣␣␣sum(samples)==total_mc_samples.
␣␣␣␣Raises:
␣␣␣␣␣␣ValueError:whenlogit_noiseisGumbelorLogisticandseedisset
␣␣␣␣"""
    seeds_list = []
    samples_list = []
    if total_mc_samples <= self._max_mc_samples_in_memory:
      if self._compute_pred_variance and seed is None:
        seed = het_lib_utils.gen_int_seed()
      probs = self._compute_mc_samples(locs, scale, total_mc_samples, seed,
                                       use_argmax)
      seeds_list.append(seed)
      samples_list.append(total_mc_samples)
    else:
      # divide total_mc_samples into batches of samples that fit in memory
      # need (total_mc_samples // self._max_mc_samples_in_memory) batches of
      # size self._max_mc_samples_in_memory and maybe 1 additional batch of size
      # total_mc_samples % self._max_mc_samples_in_memory
      same_sample_batches = total_mc_samples // self._max_mc_samples_in_memory
      num_samples = [self._max_mc_samples_in_memory] * same_sample_batches
      sampling_weights = ([self._max_mc_samples_in_memory /
                           float(total_mc_samples)] * same_sample_batches)
      if total_mc_samples % self._max_mc_samples_in_memory > 0:
        remainder_samples = total_mc_samples % self._max_mc_samples_in_memory
        num_samples.append(remainder_samples)
        sampling_weights.append(remainder_samples / float(total_mc_samples))
      compute_mc_samples = tf.recompute_grad(self._compute_mc_samples)
      if seed is None:
        seed = het_lib_utils.gen_tensor_seed()
      for i, (sampling_ops, weight) in enumerate(
          zip(num_samples, sampling_weights)):
        seed = seed + 1  # unique seed for each set of samples
        unweighted_probs = compute_mc_samples(
            locs, scale, sampling_ops, seed, use_argmax)
        if i == 0:
          probs = weight * unweighted_probs
        else:
          probs = probs + weight * unweighted_probs
        seeds_list.append(seed)
        samples_list.append(sampling_ops)
    return tf.reduce_mean(probs, axis=1), seeds_list, samples_list
  def _compute_predictive_variance(self, mean, locs, scale, seeds_list,
                                   samples_list, use_argmax):
    """Utilityfunctiontocomputetheperclasspredictivevariance.
␣␣␣␣Args:
␣␣␣␣␣␣mean:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Estimatedpredictive
␣␣␣␣␣␣␣␣distribution.
␣␣␣␣␣␣locs:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Locationparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣scale:Tensorofshape[batch_size,total_mc_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Scaleparametersofthe
␣␣␣␣␣␣␣␣distributionstobesampled.
␣␣␣␣␣␣seeds_list:ListofscalarTensorsforseedingtherandomnumber
␣␣␣␣␣␣␣␣generator.
␣␣␣␣␣␣samples_list:ListofIntegers.NumberofMonte-Carlosamplestotake.
␣␣␣␣␣␣use_argmax:Boolean.Whethertousethesoftmaxorargmaxtocompute
␣␣␣␣␣␣␣␣thepredictivedistribution.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensorofshape:[batch_size,num_samples,...,
␣␣␣␣␣␣␣␣1ifnum_classes==2elsenum_classes].Estimatedpredictivevariance.
␣␣␣␣Raises:
␣␣␣␣␣␣ValueError:whenlogit_noiseisGumbelorLogisticandseedisaTensor.
␣␣␣␣"""
    mean = tf.expand_dims(mean, axis=1)
    total_samples = float(sum(samples_list))
    for i, (num_samples, seed) in enumerate(zip(samples_list, seeds_list)):
      mc_samples = self._compute_mc_samples(locs, scale, num_samples, seed,
                                            use_argmax)
      weight = num_samples / total_samples
      variance = tf.reduce_mean((mc_samples - mean)**2, axis=1)
      if i == 0:
        total_variance = weight * variance
      else:
        total_variance = total_variance + weight * variance
    return total_variance
  @abc.abstractmethod
  def _compute_loc_param(self, inputs):
    """Computeslocationparameterofthe"logits distribution".
␣␣␣␣Args:
␣␣␣␣␣␣inputs:Tensor.Theinputtotheheteroscedasticoutputlayer.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensorofshape[batch_size,...,num_classes].
␣␣␣␣"""
    return
  @abc.abstractmethod
  def _compute_scale_param(self, inputs):
    """Computesscaleparameterofthe"logits distribution".
␣␣␣␣Args:
␣␣␣␣␣␣inputs:Tensor.Theinputtotheheteroscedasticoutputlayer.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensorofshape[batch_size,...,num_classes].
␣␣␣␣"""
    return
  def __call__(self, inputs, training=True, argmax_preds=False, seed=None):
    """Computespredictiveandlogpredictivedistribution.
␣␣␣␣UsesMonteCarloestimateofsoftmaxapproximationtoheteroscedasticmodel
␣␣␣␣tocomputepredictivedistribution.O(mc_samples*num_classes).
␣␣␣␣Args:
␣␣␣␣␣␣inputs:Tensor.Theinputtotheheteroscedasticoutputlayer.
␣␣␣␣␣␣training:Boolean.Whetherwearetrainingornot.
␣␣␣␣␣␣argmax_preds:Boolean.Whethertotaketheargmaxorsoftmaxtocompute
␣␣␣␣␣␣␣␣thepredicitivedistribution.
␣␣␣␣␣␣seed:PythonintegerorscalarTensorforseedingtherandomnumber
␣␣␣␣␣␣␣␣generator.Onlyneedsbesetiftrain_mc_samples<
␣␣␣␣␣␣␣␣max_mc_samples_in_memoryasthiswillrecomputetheMonte-Carlosamples
␣␣␣␣␣␣␣␣onthebackwardpasstosavememory.Theseedensuresthatthesamples
␣␣␣␣␣␣␣␣onthebackwardpassarethesameasontheforwardpass.Ifsetthe
␣␣␣␣␣␣␣␣seedshouldbeuniquetoeachcalltothismethod.IfNone,allrandom
␣␣␣␣␣␣␣␣operationsactasiftheyareunseeded.
␣␣␣␣Returns:
␣␣␣␣␣␣AtupleofTensors(probs,log_probs,predictive_variance)ofthe
␣␣␣␣␣␣predictiveandlogpredictivedistributionandtheestimatedvarianceof
␣␣␣␣␣␣thepredictivedistribution.
␣␣␣␣"""
    with tf.name_scope(self._name):
      eps = 1e-10
      locs = self._compute_loc_param(inputs)
      scale = self._compute_scale_param(inputs)
      scale = tf.maximum(scale, eps)
      if training:
        total_mc_samples = self._train_mc_samples
      else:
        total_mc_samples = self._test_mc_samples
      use_argmax = argmax_preds and not training
      probs_mean, seeds_list, samples_list = self._compute_predictive_dist(
          locs, scale, total_mc_samples, seed, use_argmax)
      pred_variance = None
      if self._compute_pred_variance:
        pred_variance = self._compute_predictive_variance(
            probs_mean, locs, scale, seeds_list, samples_list, use_argmax)
      probs_mean = tf.clip_by_value(probs_mean, eps, 1.0 - eps)
      log_probs = tf.math.log(probs_mean)
      return probs_mean, log_probs, pred_variance
class MCSoftmaxDense(MCSoftmaxOutputLayer):
  """MonteCarloestimationofsoftmaxapproxtoheteroscedasticpredictions."""
  def __init__(self, num_classes, logit_noise=het_lib_utils.LogitNoise.NORMAL,
               temperature=1.0, train_mc_samples=1000, test_mc_samples=1000,
               max_mc_samples_in_memory=1000, loc_regularizer=None,
               compute_pred_variance=False, name='MCSoftmaxDense'):
    """Creates an instance of MCSoftmaxDense.

    This is a MC softmax heteroscedastic drop in replacement for a
    tf.keras.layers.Dense output layer. e.g. simply change:

      logits = tf.keras.layers.Dense(...)(x)

    to

      _, logits, _ = MCSoftmaxDense(...)(x)

    Args:
      num_classes: Integer. Number of classes for classification task.
      logit_noise: LogitNoise instance. The noise distribution
        assumed on the softmax logits. Possible values:
        LogitNoise.NORMAL, LogitNoise.LOGISTIC, LogitNoise.GUMBEL.
      temperature: Float or scalar `Tensor` representing the softmax
        temperature.
      train_mc_samples: The number of Monte-Carlo samples used to estimate the
        predictive distribution during training.
      test_mc_samples: The number of Monte-Carlo samples used to estimate the
        predictive distribution during testing/inference.
      max_mc_samples_in_memory: When estimating the predictive distribution a
        `Tensor` of shape [batch_size, max_mc_samples_in_memory, num_classes]
        will be computed. Set max_mc_samples_in_memory as high as possible for
        efficient computation but low enough such that OOM errors are avoided.
      loc_regularizer: Regularizer function applied to the kernel weights
        matrix of the fully connected layer computing the location parameter of
        the distribution on the logits.
      compute_pred_variance: Boolean. Whether to estimate the predictive
        variance. If False the __call__ method will output None for the
        predictive_variance tensor.
      name: String. The name of the layer used for name scoping.

    Returns:
      MCSoftmaxDense instance.
    """
    assert num_classes >= 2
    super(MCSoftmaxDense, self).__init__(
        num_classes, logit_noise=logit_noise, temperature=temperature,
        train_mc_samples=train_mc_samples, test_mc_samples=test_mc_samples,
        max_mc_samples_in_memory=max_mc_samples_in_memory,
        compute_pred_variance=compute_pred_variance, name=name)
    self._loc_layer = tf.keras.layers.Dense(
        1 if num_classes == 2 else num_classes, activation=None,
        kernel_regularizer=loc_regularizer, name='loc_layer')
    self._scale_layer = tf.keras.layers.Dense(
        1 if num_classes == 2 else num_classes,
        activation=tf.abs, name='scale_layer')
  def _compute_loc_param(self, inputs):
    """Computeslocationparameterofthe"logits distribution".
␣␣␣␣Args:
␣␣␣␣␣␣inputs:Tensor.Theinputtotheheteroscedasticoutputlayer.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensorofshape[batch_size,...,num_classes].
␣␣␣␣"""
    return self._loc_layer(inputs)
  def _compute_scale_param(self, inputs):
    """Computesscaleparameterofthe"logits distribution".
␣␣␣␣Args:
␣␣␣␣␣␣inputs:Tensor.Theinputtotheheteroscedasticoutputlayer.
␣␣␣␣Returns:
␣␣␣␣␣␣Tensorofshape[batch_size,...,num_classes].
␣␣␣␣"""
    return self._scale_layer(inputs)
Listing 1: heteroscedastic_lib.py
"""Utilityfunctionsforheteroscedastic_lib.py."""
import enum
import random
import numpy as np
from scipy.stats import norm
import tensorflow.compat.v2 as tf
import tensorflow_probability as tfp
class LogitNoise(enum.Enum):
  NORMAL = 1
  LOGISTIC = 2
  GUMBEL = 3
def inverse_gumbel_cdf(x):
  """ComputestheinverseoftheGumbeldistribution’sCDF.
␣␣Args:
␣␣␣␣x:Tensor.ValuewhichisdistributedGumbel.
␣␣Returns:
␣␣␣␣Tensor.TheinverseoftheGumbelCDF.
␣␣"""
  return -tf.math.log(-tf.math.log(x))
def log_gumbel_cdf(x, smallest_value=2.225e-307):
  """ComputesthelogoftheGumbeldistribution’sCDF.
␣␣Args:
␣␣␣␣x:Tensor.ValuewhichisdistributedGumbel.
␣␣␣␣smallest_value:Float.Smallestpositivefloat64value.
␣␣Returns:
␣␣␣␣Tensor.ThelogoftheGumbelCDF.
␣␣"""
  smallest_value = tf.constant(smallest_value, dtype=tf.float64)
  x = tf.maximum(x, inverse_gumbel_cdf(smallest_value * 1e10))
  return -tf.exp(-x)
def log_normal_cdf(x):
  """ComputesthelogoftheNormaldistribution’sCDF.
␣␣Args:
␣␣␣␣x:Tensor.ValuewhichisdistributedNormal.
␣␣Returns:
␣␣␣␣Tensor.ThelogNormalCDF.
␣␣"""
  dist = tfp.distributions.Normal(loc=tf.constant(0.0, dtype=tf.float64),
                                  scale=tf.constant(1.0, dtype=tf.float64))
  return dist._log_cdf(x)  # pylint: disable=protected-access
def log_sum_exp(x):
  """Calculatesthelog_sum_expoverthelastdimensionofaTensor.
␣␣Args:
␣␣␣␣x:Tensor.
␣␣Returns:
␣␣␣␣Tensor.Ifxisofshape[batch,...,d]thenreturnedTensorisofshape
␣␣␣␣[batch,...,1].
␣␣"""
  max_arg = tf.reduce_max(x, -1)
  m_arg_same_dims = tf.reduce_max(x, -1, keepdims=True)
  return max_arg + tf.math.log(tf.reduce_sum(tf.exp(x - m_arg_same_dims), -1))
def gen_int_seed():
  return random.randrange(2**63 - 1)
def gen_tensor_seed():
  return tf.py_function(gen_int_seed, [], tf.int64)
Listing 2: heteroscedastic_lib_utils.py