
It Is Likely That Your Loss Should be a Likelihood

by   Mark Hamilton, et al.

We recall that certain common losses are simplified likelihoods and instead argue for optimizing full likelihoods that include their parameters, such as the variance of the normal distribution and the temperature of the softmax distribution. Joint optimization of likelihood and model parameters can adaptively tune the scales and shapes of losses and the weights of regularizers. We survey and systematically evaluate how to parameterize and apply likelihood parameters for robust modeling and re-calibration. Additionally, we propose adaptively tuning L_2 and L_1 weights by fitting the scale parameters of normal and Laplace priors and introduce more flexible element-wise regularizers.





1 Introduction

Choosing the right loss matters. Many common losses arise from likelihoods, such as the squared error loss from the normal distribution and the cross-entropy loss from the softmax distribution. The same is true of regularizers, where $L_2$ arises from a normal prior and $L_1$ from a Laplace prior. Deriving losses from likelihoods turns the choice of loss into a choice of distribution, which can be informed by data noise and error tolerance. Standard losses and regularizers implicitly fix key distribution parameters, limiting flexibility. For instance, the squared error corresponds to fixing the normal variance at one. This work examines how to jointly optimize distribution parameters with model parameters to select losses and regularizers that encourage generalization, calibration, and robustness to outliers. We explore three key likelihoods: the normal, the softmax, and the robust regression likelihood (barron2019loss). Additionally, we cast adaptive priors in the same light and introduce adaptive regularizers. In summary:

  1. We systematically survey and evaluate global, data, and predicted likelihood parameters for robustness, adaptive regularization, and calibration.

  2. We propose adaptive normal and Laplace priors on the model parameters.

  3. We show that predicted likelihood parameters are efficient and effective.

2 Background

Notation  We consider a dataset $\mathcal{D}$ of points $x_i$ and targets $y_i$ indexed by $i \in \{1, \dots, N\}$. Targets for regression are real numbers and targets for classification are one-hot vectors. The model $f$ with parameters $\theta$ makes predictions $\hat{y}_i = f_\theta(x_i)$. A loss $L(\hat{y}, y)$ measures the quality of the prediction given the target. To learn model parameters we solve the following loss optimization:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \; L(\hat{y} = f_\theta(x), y)$$

A likelihood $\mathcal{L}(\hat{y} \mid y, \phi)$ measures the quality of the prediction as a distribution over $\hat{y}$ given the target $y$ and likelihood parameters $\phi$. We use $\ell = -\log \mathcal{L}$ to denote the negative log-likelihood (NLL), and refer to the NLL and the likelihood interchangeably since both have the same optima. We define the full likelihood optimization:

$$\min_{\theta, \phi} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \; \ell(\hat{y} = f_\theta(x) \mid y, \phi)$$

to jointly learn model and likelihood parameters. “Full” indicates the inclusion of $\phi$, which controls the distribution and the induced NLL loss. We focus on full likelihood optimization in this work. We note that the target, $y$, is the only supervision needed to optimize model and likelihood parameters, $\theta$ and $\phi$ respectively. Additionally, though the shape and scale vary with $\phi$, reducing the error always reduces the NLL for our distributions.
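As a minimal sketch of joint optimization in its simplest setting, the code below fits a constant prediction `mu` and a global variance to scalar targets by gradient descent on the average normal NLL. The data, the names, the log-variance parameterization, and the step size are illustrative assumptions, not the experimental setup of this work.

```python
import math

def fit_normal(targets, lr=0.1, steps=10000):
    """Jointly fit a constant prediction and a global variance by gradient descent."""
    mu, log_var = 0.0, 0.0               # the log-variance keeps the variance positive
    n = len(targets)
    for _ in range(steps):
        var = math.exp(log_var)
        # Average NLL: 0.5 * log(2 * pi * var) + (mu - y)^2 / (2 * var)
        grad_mu = sum((mu - y) / var for y in targets) / n
        grad_log_var = sum(0.5 - (mu - y) ** 2 / (2.0 * var) for y in targets) / n
        mu -= lr * grad_mu
        log_var -= lr * grad_log_var
    return mu, math.exp(log_var)

targets = [1.0, 2.0, 3.0, 4.0, 10.0]
mu, var = fit_normal(targets)
```

In this toy case the prediction converges to the sample mean and the variance to the mean squared residual, so the only supervision used is the targets themselves.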

Figure 1:

Normal PDF (left) and NLL (right). Optimizing likelihood parameters adapts the loss without manual hyperparameter tuning to balance accuracy and certainty.

Distributions Under Investigation  This work considers the normal likelihood with variance $\sigma^2$ (bishop2006pattern; hastie2009elements), the softmax likelihood with temperature $\tau$ (hinton2015distilling), and the robust likelihood $\rho$ (barron2019loss) with shape $\alpha$ and scale $\sigma$, which control the shape and scale of the likelihood. Figure 1 shows how $\sigma$ affects the normal distribution and its NLL.

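The normal likelihood has terms for the residual $\hat{y} - y$ and the variance $\sigma^2$ as

$$\mathcal{N}(\hat{y} \mid y, \sigma) = (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(\hat{y} - y)^2}{2\sigma^2}\right)$$

with $\sigma$ scaling the distribution. The squared error is recovered, up to a constant factor and offset, from the corresponding NLL with $\sigma = 1$.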
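To make the link between the normal NLL and the squared error concrete, here is a small sketch, assuming the standard normal density with mean at the target. The function names are ours and purely illustrative.

```python
import math

def normal_nll(y_hat, y, sigma=1.0):
    """Negative log-likelihood of the prediction y_hat under N(y, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y_hat - y) ** 2 / (2.0 * sigma ** 2)

def half_squared_error(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

# With sigma = 1 the NLL differs from half the squared error only by a constant.
const = 0.5 * math.log(2.0 * math.pi)
gap = normal_nll(3.0, 1.0) - half_squared_error(3.0, 1.0)

# A larger sigma flattens the loss, so a large residual is penalized far less.
nll_tight = normal_nll(10.0, 0.0, sigma=1.0)
nll_wide = normal_nll(10.0, 0.0, sigma=5.0)
```

The log term penalizes large `sigma`, which is what stops the variance from growing without bound when it is optimized.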

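The softmax defines a categorical distribution from scores $z$ for each class $c$ as

$$\mathrm{softmax}(\hat{y} = y \mid z, \tau) = \frac{e^{z_y / \tau}}{\sum_c e^{z_c / \tau}}$$

with the temperature, $\tau$, adjusting the entropy of the distribution. The softmax NLL is $\ell = -z_y / \tau + \log \sum_c e^{z_c / \tau}$. The classification cross-entropy loss, $-z_y + \log \sum_c e^{z_c}$, is recovered by substituting $\tau = 1$ in the NLL. We state the gradients of these likelihoods with respect to their $\sigma$ and $\tau$ in Section A of the supplement.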
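The effect of temperature on the softmax can be sketched as follows. This assumes the common convention in which the scores are divided by the temperature; the helper names are illustrative.

```python
import math

def softmax(z, tau=1.0):
    """Softmax over scores z with temperature tau (assumed convention: z / tau)."""
    scaled = [v / tau for v in z]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_nll(z, label, tau=1.0):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(z, tau)[label])

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

scores = [2.0, 1.0, 0.1]
sharp = softmax(scores, tau=1.0)
flat = softmax(scores, tau=4.0)              # higher temperature, higher entropy
```

With `tau=1.0`, `softmax_nll` is the usual cross-entropy for a one-hot target.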

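The robust loss and its likelihood are

$$\rho(x, \alpha, \sigma) = \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x / \sigma)^2}{|\alpha - 2|} + 1 \right)^{\alpha / 2} - 1 \right), \qquad p(\hat{y} \mid y, \alpha, \sigma) = \frac{1}{\sigma Z(\alpha)} \exp\big(-\rho(\hat{y} - y, \alpha, \sigma)\big)$$

with shape $\alpha$, scale $\sigma$, and normalization function $Z(\alpha)$. This likelihood generalizes the normal, Cauchy, and Student’s t distributions.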
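The following sketch evaluates the general robust loss of (barron2019loss) in its published form, with the two removable singularities handled as explicit limits. The parameter name `scale` and the omission of the normalization function (which has no simple closed form) are choices made for this illustration.

```python
import math

def robust_loss(x, alpha, scale=1.0):
    """General robust loss for residual x with shape alpha and a positive scale."""
    sq = (x / scale) ** 2
    if alpha == 2.0:                         # limit: half the scaled squared error
        return 0.5 * sq
    if alpha == 0.0:                         # limit: Cauchy (Lorentzian) loss
        return math.log(0.5 * sq + 1.0)
    a = abs(alpha - 2.0)
    return (a / alpha) * ((sq / a + 1.0) ** (alpha / 2.0) - 1.0)

# Lower shape values grow more slowly in the residual, hence are more robust.
losses = [robust_loss(100.0, alpha) for alpha in (0.0, 1.0, 2.0)]
```

Optimizing `alpha` jointly with the model therefore selects how strongly large residuals are discounted.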

3 Related Work

Likelihood optimization follows from maximum likelihood estimation (hastie2009elements; bishop2006pattern), yet is uncommon in practice for fitting deep regressors and classifiers for discriminative tasks. However, (kendall2017uncertainties; kendall2018multi; barron2019loss; saxena2019data) optimize likelihood parameters to their advantage, yet differ in their tasks, likelihoods, and parameterizations. In this work we aim to systematically experiment, clarify usage, and encourage wider adoption of likelihood parameters.

Early work on regressing means and variances (nix1994estimating) had the key insight that optimizing the full likelihood can fit these parameters and adapt the loss. Some recent works use likelihoods for loss adaptation, and interpret their parameters as the uncertainty (kendall2017uncertainties; kendall2018multi), robustness (kendall2017uncertainties; barron2019loss; saxena2019data), and curricula (saxena2019data) of losses. (barron2019loss) define a generalized robust regression loss, $\rho$, to jointly optimize the type and degree of robustness with global, data-independent parameters. (kendall2017uncertainties) predict variances for regression and classification to handle data-dependent uncertainty. (kendall2018multi) balance multi-task loss weights by optimizing variances for regression and temperatures for classification. These global parameters depend on the task but not the data, and are interpreted as inherent task uncertainty. (saxena2019data) define a differentiable curriculum for classification by assigning each training point its own temperature. These data parameters depend on the index of the data but not its value. We compare these different likelihood parameterizations across tasks and distributions.

Figure 2: An image classifier with three different types of temperature conditioning: global, model, and data.
Figure 3: Performance of $L_2$ (left) and $L_1$ (middle) regularized linear regression on a 500-dimensional synthetic dataset where the true parameters are known. Dynamic Ridge (D-Ridge) and D-LASSO regression find the regularization strength that best estimates the true parameters. Multi-LASSO (M-LASSO) outperforms any single global regularization strength and does not shrink informative weights. (right) Performance of adaptive $L_1$ regularization methods as a function of true model sparsity. In all cases, M-LASSO outperforms other methods by orders of magnitude.

4 Likelihood Parameter Types

We explore the space of likelihood parameter representations for model optimization and inference. There are two key choices to make in representing $\phi$: conditioning and dimensionality.

Conditioning  We represent the likelihood parameters by three functional classes: global, data, and predicted. Global parameters, $\phi$, are independent of the data and model and define the same likelihood distribution for all points. Data parameters, $\phi_i$, are conditioned on the index, $i$, of the data, $x_i$, but not its value. Every training point is assigned an independent likelihood parameter, $\phi_i$, which defines a different likelihood for each training point. Predicted parameters, $\phi(x) = g_\eta(x)$, are determined by a model, $g$, with parameters $\eta$ (not to be confused with the task model parameters $\theta$). Global and predicted parameters can be used during training and testing, but data parameters are only assigned to each training point and are undefined for testing. We show a simple example of predicted temperature in Figure 4, and an illustration of the parameter types in Figure 2.
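The three conditioning types can be sketched as three small callables that return a positive likelihood parameter. The class names, the softplus positivity constraint, and the linear form of the predicted variant are assumptions made for illustration.

```python
import math

def softplus(u):
    """Numerically stable softplus, used here to keep parameters positive."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

class GlobalParam:
    """One parameter shared by every point, at training and at test time."""
    def __init__(self, raw=0.0):
        self.raw = raw
    def __call__(self, index=None, x=None):
        return softplus(self.raw)

class DataParam:
    """One parameter per training index; undefined for unseen points."""
    def __init__(self, n_train):
        self.raw = [0.0] * n_train
    def __call__(self, index=None, x=None):
        if index is None:
            raise ValueError("data parameters are only defined for training indices")
        return softplus(self.raw[index])

class PredictedParam:
    """A linear regressor from the value of the input x to the parameter."""
    def __init__(self, weights, bias=0.0):
        self.weights, self.bias = weights, bias
    def __call__(self, index=None, x=None):
        return softplus(sum(w * v for w, v in zip(self.weights, x)) + self.bias)
```

Only `DataParam` needs the index of a point, which is why it cannot be evaluated on test data, while `PredictedParam` depends on the value of the input and applies to any point.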

Dimensionality  The dimensionality, $|\phi|$, of likelihood parameters can vary with the dimension of the task prediction, $\hat{y}$. For example, for image regression one can use a single likelihood parameter for each image ($|\phi| = 1$), each RGB image channel ($|\phi| = C$), or even every pixel ($|\phi| = W \times H \times C$). Dimensionality and conditioning of likelihood parameters can interact. For example, data parameters with $|\phi| = W \times H \times C$ would result in $N \times W \times H \times C$ additional parameters, where $N$ is the size of the dataset. This can complicate implementations and slow down optimization due to disk I/O when their size exceeds memory. Table 7 in the appendix contrasts the computational requirements of different likelihood parameter types.

Figure 4:

A synthetic logistic regression experiment. Regressing softmax temperature reduces the influence of outliers (blue, bottom-left), by locally raising temperature. The jointly optimized model achieves a more accurate classification.

Param. Dim MSE Time Mem
Global 225.8 1.04 KB
Data 244.2 2.70 GB
Pred. 228.5 1.04 MB
Global 231.1 1.08 MB
Data 252.6 9.42 GB
Pred. 222.3 1.08 MB
Table 1: MSE, time, and memory increase (compared to the standard normal likelihood) for reconstruction by variational auto-encoders with different parameterizations of the robust loss, $\rho$. Predicted likelihood parameters yield more accurate reconstruction models.

5 Applications

5.1 Adaptive Regularization with Prior Parameters

We propose adaptive regularizers for the model parameters $\theta$ that optimize prior distribution parameters and thereby tune the degree of regularization. The normal (Ridge, $L_2$) and Laplace (LASSO, $L_1$) priors, with scale parameters $\sigma$ and $b$, regularize model parameters for small magnitude and sparsity respectively (hastie2009elements). The degree of regularization, $\lambda$, is a hyperparameter of the regularized loss function, shown here for the $L_1$ case:

$$\min_{\theta} \; \sum_{i=1}^{N} \ell(\hat{y}_i = f_\theta(x_i) \mid y_i) + \lambda \|\theta\|_1$$

We note that $\lambda$ cannot be chosen by optimizing this objective directly, because it admits a trivial minimum at $\lambda = 0$. In the linear case, one can select this weight efficiently using Least Angle Regression (efron2004least). In general, however, $\lambda$ is usually selected through expensive cross-validation. Instead, we retain the prior with its scale parameter, and jointly optimize over the full likelihood:

$$\min_{\theta, \phi, b} \; \sum_{i=1}^{N} \ell(\hat{y}_i = f_\theta(x_i) \mid y_i, \phi) + \frac{1}{b} \|\theta\|_1 + d \log(2b)$$

where $d$ is the number of model parameters. This approach, the Dynamic LASSO (D-LASSO), admits no trivial solution for the prior parameter $b$, which must balance the effective regularization strength, $1/b$, against the normalization term, $d \log(2b)$. D-LASSO allows the degree of regularization to be selected by gradient descent, rather than by expensive black-box search. In Figure 3 (left) and (middle) we show that this approach, and its Ridge equivalent, yield ideal settings of the regularization strength. Figure 3 (right) shows that D-LASSO converges to the best LASSO regularization strength for a variety of true-model sparsities. As a further extension, we replace the global $\sigma$ or $b$ with a $\sigma_j$ or $b_j$ for each $\theta_j$ to locally adapt regularization (Multi-LASSO). This consistently outperforms any global setting of the regularization strength and shields important weights from undue shrinkage (Figure 3, middle).
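The balance described above can be checked numerically. The sketch below assumes independent zero-mean Laplace priors with a shared scale `b`, whose NLL over the weights is the sum of absolute weights divided by `b` plus a log-normalizer per weight. The weights are made-up values, and the function names are illustrative.

```python
import math

def laplace_prior_nll(theta, b):
    """NLL of weights theta under independent Laplace(0, b) priors."""
    return sum(abs(t) for t in theta) / b + len(theta) * math.log(2.0 * b)

def optimal_scale(theta):
    """Closed-form minimizer of the prior NLL over b for fixed weights."""
    return sum(abs(t) for t in theta) / len(theta)

def plain_l1_penalty(theta, lam):
    """Standard penalty without the normalizer: minimized trivially at lam = 0."""
    return lam * sum(abs(t) for t in theta)

theta = [0.5, -1.5, 0.0, 2.0]                # hypothetical fixed, nonzero weights
b_star = optimal_scale(theta)                # mean absolute weight
```

For fixed nonzero weights the full prior NLL has an interior minimum at `b_star`, whereas the plain penalty is always smallest at a weight of zero.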

5.2 Robust Modeling

Data in the wild is noisy, and machine learning models need to cope with input and label corruption. The standard mean squared error (MSE) loss is highly susceptible to outliers due to its fixed variance (huber2004robust). Likelihood parameters transform standard methods into robust methods without the expensive outer loop of model fitting required by methods such as RANSAC (fischler1981random) and Theil-Sen (theil1992rank). Figure 4 demonstrates this on a simple classification dataset, and Figure 6 in the appendix compares our approach to other robust regressors. Likelihood parameters yield comparable results at a fraction of the cost. Furthermore, regressing variance improves accuracy as measured by MSE and NLL on the test set, and reduces the miscalibration (CAL) of the resulting regressors. Tables 3 and 4 in the appendix show this for linear and deep regressors on a suite of datasets used in (kuleshov2018accurate).

Often one must consider broader classes of likelihoods such as the general robust loss, $\rho$ (barron2019loss). We reproduce the variational auto-encoding (VAE) (kingma2014adam) experiments from (barron2019loss) on faces from the CelebA dataset (liu2015faceattributes), and compare with our model-based (predicted) likelihood parameters and with data parameters (saxena2019data). We explore the two natural choices of parameter dimensionality: a single set of parameters for the whole image, and a set of parameters for each pixel and channel. More details on experimental conditions, datasets, and models are provided in Sections J and K in the appendix.

5.3 Re-calibration

The work of (guo2017calibration) shows that modern networks are accurate, yet systematically overconfident, a phenomenon called miscalibration. We investigate the role of optimizing likelihood parameters to re-calibrate models. (guo2017calibration) introduce Global Scaling (GS), which re-calibrates classifiers with a learned global parameter, $\tau$, in the loss function: $\mathrm{softmax}(z / \tau)$. (kuleshov2018accurate) use an isotonic regressor to correct the confidence of regressors. Vector Scaling (VS) (guo2017calibration), a multivariate generalization of Platt scaling (platt1999probabilistic), learns a vector, $v$, to re-weight logits:

$$\mathrm{softmax}(v \odot z)$$

Using likelihood parameters we introduce three new re-calibration methods. Linear Scaling (LS) learns a linear mapping, $l$, to transform logits, $z$, to a softmax temperature: $\tau = l(z)$. Linear Feature Scaling (LFS) learns a linear mapping, $l$, to transform the features prior to the logits, $h$, to a softmax temperature: $\tau = l(h)$. Finally, Deep Scaling (DS) learns a nonlinear network, $N$, to transform these features, $h$, into a temperature: $\tau = N(h)$.
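A forward pass of Linear Scaling might look like the sketch below. It assumes that the temperature divides the logits and that a softplus keeps the predicted temperature positive; the weights and logits are made-up values, not fitted ones.

```python
import math

def softplus(u):
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def softmax(z, tau=1.0):
    scaled = [v / tau for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_scaling(z, w, b):
    """Predict a positive temperature from the logits, then re-scale them."""
    tau = softplus(sum(wi * zi for wi, zi in zip(w, z)) + b)
    return tau, softmax(z, tau)

logits = [4.0, 1.0, 0.0]
base = softmax(logits)
tau, calibrated = linear_scaling(logits, w=[0.5, 0.0, 0.0], b=0.0)
```

Because every logit is divided by the same positive temperature, the predicted class is unchanged and only the confidence moves, which is why such methods can re-calibrate without affecting accuracy.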

Table 2 evaluates the effectiveness of these approaches on recalibrating deep vision architectures across a variety of datasets as measured by the Expected Calibration Error (ECE) (guo2017calibration). LS and LFS tend to outperform both GS and VS. This demonstrates that richer likelihood parametrizations can improve calibration akin to how richer models can improve prediction. We choose GS as a baseline because (guo2017calibration) show it outperforms Vector Scaling, Bayesian Binning into Quantiles, Histogram Binning, and Isotonic Regression (zadrozny2001obtaining; zadrozny2002transforming; naeini2015obtaining). In Table 5 and Figure 8 in the appendix we show that these findings extend to regressors.

Model Dataset Base GS VS LS LFS
RN50 CIFAR-10 .250 .046 .037 .018 .018
RN50 CIFAR-100 .642 .035 .044 .030 .173
RN50 SVHN .072 .029 .022 .009 .009
RN50 ILSVRC .430 .019 .023 .026 .015
DN121 CIFAR-10 .253 .039 .034 .028 .028
DN121 CIFAR-100 .537 .024 .024 .014 .031
DN121 SVHN .079 .022 .017 .011 .010
DN121 ILSVRC .229 .021 .019 .043 .019
Table 2: Comparison of calibration methods by ECE for ResNet-50 (RN50) and DenseNet-121 (DN121) architectures on test data. In most cases our predicted likelihood parameter methods, Linear Scaling (LS) and Linear Feature Scaling (LFS), outperform global scaling (GS) as well as vector scaling (VS). In all cases our methods reduce miscalibration relative to the uncalibrated baseline, with computation time comparable to GS.
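For reference, the ECE metric reported in Table 2 can be sketched as below: predictions are grouped into equal-width confidence bins, and the gap between accuracy and mean confidence is averaged with weights proportional to bin size. The bin count of 10 is an assumption of this sketch, not a setting taken from the experiments.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence over equal-width bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        members = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not members:
            continue
        accuracy = sum(correct[i] for i in members) / len(members)
        confidence = sum(confidences[i] for i in members) / len(members)
        ece += len(members) / n * abs(accuracy - confidence)
    return ece

# Four predictions at 95% confidence, of which three are right: overconfident by 0.2.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
```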

6 Conclusion

Optimizing the full likelihood can improve model quality by adapting losses and regularizers. Full likelihoods are agnostic to the architecture, optimizer, and task, which makes them simple substitutes for standard losses. Global, data, and predicted likelihood parameters offer different degrees of expressivity and efficiency. In particular, predicted parameters adapt the likelihood to each data point during training and testing without significant time and space overhead. More generally, we hope this work encourages joint optimization of model and likelihood parameters, and argue it is likely that your loss should be a likelihood.



Appendix A Gradient Optimization of Variance and Temperature

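For completeness, we state the derivative of the normal NLL with respect to the variance (9) and the derivative of the softmax NLL with respect to the temperature (10). The normal NLL is well-known, and its gradient w.r.t. the variance was first used to fit networks for heteroskedastic regression (nix1994estimating) and mixture modeling (bishop1994mixture). The softmax with temperature is less widely appreciated, and we are not aware of a reference for its gradient w.r.t. the temperature. For this derivative, recall that the softmax is the gradient of $\mathrm{logsumexp}$, and see (4).

$$\frac{\partial}{\partial \sigma^2} \left[ \frac{(\hat{y} - y)^2}{2\sigma^2} + \frac{1}{2} \log(2\pi\sigma^2) \right] = \frac{\sigma^2 - (\hat{y} - y)^2}{2\sigma^4} \tag{9}$$

$$\frac{\partial}{\partial \tau} \left[ -\frac{z_y}{\tau} + \log \sum_c e^{z_c / \tau} \right] = \frac{1}{\tau^2} \Big( z_y - \sum_c p_c z_c \Big), \qquad p_c = \frac{e^{z_c / \tau}}{\sum_{c'} e^{z_{c'} / \tau}} \tag{10}$$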
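These two derivatives can be sanity-checked against central finite differences. The sketch assumes the normal NLL $\frac{r^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$ for residual $r$ and a softmax whose scores are divided by the temperature; under a different temperature convention the second gradient would change form.

```python
import math

def normal_nll(r, var):
    return r * r / (2.0 * var) + 0.5 * math.log(2.0 * math.pi * var)

def normal_grad_var(r, var):
    """Analytic derivative of the normal NLL with respect to the variance."""
    return (var - r * r) / (2.0 * var * var)

def softmax_nll(z, y, tau):
    m = max(v / tau for v in z)
    return -z[y] / tau + m + math.log(sum(math.exp(v / tau - m) for v in z))

def softmax_grad_tau(z, y, tau):
    """Analytic derivative of the softmax NLL with respect to the temperature."""
    m = max(v / tau for v in z)
    exps = [math.exp(v / tau - m) for v in z]
    total = sum(exps)
    expected_score = sum(e / total * v for e, v in zip(exps, z))
    return (z[y] - expected_score) / tau ** 2

def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2.0 * h)

z = [2.0, -1.0, 0.5]
numeric_var = central_diff(lambda v: normal_nll(1.5, v), 0.7)
numeric_tau = central_diff(lambda t: softmax_nll(z, 0, t), 1.3)
```

Both gradients vanish when the parameter matches the data: the variance gradient at a variance equal to the squared residual, and the temperature gradient when the true score equals the expected score.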


Appendix B Understanding Where Uncertainty is Modelled

For classifiers with a parametrized temperature, models have the “choice” to store uncertainty information either in the model, which can reduce the size of the logit vector, $z$, or in the likelihood parameters, which can scale the temperature parameter, $\tau$. Note that for regressors, this information can only be stored in $\phi$. The fact that uncertainty information is split between the model and likelihood can sometimes make it difficult to interpret temperature as the sole detector of outliers. In particular, if the uncertainty parameters, $\phi$, train slower than the model parameters, the network might find it advantageous to move critical uncertainty information to the model. This effect is illustrated by Figure 5, which shows that with data parameters on a large dataset, the uncertainty required to detect outliers moves into the model.

Figure 5: Area under the PR curve of softmax temperature as a data corruption classifier on CIFAR-10 with synthetic corruptions. Data temperatures (DT) train slower than model temperatures (MT), hence some uncertainty is modelled by the task model.

Appendix C Robust Tabular Regression

We note that the approaches of Section 5.2 also apply to linear and deep regressors. Tables 3 and 4 show these findings respectively.

Dataset CAL (Base) CAL (Temp) MSE (Base) MSE (Temp) NLL (Base) NLL (Temp)
crime 0.097 0.007 0.057 0.020 0.528 17.497
kin. 0.022 0.007 0.033 0.016 4.749 0.227
bank 0.273 0.002 0.095 0.009 0.225 -1.045
wine 0.013 0.002 0.031 0.010 7.707 0.332
mpg 0.093 0.016 0.057 0.030 0.217 -0.183
cpu 0.359 0.129 0.111 0.044 0.390 -2.812
soil 0.122 0.023 0.064 0.026 1.309 -2.427
fried 0.077 0.001 0.051 0.006 2.624 -0.195
Table 3: Effect of model-based likelihood parameters on calibration (CAL) (kuleshov2018accurate), MSE, and NLL evaluated on unseen data for linear regressors.

Dataset CAL (Base) CAL (Temp) MSE (Base) MSE (Temp) NLL (Base) NLL (Temp)
crime 0.018 0.146 0.028 0.088 220.1 478.4
kin. 0.122 0.001 0.064 0.006 0.061 -0.471
bank 0.417 0.001 0.127 0.008 1.361 -1.464
wine 0.023 0.003 0.032 0.011 1.614 0.302
mpg 0.208 0.006 0.083 0.020 0.307 5.050
cpu 0.554 0.021 0.150 0.022 -0.102 12.218
soil 0.602 0.307 0.160 0.100 -0.131 -4.078
fried 0.472 0.000 0.129 0.002 0.301 -1.039
Table 4: Effect of predicted likelihood parameters on Calibration (CAL) (kuleshov2018accurate), MSE, and NLL evaluated on test data for deep regressors. Linear results are in Table 3 of the supplement.

In addition to investigating how parametrized likelihoods affect deep models, we perform the same comparisons and experiments for linear models. In this domain the same properties hold and, because of the limited number of parameters, these models often benefit significantly more from likelihood parameters. In Figure 6 we show that parametrizing the Gaussian scale leads to robustness properties similar to those of the Theil-Sen, RANSAC, and Huber methods at a fraction of the computational cost. Furthermore, we note that likelihood parameters could also be applied to Huber regression to adjust its scale, as is the case for the normal variance, $\sigma^2$. In Table 3 we show that adding a learned normal scale regressor to linear regression can improve calibration and generalization.

Figure 6:

Comparison of robust regression methods. The choice of likelihood parameter conditioner has a significant effect on the inductive bias of outlier detection. Data Parameter Least Squares (DPLS) does not yield the appropriate inductive bias on this particular dataset, whose outliers deviate strongly in X. The XY-conditioned Model Parameters (MPLS-XY) induce a helpful bias that rivals the performance of RANSAC at a fraction of the fitting time.

Appendix D Visualizing Predicted Shape and Scale for Auto-Encoding

In Section 5.2 we experiment with variational auto-encoding by the generalized robust loss $\rho$. The likelihood corresponding to $\rho$ has parameters for shape ($\alpha$) and scale ($\sigma$). When these parameters are predicted, by regressing them as part of the model, they can vary with the input to locally adapt the loss. In Figure 7 we visualize the regressed shape and scale of each pixel for images in the CelebA validation set. Note that only predicted likelihood parameters can vary in this way, since data parameters are not defined during testing.

Figure 7: With predicted likelihood parameters, the shape and scale of the distribution (and hence the loss) are adapted to each pixel by regression from the input. The input, its reconstruction, and the predicted shape ($\alpha$) and scale ($\sigma$) are shown on CelebA validation images. The likelihood parameters do indeed vary with the input images. Across the whole validation set, the predicted shape is generally in the range and the predicted scale is generally in the range .

Appendix E Calibrating Regressors

Dataset Uncalibrated Isotonic GS LS DS
crime 0.3624 0.3499 0.0693 0.0125 0.0310
kinematics 0.0164 0.0103 0.0022 0.0021 0.0032
bank 0.0122 0.0056 0.0027 0.0024 0.0020
wine 0.0091 0.0108 0.0152 0.0131 0.0064
mpg 0.2153 0.2200 0.1964 0.1483 0.0233
cpu 0.0862 0.0340 0.3018 0.2078 0.1740
soil 0.3083 0.3000 0.3130 0.3175 0.3137
fried 0.0006 0.0002 0.0002 0.0002 0.0002
Table 5: Comparison of regression calibration methods as evaluated by their calibration error as defined in (kuleshov2018accurate). Model-based likelihood parameters often outperform other methods. We additionally perform the experiment on linear models and show the results in Table 6.

The results of Table 5 also hold for linear models, though the improvements are less significant, as linear regressors often do not have the capacity to over-fit and miscalibrate. Table 6 shows these results.

Dataset Uncalibrated Isotonic GS LS DS
crime 0.0099 0.0056 0.1193 0.0104 0.0177
kinematics 0.0138 0.0035 0.0137 0.0137 0.0192
bank 0.0017 0.0034 0.0036 0.0049 0.0055
wine 0.0007 0.0053 0.0015 0.0011 0.0006
mpg 0.0081 0.0300 0.0083 0.0075 0.0239
cpu 0.2743 0.0668 0.2970 0.2812 0.2371
soil 0.0289 0.0200 0.0258 0.0271 0.0244
fried 0.0011 0.0012 0.0011 0.0013 0.0029
Table 6: Comparison of calibration methods on linear models fit to a variety of datasets.

Appendix F Deep Regressors Miscalibrate

The work of (kuleshov2018accurate) establishes isotonic regression as a natural baseline for regressors. In our experiments on classifiers in Table 2, and on regressors in Table 5, we found that additional likelihood modeling capacity for temperature and scale was beneficial for re-calibration across several datasets. This demonstrates that the space of deep network calibration is rich, but with an inductive bias towards likelihood parametrization. We also find that deep regressors suffer from the same over-confidence issue that plagues deep classifiers (guo2017calibration). Figure 8 in the supplement shows this effect. More details on experimental conditions, datasets, and models are provided in Sections J and K in the supplement.

We confirm that deep regressors miscalibrate as a function of layer size, which is analogous to the results for classifiers found by (guo2017calibration). We plot regression calibration error as a function of network layer size for a simple single-layer deep network on a synthetic regression dataset. As layer size grows, the network gains the capacity to over-fit the training data and hence underestimates its own errors. We note that this effect appears on the other datasets reported, and across other network architectures. We posit that overconfidence will occur with any likelihood distribution, as it is a symptom of over-fitting.

Figure 8: Regression calibration error, as defined in (kuleshov2018accurate), by network layer size. Experiments were performed on a synthetic dataset with a one-layer deep ReLU network. This shows that regression overconfidence and miscalibration are a function of a network’s ability to over-fit on data, which agrees with the analogous finding for classifiers (guo2017calibration).


Appendix G Improving Optimization Stability

When modelling softmax temperatures and Gaussian scales with dedicated networks we encountered significant instability due to exploding gradients at low temperature scales and vanishing gradients at high temperature scales. We found it desirable to have a mapping function that is smooth, everywhere differentiable, and bounded slightly above zero to avoid exploding gradients. Though other works use the exponential function to map “temperature logits” to $\tau$, we found that accidental exponential temperature growth could squash gradients. When using exponentiation, a canonical instability arose when the network overstepped towards $\tau = 0$ due to momentum, then swung dramatically towards $\tau = \infty$. There, it lost all gradients and could not recover.

To counteract this behavior we employ a shifted softplus function that is re normalized so :


where $\epsilon$ is a small constant that serves as a smooth lower bound to the temperature. This function decays to $\epsilon$ when $x \to -\infty$, and increases linearly when $x \to \infty$, hence it avoids the vanishing gradient problem. Figure 9 demonstrates the importance of the offset in Equation 11 with ResNet50 on CIFAR-100 calibration. We also found it important to normalize the features before processing them with the layers dedicated to scales and temperatures. For shifted softplus offsets, we frequently employ .
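One function with the properties described in this section is sketched below. The exact form is an assumption of this sketch rather than a transcription of Equation 11: it is a softplus rescaled to equal one at zero, with a hypothetical offset `eps` as its lower bound. The value `eps=0.01` is likewise illustrative.

```python
import math

LN2 = math.log(2.0)

def softplus(u):
    """Numerically stable softplus."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def shifted_softplus(u, eps=0.01):
    """Map an unconstrained value to a temperature bounded below by eps.

    Assumed form: equals 1 at u = 0, decays to eps as u -> -inf,
    and grows linearly as u -> +inf.
    """
    return eps + (1.0 - eps) * softplus(u) / LN2

floor = shifted_softplus(-50.0)              # approaches eps rather than zero
unit = shifted_softplus(0.0)                 # initial temperature of one
slope = shifted_softplus(101.0) - shifted_softplus(100.0)
exp_floor = math.exp(-50.0)                  # the exponential map collapses towards zero
```

Compared with the exponential map, the output can neither collapse to zero nor grow exponentially, which addresses both failure modes described above.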

Figure 9: Learning curves for calibrating ResNet50 on Cifar100 with different softplus offsets. Having a slight offset in the function that transforms logits to temperatures or Gaussian scales greatly improves stability and convergence time. Without this offset, the optimization frequently diverged.

Appendix H Space and Time Characteristics of Likelihood Parameters

Type Space Time
Table 7: Space and time requirements of likelihood parameter types by the number of parameters , training set size , feature dimension , and the computation time for forward , gradient , and storage . Space is measured for all points, while time is measured for one point.

Appendix I Outlier Detection

Figure 10: The data with the lowest (top) and highest (bottom) predicted temperatures in the SVHN dataset. High temperature entries are blurry, cropped poorly, and generally difficult to classify.

Likelihood parameter prediction not only improves optimization, but can also serve useful purposes during inference. In particular, we show that the likelihood parameters can be thresholded to detect outliers at both training and testing time. Auditing the model’s temperature or noise parameters can help practitioners spot erroneous labels and poor quality examples. Figure 10 in the Supplement shows that temperature correlates strongly with blurry, dark and difficult examples on the Street View House Number (SVHN) dataset. If the likelihood parameter network has access to both X and Y, it can also detect errors in labelling, and we found several mis-labelled data in the SVHN dataset with this approach.

The choice of conditioner determines how likelihood parameters can be used at test time. Data parameters are only defined for the training data, and do not apply to test data. This can cause poor calibration and other artifacts during inference. Furthermore, classifiers have a “choice” to store uncertainty in the likelihood parameters, $\phi$, or the task model parameters, $\theta$. On large mini-batched datasets, data parameters receive fewer updates, and uncertainty information moves into the model. This can render data parameters less effective for outlier detection, and we show this effect in Figure 5 of the supplement.

We quantitatively compare the ability of likelihood parameters to separate outliers from uncorrupted data against several existing outlier detection methods. We evaluate the Elliptic Envelope (EE) (rousseeuw1999fast), 1-Class Support Vector Machine (1SVM) (scholkopf2000support), Isolation Forest (liu2008isolation), and the Local Outlier Factor (LOF) (breunig2000lof), trained on both features and labels. Our introduced methods include model (MP) and data (DP) parametrized normal scales with linear (L) and deep (D) regressors. Likelihood parameters outperform baselines across these datasets.

For “ground truth” outlier labels we corrupt datasets without changing high-level dataset statistics. More specifically, we append noisy rows to the dataset formed by taking the data where of the scaled features and randomly re-sampling new labels from the full dataset. This jointly corrupts the data, , and target, , for a small fraction of the dataset. We found that our conclusions held with other types of corruptions such as randomly scaling inputs and labels, or adding noise. We treat outlier-detection as a supervised classification problem by reporting the area under the Precision-Recall curve for outlier scores.
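The evaluation described here, scoring each point and measuring the area under the Precision-Recall curve against known corruption labels, can be sketched with average precision, a common step-wise estimate of that area. The scores stand in for predicted scales or temperatures and are made-up values; the exact estimator used in the experiments is not specified in the text.

```python
def average_precision(scores, is_outlier):
    """Step-wise area under the precision-recall curve for outlier scores."""
    n_pos = sum(is_outlier)
    if n_pos == 0:
        raise ValueError("at least one outlier label is required")
    # Rank points from the highest score (most outlier-like) to the lowest.
    ranked = sorted(zip(scores, is_outlier), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank             # precision at this recall step
    return total / n_pos

# Hypothetical predicted scales for four points, two of which were corrupted.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```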

Test Train
crime .20 .30 .19 .41 .38 .19 .20 .19 .20 .47 .47 .78 .54
kinematics .12 .14 .16 .48 .48 .12 .14 .16 .13 .50 .50 .77 .89
bank .12 .15 .17 .52 .50 .12 .14 .16 .14 .47 .49 .94 .95
wine .12 .09 .10 .41 .51 .11 .09 .10 .12 .29 .49 .65 .66
mpg .19 .17 .25 .41 .43 .15 .15 .20 .14 .54 .52 .85 .86
soil .15 .15 .15 .43 .51 .15 .15 .15 .15 .42 .50 .87 1.0
fried .12 .14 .16 .50 .49 .12 .14 .16 .14 .50 .50 .85 .94
Table 8: Comparison of several outlier detection methods to deep (D) and linear (L) model-based temperatures (MT) and data temperatures (DT). We additionally perform this experiment on linear models.

Appendix J Datasets

The regression datasets used are sourced from the UCI Machine Learning repository (dua2017uci), the soil dataset (soil), and the fried dataset (fried). Inputs and targets are scaled to unit norm and variance prior to fitting for all regression experiments, and missing values are imputed using scikit-learn’s “SimpleImputer”. Large-scale image classification experiments leverage TensorFlow’s Dataset APIs, which include the SVHN, ImageNet (deng2009imagenet), CIFAR-100, CIFAR-10 (krizhevsky2009learning), and CelebA (liu2015faceattributes) datasets. The datasets used in ridge and lasso experiments are 500 samples of 500-dimensional normal distributions mapped through linear functions with additive Gaussian noise. Linear transformations use weights and LASSO experiments use sparse transformations.

Appendix K Models

K.1 Constraint

Many likelihood parameters have constrained domains, such as the normal variance $\sigma^2 \in (0, \infty)$. To evade the complexity of constrained optimization, we define unconstrained parameters $\phi_u$ and choose a transformation $c(\cdot)$ with inverse $c^{-1}(\cdot)$ to map to and from the constrained $\phi$. For positivity, the $\exp$/$\log$ parameterization is standard (kendall2017uncertainties; kendall2018multi; saxena2019data). However, this parameterization can lead to instabilities and we use the softplus, $\log(1 + \exp(\cdot))$, instead. Shifting the softplus further improves stability (see Figure 9). For a constrained interval $[a, b]$ we use affine transformations of the sigmoid (barron2019loss).
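A minimal sketch of such constraint functions and their inverses follows. The function names are ours, and the interval map is a plain affine rescaling of the sigmoid, assumed here for illustration.

```python
import math

def softplus(u):
    """Map an unconstrained value to (0, inf); numerically stable form."""
    return max(u, 0.0) + math.log1p(math.exp(-abs(u)))

def inverse_softplus(v):
    """Inverse of softplus for v > 0, written to avoid overflow for large v."""
    return v + math.log(-math.expm1(-v))

def sigmoid(u):
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def to_interval(u, low, high):
    """Map an unconstrained value into the interval (low, high)."""
    return low + (high - low) * sigmoid(u)

# The inverse gives the unconstrained value that yields a desired constrained one.
raw_for_unit_variance = inverse_softplus(1.0)
```

The inverse is what an initialization to a desired parameter value, such as a variance or temperature of one, would use.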

K.2 Regularization

Like model parameters, likelihood parameters can be regularized. For unsupervised experiments we consider weight decay, gradient clipping, and learning rate scaling for learning rate and multiplier . We inherit the existing setting of weight decay for model parameters and clip gradient norms at . We set the learning rate scaling multiplier to .
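The three regularizers above can be combined in a single update step, as in the hedged sketch below. The function name, the plain gradient-descent update, the order of operations (decay, then clipping, then scaled step), and all default values are assumptions for illustration; the paper's actual settings are not reproduced here.

```python
import math


def regularized_step(params, grads, lr, lr_multiplier=1.0,
                     weight_decay=0.0, clip_norm=None):
    """One gradient-descent step on likelihood parameters (illustrative only)."""
    # L2 weight decay: add weight_decay * p to each gradient entry.
    grads = [g + weight_decay * p for p, g in zip(params, grads)]
    # Gradient clipping: rescale so the global L2 norm is at most clip_norm.
    if clip_norm is not None:
        norm = math.sqrt(sum(g * g for g in grads))
        if norm > clip_norm:
            grads = [g * clip_norm / norm for g in grads]
    # Learning rate scaling: likelihood parameters use lr * lr_multiplier.
    step = lr * lr_multiplier
    return [p - step * g for p, g in zip(params, grads)]
```

For example, `regularized_step([0.0, 0.0], [3.0, 4.0], lr=1.0, clip_norm=1.0)` rescales the gradient from norm 5 to norm 1 before applying the step.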

K.3 Input

The regressor can take as input the data, the prediction, or a representation from the task model. Choosing the data makes the regressor independent of the task model, and using the prediction makes the regressor independent of the data given the task model. The latter is often referred to as meta-recognition (scheirer2011meta). Using an intermediate representation strikes a balance between the data and the task.

We focus on the deepest feature representation as input, from the penultimate layer of the network. This representation has a linear relationship to the prediction, in addition to a higher dimension, and should yield information on both prediction and data.

K.4 Architecture

The regressor can be a shallow, linear function or a deep, nonlinear function. To keep the likelihood prediction closely related to the task prediction, we only consider linear regressors and one to two-layer nonlinear regressors. Given the task model representation, the likelihood regressor is still a nonlinear function of the data, even if the regressor itself is linear.

Normalization can help control for differences in the dimension and magnitude of data and features. We normalize the input to the regressor to simplify application to different tasks, stabilize convergence, and promote invariance to the dimension and magnitude of the input. While there are a number of normalization schemes to choose from, we select L2 normalization because it is efficient and deterministic.
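As a rough illustration of a linear regressor with L2-normalized input, consider the sketch below. The function names and the use of the softplus as the output constraint are assumptions made for this example.

```python
import math


def l2_normalize(features, eps=1e-12):
    """Scale a feature vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(f * f for f in features))
    return [f / max(norm, eps) for f in features]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def linear_likelihood_regressor(features, weights, bias):
    """Predict a positive likelihood parameter from task-model features."""
    z = l2_normalize(features)
    unconstrained = sum(w * zi for w, zi in zip(weights, z)) + bias
    return softplus(unconstrained)
```

Because the input is normalized, multiplying the features by a positive constant leaves the predicted parameter unchanged, which is the invariance to input magnitude mentioned above.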

K.5 Optimization

The regressor parameters are optimized by backpropagation through the regressed likelihood parameters. The regressor weights are initialized by the standard Glorot (glorot2010understanding) or He (he2015delving) techniques with mean zero. The biases are initialized by the inverse of the parameter constraint function to the desired initial setting of the likelihood parameter. The default for variance and temperature is 1, for equality with the usual squared error and softmax cross-entropy.
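The bias initialization can be illustrated with a short sketch. It assumes the softplus as the constraint function and an initial variance of 1, the value at which the normal negative log-likelihood matches half the squared error up to an additive constant. All names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_inverse(y):
    """Inverse of softplus for y > 0."""
    return y + math.log(-math.expm1(-y))


def normal_nll(y, y_hat, variance):
    """Negative log-likelihood of y under a normal with mean y_hat."""
    return 0.5 * math.log(2.0 * math.pi * variance) + 0.5 * (y - y_hat) ** 2 / variance


# Desired initial variance (assumed to be 1 for this example).
initial_variance = 1.0
# Bias that the constraint maps to the desired value; with zero weights
# the regressor output is exactly softplus(bias).
bias = softplus_inverse(initial_variance)
start_variance = softplus(bias)
```

With this initialization, `normal_nll(y, y_hat, start_variance)` equals `0.5 * (y - y_hat) ** 2` plus the constant `0.5 * math.log(2 * math.pi)`, so training starts from the usual squared error objective.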

Regressor learning can be end-to-end or isolated. In end-to-end learning, the gradient w.r.t. the likelihood is backpropagated to the regressor’s input. In isolated learning, by contrast, the gradient is stopped at the input of the likelihood parameter model. Isolated learning of predicted parameters is closer to learning global and data parameters, which are independent of the task model and do not affect model parameters.
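The difference between the two modes can be made concrete with a toy scalar model and hand-written gradients. This is a sketch under simplifying assumptions, not the paper's setup: the task model is a single weight `w`, its representation doubles as the prediction, and the variance regressor is linear with a softplus constraint.

```python
import math


def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forward(w, v, b, x):
    h = w * x                        # task-model representation
    y_hat = h                        # prediction (identity head in this toy model)
    variance = softplus(v * h + b)   # regressed likelihood parameter
    return h, y_hat, variance


def loss(w, v, b, x, y):
    """Normal negative log-likelihood without the additive constant."""
    _, y_hat, variance = forward(w, v, b, x)
    return 0.5 * math.log(variance) + 0.5 * (y - y_hat) ** 2 / variance


def grad_task_weight(w, v, b, x, y, end_to_end):
    """Gradient of the loss w.r.t. the task weight w under either mode."""
    h, y_hat, variance = forward(w, v, b, x)
    residual = y - y_hat
    grad_h = -residual / variance    # path through the prediction
    if end_to_end:
        # Extra path through the regressor input; isolated learning stops it.
        dloss_dvar = 0.5 / variance - 0.5 * residual ** 2 / variance ** 2
        grad_h += dloss_dvar * sigmoid(v * h + b) * v
    return grad_h * x
```

In both modes the regressor's own parameters (`v` and `b` here) receive the same gradient; only the signal reaching the task model differs.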

K.6 Experimental Details

Regression experiments utilize Keras’ layers API with rectified linear unit (ReLU) activations and Glorot uniform initialization (relu; glorot). We use Keras implementations of DenseNet-121 (huang2017densely) and ResNet-50 (he2016deep) with default initializations. For image classifiers, we use Adam optimization (kingma2014adam) and train for 300 epochs with a batch size of 512. Data parameter optimization uses TensorFlow’s implementation of sparse RMSProp (tieleman2012lecture). We train regression networks with Adam for 3000 steps without minibatching. Deep regressors have a single hidden layer with 10 neurons, and recalibrated regressors have 2 hidden layers. We constrain normal variances and softmax temperatures using affine softplus transformations. Adaptive regularizer scales use parameterization. We run experiments on Ubuntu 16.04 Azure Standard NV24 virtual machines (24 CPUs, 224 GB memory, and 4 M60 GPUs) with TensorFlow 1.15 (tensorflow).

In our VAE experiments, our likelihood parameter model is a convolution on the last hidden layer of the decoder, which has the same resolution as the output. The low- and high-dimensional losses use the same convolutional regressor, but the 1-dimensional case averages over pixels. In the high-dimensional case, the output has three channels (for RGB), with six channels total for shape and scale regression. We use the same non-linearities to constrain the shape and scale outputs to reasonable ranges as in (barron2019loss): an affine sigmoid to keep the shape within a bounded interval and the softplus to keep the scale positive. Table 1 gives the results of evaluating each method by MSE on the validation set, while training each method with its respective loss parameters.