Scene Uncertainty and the Wellington Posterior of Deterministic Image Classifiers

06/25/2021 ∙ by Stephanie Tsuei, et al.

We propose a method to estimate the uncertainty of the outcome of an image classifier on a given input datum. Deep neural networks commonly used for image classification are deterministic maps from an input image to an output class. As such, their outcome on a given datum involves no uncertainty, so we must specify what variability we are referring to when defining, measuring and interpreting "confidence." To this end, we introduce the Wellington Posterior, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene that produced the given image. Since there are infinitely many scenes that could have generated the given image, the Wellington Posterior requires induction from scenes other than the one portrayed. We explore alternate methods using data augmentation, ensembling, and model linearization. Additional alternatives include generative adversarial networks, conditional prior networks, and supervised single-view reconstruction. We test these alternatives against the empirical posterior obtained by inferring the class of temporally adjacent frames in a video. These developments are only a small step towards assessing the reliability of deep network classifiers in a manner that is compatible with safety-critical applications.


1 Introduction

It appears that Deep Neural Networks (DNNs) can classify images as well as humans, at least as measured by common benchmark photo collections, yet small perturbations of the images can cause changes in the predicted class. Even excluding adversarial perturbations, simply classifying consecutive frames in a video shows variability inconsistent with the reported error rate (Fig.1). So, how much should we trust image classifiers? How confident should we be of the outcome rendered on a given image? There is a substantive literature on uncertainty quantification, including work characterizing the (epistemic and aleatoric) uncertainty of trained classifiers (Sect. 1.1). Such uncertainty is a property of the classifier, not of the outcome of classifying a particular datum. We are interested in ascertaining how “confident” to be in the response of a particular DNN model to a particular image, not generally how well the classifier performs on images from a given class.

Say we have an image $x$ and a DNN that computes a discriminant vector $\phi(x)$ with as many components as the number of classes (e.g., logits or softmax), from which it returns the estimated label “cat.” How sure are we that there is a cat in this image? Even if the classifier was wrong most of the time on the class “cat,” so long as it is confident that its answer on this particular image is correct, we would be content. If faced with the question “are you sure?” a human would take a second look, or capture a new image, and either confirm or express doubt. But a DNN classifier would return the same answer, correct or not, since most real-world deep networks in use today are deterministic maps from the input $x$ to the output $\hat y$.¹

¹ It is common, but incorrect, to use the value of the discriminant as a measure of confidence, since the Bayesian posterior is not a minimizer of the empirical risk.

Since the classifier is deterministic, and the image is given, the key question is with respect to what variability uncertainty should be evaluated. To address it, we introduce the Wellington Posterior (WP) of a deterministic image classifier, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene that produced the given image.

Simply put, there are no cats in images, only pixels. Cats are in the scene, of which we are given one or more images. The question of whether we are “sure” of the outcome of an image classifier is therefore of a counterfactual nature: had we been given different images of the same scene, would an image-based classifier have returned the same outcome?

Formally, given an image $x$, we would like a posterior probability of the estimated label, from which we can then measure confidence intervals, entropy, and other statistics commonly used to quantify uncertainty. What we are after, however, is not $P(y = k \mid x)$. Instead, it is $P(\hat y = k \mid \xi)$, where $\hat y$ is the label returned by the classifier and $\xi$ is one of the infinitely many scenes that could have yielded the given image $x$. The Wellington Posterior is based on the distribution of images that could be generated by the unknown scene portrayed in the given image $x$. We call it the Wellington Posterior in reference to Duke Wellington’s life endeavor to guess what is on the other side of the hill. Such a guess cannot be based directly on the given data, for one cannot see behind the hill, but on hypotheses induced from having seen behind other hills before. The scene $\xi$, whether real or virtual, is the vehicle that allows one to guess what is not known (other images of the same scene) from what is known (the given image $x$). The crux of the matter in generating the Wellington Posterior is then to characterize the scene, which we discuss in Sect. 2. In particular, in Sect. 2.1 we discuss a hierarchy of models to be compared empirically in Sect. 3, with others in the Appendix. Our contributions, in relation to prior work, are discussed next.

1.1 Related Work and Contributions

Our work is related to a vast body of literature on inferring the uncertainty of a classifier, including Uncertainty Quantification (UQ) in Experimental Design and Bootstrapping. We focus specifically on deep neural networks, where uncertainty is typically attributed to the model, and only indirectly to the outcome of inference. For instance, Bayesian Neural Networks BNNBook ; BNN_survey ; bayesian_svi ; ll_svi produce not a single discriminant, but a distribution that captures epistemic uncertainty, from which one can obtain a distribution of outcomes on which to compute confidence or other uncertainty-quantifying statistic. Similarly, Dropout mc_dropout can be used at test time to generate ensemble outcomes from the same datum, and aggregate statistics to generate uncertainty estimates. These methods are seldom used in practice as they multiply the inference cost by the size of the ensemble, which is typically prohibitive.

Other approaches quantify aleatoric uncertainty, or the variation of the output in response to specific variations of the input. For example, gast_lightweight_2018 ; kendall_what_2017 consider the effect on the output of (heteroscedastic) Gaussian noise in the input; loquercio_general_2020 combines gast_lightweight_2018 and mc_dropout into a single system to measure both epistemic and aleatoric uncertainty. Related in spirit to our approach is CLUE , which adds a layer of counterfactual questioning on top of a Bayesian neural network to quantify the sensitivity of uncertainty estimates to changes in the input.

Calibration approaches ignore any distinction between epistemic and aleatoric uncertainty and focus on quantifying the value $P(y = \hat y \mid c)$, where $c$ is the predicted confidence. Common approaches involve scaling the softmax vector to match the empirical distribution guo_calibration_2017 ; deep_ensembles . Bayesian approaches mc_dropout ; ll_svi ; bayesian_svi can also be used to produce confidence values, as in ovadia_can_2019 . The most common metric is the Expected Calibration Error (ECE), or the average distance between the predicted confidence and the true accuracy of samples with the same, or similar, value of $c$, averaged over a test dataset of independent and identically distributed (i.i.d.) samples. See nixon_measuring_2019 for a detailed discussion of ECE and several variations. Another approach is based on conformal prediction angelopoulos_uncertainty_2020 ; barber_predictive_2021 , where one does not estimate uncertainty directly, but rather predicts a set of answers $S(x)$ such that $P(y \in S(x)) \geq 1 - \alpha$, where $\alpha$ is a design parameter. The number of elements in the set reflects the uncertainty of the network with respect to that particular input.

Our work also relates to causality pearl2009causality in a very generic sense, due to the counterfactual nature of the Wellington Posterior. However, our work is significantly different both in the methods and in the underlying philosophy, as we focus on inductive inference outside an established system of truth.

Since we consider putative, or hypothetical, variations of the input, uncertainty of the classifier is related to sensitivity, or robustness, to the input. Therefore, our work also relates to efforts to analyze the robustness or sensitivity of classifiers to perturbations in the input, even if they do not explicitly address uncertainty. We are not interested in purposefully designed (e.g., adversarial) perturbations, but rather natural perturbations, for instance due to geometric (viewpoint) or photometric changes (illumination, weather, etc.): hendrycks_benchmarking_2018 introduces the CIFAR-C, ImageNet-C, and ImageNet-P benchmark test sets. The CIFAR-C and ImageNet-C test sets contain CIFAR-10 and ImageNet images corrupted with additive noise and blur. ImageNet-P contains short videos created by small amounts of panning and tilting of ImageNet images. ImageNetVid-Robust imagenetvidrobust is a test set of video frames for 30 classes from the 2015 ImageNet Object Tracking Challenge imagenetvid . Frames are selected so that the object of interest remains visible. Although similar in nature, there is more variability in ImageNetVid-Robust than in ImageNet-P. See manyfacesofrobustness for an overview. Proposed metrics for these benchmark datasets are functions of accuracy, including the Flip Probability for the ImageNet-P benchmark, i.e., the probability that the classification of consecutive frames differs. We use it as empirical ground truth to measure the uncertainty of the classifier with respect to real, not hypothetical, different images of the same scene.
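
Concretely, the flip probability of a video can be computed from the per-frame class predictions alone. The helper below is a minimal sketch of such a computation; the function name and interface are our own, not taken from the paper's code:

```python
import numpy as np

def flip_probability(preds: np.ndarray) -> float:
    """Fraction of consecutive-frame pairs whose predicted class differs."""
    if len(preds) < 2:
        return 0.0
    return float(np.mean(preds[1:] != preds[:-1]))

# Example: predictions for a 6-frame clip; 2 of the 5 transitions flip.
print(flip_probability(np.array([3, 3, 7, 7, 3, 3])))  # 0.4
```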

Contributions:

Unlike the work described above, ours does not propose a new measure of uncertainty nor a new way to calibrate the discriminant to match empirical statistics: we use standard statistics computed from the “posterior probability,” such as covariance and entropy, to measure uncertainty. The core of our work aims to specify what posterior to use to measure uncertainty. Since deep network image classifiers used in the real world are deterministic, the choice is consequential, and yet it is seldom addressed explicitly in the existing literature. The Wellington Posterior is introduced to explicitly characterize the variability with respect to which uncertainty is measured. This is our first contribution (Sect. 2).

The Wellington Posterior is not general: it does not apply to arbitrary data types, but is specific to images of natural scenes. No matter how many images we are given at inference time, there are infinitely many different scenes that are compatible with them, in the sense of possibly having generated them, including real and virtual scenes. Therefore, there are many different ways of constructing the Wellington Posterior. It is not our intention to give an exhaustive account of all possible ways, and it is beyond the scope of this paper to determine which is the “best” method, for that depends on the application (e.g., closed-loop operation vs. batch post-processing), the available data (e.g., a large static dataset vs. a simulator), the run-time constraints, etc. Our second contribution is to illustrate possible ways of computing the Wellington Posterior using a hierarchy of models (Sect. 2.1), including direct manipulation of the given images, with no knowledge of the scene, and perturbations of the classifier around a model pre-trained on different scenes. These are two extremes of the modeling spectrum. In between, one can infer a 3D model from the given image with any of the available single-view reconstruction methods (see fu2021single for a recent survey) and then use it to synthesize new images, or directly synthesize images with a GAN (see clark2019adversarial and references therein).

The third contribution is methodological: we show that the Wellington Posterior can be derived without sampling input images, but rather by looking at the effect of perturbations of the input images on the parameters of the model, and imputing a distribution on the latter using a closed-form linearization of the deep network around a pre-trained point (Sect. 2.2).

Figure 1: The predicted classes vary within a scene. A ResNet-101 trained on ImageNetVid should ideally classify each frame of an ImageNetVid-Robust video with the same class. Instead (left), the distance between the softmax vector of the anchor frame and those of the remaining frames shows significant spread, measured by the standard deviation as a function of the mean; the mean and standard deviation averaged over the dataset are computed for the exponentially-normalized (softmax) vectors. The “flip probability” hendrycks_benchmarking_2018 (center) measures how often the predicted class changes from frame to frame; the long tail of the histogram captures scene uncertainty. Finally, there is also a spread in the percentage of frames within a scene classified differently from the mode class (right). None of these numbers is reflective of the training error, validation error, or test error recorded in Appendix A.1.

2 Method

We start by introducing the nomenclature used throughout the rest of the paper:
– The scene $\xi$ is an abstraction of the physical world. It can be thought of as a sample from some distribution $P(\xi)$ that is unknown and arguably unknowable. The scene itself (not just its distribution) is arguably unknowable (the subject of Physics), except for some of its “attributes.”
– An attribute $y$ is a characteristic of the scene that belongs to a finite set (e.g., names of objects), $y \in \{1, \dots, K\}$. Note that there can be many scenes that share the same attribute(s) (intrinsic variability). For instance, $y$ can be the label “cat” and $P(\xi \mid y)$ is the distribution of scenes that contain a “cat.” Continuous, but finitely-parametrized, attributes are also possible, for instance related to shape or illumination.
– Extrinsic variability is an unknown transformation $g$ of the scene that changes its manifestation (the measurements, see next point) but not its attributes. It can be thought of as a sample from some nuisance distribution $P(g)$. For instance, extrinsic variability could be due to the vantage point of the camera, the illumination, partial occlusion, sensor noise, quantization, etc., none of which depends on whether the scene is labeled “cat.”²

² Note that there can be spurious correlations between the attribute and nuisance variability: an indoor scene is more likely to contain a cat than a beach scene. Nevertheless, if there were a cat on the beach, we would want our classifier to say so with confidence. The fact that nuisance variables can correlate with attributes on a given dataset may engender confusion between intrinsic and extrinsic variability. To be clear, phenomena that generate intrinsic variability would not exist in the absence of the attribute of interest: the pose, color, and shape of a cat do not exist without the cat. Conversely, ambient illumination (indoor vs. outdoor) exists regardless of whether there is a cat, even if correlated with its presence. The effect of nuisance variability on confidence, unlike that of intrinsic variability, is one of correlational, rather than causal, dependence.

– A measurement (image) $x$ is a known function of both the scene (and therefore its attributes) and the nuisances. We will assume that there is a generative model $h$ that, if the scene and the nuisances were known, would yield the measurement up to some residual (white, zero-mean, homoscedastic Gaussian) noise $n$:

$x = h(\xi, g) + n, \qquad n \sim \mathcal{N}(0, \sigma^2 I)$   (1)

For example, $h$ can be thought of as a graphics engine where all variables on the right-hand side are given.
– The discriminant $z = \phi(x)$ is a deterministic function of the measurement that can be used to infer some attributes of the scene. For instance, $\phi(x) = P(y \mid x)$ is the Bayesian discriminant (posterior probability). More generally, $\phi(x)$ could be any element of a vector (embedding) space.
– The estimated class is the outcome of the classifier, $\hat y = \arg\max_k \phi_k(x)$. Given an image $x$ and a classifier $\phi$, we reduce questions of confidence and uncertainty to the posterior probability $P(\hat y \mid x)$. In the absence of any variability in the deterministic estimator $\hat y$, defining uncertainty in the estimate requires assuming some kind of variability. The Wellington Posterior hinges on the following assumptions:

  • The class is an attribute of the scene and is independent of intrinsic and extrinsic variability, by their definition.

  • We posit that, when asking how confident we are about the estimated class $\hat y$, we do not refer to the uncertainty of $\hat y$ given that image, which is zero. Instead, we refer to the uncertainty of the estimated class with respect to the variability of all possible images of the same scene, which could have been obtained by changing nuisance (extrinsic) variability.

In other words, if in response to an image a classifier returns the label “cat,” the question is not how sure to be about whether there is a cat in the image. The question is how sure to be that there is a cat in the scene portrayed by this image. For instance, if instead of the given image one were given a slightly different one, captured slightly earlier or a little later, and the classifier returned “dog,” would one be less confident in the answer than if it had also returned “cat”? Intuitively, yes. Hypothetical repeated trials would involve not running the same image through the classifier over and over, but capturing different images of the same scene, and running each through the classifier.

Intrinsic variability does not figure in the definition of the Wellington Posterior. The fact that we are given one image implies that we are interested in one scene, the one portrayed in the image. Even though there are infinitely many scenes compatible with it, the given image defines an equivalence class of extrinsic variability. So, the question of how sure we are of the answer “cat” given an image is not how frequently the classifier correctly returns the label “cat” on different images of different scenes that contain different cats. It is a question about the particular scene portrayed by the image we are given, with the given cat in it. The goal is not $P(\hat y = \text{“cat”} \mid y = \text{“cat”})$, which would be how frequently we say “cat” when there is one (in some scene). We are interested in this scene, the one portrayed by the image. Written as a Markov chain, we have

$y \to \xi \to x \to z = \phi(x) \to \hat y$   (2)

where the first arrow includes intrinsic variability (a particular attribute is shared by many scenes) and the second arrow includes nuisance/extrinsic variability (a particular scene can generate infinitely many images). The last two arrows are deterministic. We are interested only in the variability in the second arrow. To compute the Wellington Posterior, we observe that

$P(\hat y = k \mid \xi) = \int \mathbb{1}\{\hat y(x') = k\}\; dP(x' \mid \xi) \approx \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{\hat y(x_i) = k\}, \qquad x_i = h(\xi, g_i), \ g_i \sim P(g)$   (3)

That is, given samples $g_i \sim P(g)$ from the nuisance variability, or sample images generated by changing nuisance variability, $x_i = h(\xi, g_i)$, we can compute the probability of a particular label by counting the frequency of that label in response to different nuisance variability. We defer the question of whether the given samples are fair or sufficiently exciting. In the expression above, $\hat y$ is computed by the given DNN classifier, $g$ is drawn from a chosen class of nuisance transformations, and $h$ is an image formation model that is also chosen by the designer of the experiment. What remains to be determined is how to create a “scene” $\xi$ from the given image $x$.
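
In code, the sampling approximation in equation (3) amounts to rendering images of the (imputed) scene under different nuisances and counting predicted labels. The sketch below assumes hypothetical `render`, `classifier`, and `sample_nuisance` callables standing in for $h$, $\hat y$, and $P(g)$; it illustrates the estimator and is not the paper's implementation:

```python
import numpy as np

def wellington_posterior(image, render, classifier, sample_nuisance,
                         num_samples=100, num_classes=1000):
    """Monte-Carlo estimate of Eq. (3): frequency of each predicted label
    over images the scene could have generated under sampled nuisances."""
    counts = np.zeros(num_classes)
    for _ in range(num_samples):
        g = sample_nuisance()            # g_i ~ P(g), e.g. a small viewpoint change
        x_i = render(image, g)           # plays the role of h(xi, g_i)
        counts[classifier(x_i)] += 1     # count the label returned by the DNN
    return counts / num_samples          # categorical Wellington Posterior
```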

As we defined it, the scene is an abstraction of the physical world. Such an abstraction can live inside the memory of a computer. Since a scene is only observed through images of it, if a synthetic scene generates images that are indistinguishable from those captured of a physical scene, the real and synthetic scenes – while entirely different objects – are equivalent for the purpose of computing the Wellington Posterior. Thus a “scene” could be any generative model that can produce images that are indistinguishable from real ones, including the given one. Different images are then obtained by perturbing the scene with nuisance variability.

Depending on how we measure “indistinguishable,” and how sophisticated a class of nuisance variables we can conceive of, we have a hierarchy of increasingly complex models. In addition, depending on how broadly we sample nuisance variables, that is, depending on $P(g)$, we have a more or less representative sample from the Wellington Posterior. Below we outline the modeling choices tested in Sect. 3 and in the appendix.

2.1 Modeling Hierarchy

Data Augmentation.

The simplest model we consider interprets the scene as a flat plane on which the given image is painted. Correspondingly, different images can be generated from $x$ via data augmentation, which consists of typically small group transformations of the image itself: $x_i = g_i \cdot x$, where $g_i$ belongs to the group of diffeomorphisms of the domain of the image, approximated by local affine planar transformations, composed with affine transformations of the range space of the image, also known as contrast transformations. These include small translations, rotations, and rescalings of the image, as well as changes of the colormap. In addition to affine domain and range transformations, one can also add i.i.d. Gaussian noise and paste small objects into the image, to simulate occlusion nuisances. Data augmentation is typically used to train classifiers to be robust, or insensitive, to the chosen class of transformations, which corresponds to a rudimentary model of the scene, so we chose it as the baseline in our experiments in Sect. 3.

Explicit 3D scene reconstruction.

The model of the scene implicit in data augmentation does not take into account parallax, occlusions, and other fundamental phenomena of image formation. One step more general, we could use knowledge of the shape and topology of other scenes, manifest in a training set, to predict (one or more) scenes from a single image. Since there are infinitely many, the one(s) we predict are a function of the inductive biases, implicit and explicit, in the method. There are hundreds of single-view reconstruction methods, including the so-called “photo pop-up” hoiem2005automatic used for digital refocusing and computational photography, well beyond what we can survey in this paper. Our limited experiments show performance similar to data augmentation. Formally, $\hat\xi$ is a scene compatible with the given image $x$, represented as a sample from a distribution that depends on a dataset of scenes and corresponding images of them. These can be obtained through some ground-truth mechanism, for instance a range sensor or a simulation engine.

Conditional Generative Adversarial Networks.

Finally, Conditional Generative Adversarial Networks, which generate similar images or videos conditioned on a given image, may also serve as a source of imputation. In particular, cGAN was designed for data augmentation.

(a) Data Augmentation
(b) Dropout
(c) Ensemble
(d) LQF
Figure 2: Wellington Posteriors vs. Empirical Paragon. For ImageNetVid-Robust, for each of the proposed WPs, as well as the empirical paragon, we compute the mean and sample covariance of the discriminants (logits). We then measure the distance from each mean to the empirical one (abscissa), and the Frobenius norm of the difference between each covariance and the empirical paragon (ordinate), using a ResNet-101 trained on ImageNetVid anchor frames (the middle frame of each video). The smaller and farther left the scatter, the closer the match to the empirical statistics. The posterior computed with LQF yields the closest approximation of the empirical posterior.

2.2 Analytical Posterior through Linearization.

All the models above aim to generate a sample from the distribution $P(x \mid \xi)$ of images of the scene. We showed a few examples of possible methods for generating a scene and sampling from it, none viable in practice since they would require explicit sampling. Instead, we seek analytical expressions of the Wellington Posterior that do not require multiple inference runs. Recent developments in network linearization suggest that it is possible to perturb the weights of a trained network locally to perform novel tasks essentially as well as non-linear optimization/fine-tuning achille_lqf_2020 . Such linearization, called LQF, is with respect to perturbations of the weights of the model. Therefore, we use the closed-form analytical Jacobian of the linearized network to compute first- and second-order statistics (mean and covariance) of the discriminant, without the need for sampling. The first-order approximation of the network around the pre-trained weights $w_0$ is

$\phi_w(x) \approx \phi_{w_0}(x) + \nabla_w \phi_{w_0}(x)\,(w - w_0) = \phi_{w_0}(x) + J\,(w - w_0)$   (4)

In equation (4), the fine-tuned weights $w^*$ have a closed-form solution that is a function of the training data. Then, equation (4) can be used to define a stochastic network with weights distributed according to $w \sim \mathcal{N}(w^*, \Sigma_w)$ for any desired covariance $\Sigma_w$. Correspondingly, the discriminant, i.e., the vector of logits $z$, can be assumed to be distributed as

$z \sim \mathcal{N}\big(\phi_{w^*}(x),\, \Sigma_z\big), \qquad \Sigma_z = J \Sigma_w J^\top$   (5)

Given a diagonal or block-diagonal value of $\Sigma_w$, we compute $\Sigma_z$ using two forward passes and one backward pass per class (i.e., $2K$ forward and $K$ backward passes) per input image over the neural network. This is not a fast computation, but it requires far fewer resources than a naive implementation. More implementation details are given in Appendix B.

Next, given the distribution of $z$ in (5), the probability that class $k$ is the predicted class is given by:

$P(\hat y = k) = P\big(z_k > z_j \ \forall\, j \neq k\big) = \int_{\{z \in \mathbb{R}^K :\; z_k \geq z_j \ \forall j\}} p(z)\, dz$   (6)

The above is a $K$-dimensional integral; the lower limit of integration in every dimension is $-\infty$, while the upper limit is variable. For example, for $K = 3$ classes, $k = 1$, and $z = (z_1, z_2, z_3)$, equation (6) is:

$P(\hat y = 1) = \int_{-\infty}^{\infty} \int_{-\infty}^{z_1} \int_{-\infty}^{z_1} p(z_1, z_2, z_3)\; dz_3\, dz_2\, dz_1$   (7)

where $p$ is the density of the Gaussian in equation (5), with mean $\mu = \phi_{w^*}(x)$:

$p(z) = \frac{1}{\sqrt{(2\pi)^K \det \Sigma_z}} \exp\!\big( -\tfrac{1}{2} (z - \mu)^\top \Sigma_z^{-1} (z - \mu) \big)$   (8)

Equation (6) is challenging to compute even for a relatively small number of classes, even with Monte-Carlo sampling methods. However, if $\Sigma_z$ is a diagonal matrix, equation (6) can be reduced to:

$P(\hat y = k) = \int_{-\infty}^{\infty} \varphi(t) \prod_{j \neq k} \Phi\!\left( \frac{\sigma_k t + \mu_k - \mu_j}{\sigma_j} \right) dt$   (9)

where $\sigma_j^2$ is the $j$-th element along the diagonal of $\Sigma_z$, $\mu_j$ is the $j$-th element of its mean, and $\varphi$ and $\Phi$ denote the standard normal density and cumulative distribution function. The full derivation is given in Appendix B. Equation (9) has no closed form that we know of, but it can be numerically approximated using a quadrature method.
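
As a concrete illustration, the one-dimensional integral in equation (9) can be evaluated with off-the-shelf quadrature. The sketch below assumes independent logits with the stated means and standard deviations; the example numbers are arbitrary, and the helper is our own illustration rather than the paper's code:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def class_probability(mu, sigma, k):
    """Quadrature evaluation of Eq. (9): P(argmax_j z_j = k) when the logits
    z_j ~ N(mu[j], sigma[j]^2) are independent (diagonal covariance)."""
    others = [j for j in range(len(mu)) if j != k]

    def integrand(t):
        p = norm.pdf(t)                                   # standard normal density
        for j in others:                                  # P(z_j < sigma_k * t + mu_k)
            p *= norm.cdf((sigma[k] * t + mu[k] - mu[j]) / sigma[j])
        return p

    value, _ = quad(integrand, -np.inf, np.inf)
    return value

mu, sigma = np.array([2.0, 0.5, -1.0]), np.array([1.0, 1.5, 0.8])
probs = [class_probability(mu, sigma, k) for k in range(3)]
print(probs, sum(probs))  # the three class probabilities sum to ~1
```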

For values of $\Sigma_z$ that are not well approximated by their diagonal, statistics of the distribution in equation (5) can be approximated through sampling and compared to the empirical paragon in Section 3. This is, of course, not ideal, but it is computationally fast.

Figure 3: Distribution of distances between the empirical and predicted categorical distributions for a ResNet-101 trained on ImageNetVid anchor frames. For each scene, we compute the empirical class distribution (counting the occurrence of predicted classes across the frames of the video) and compare it against the predicted class distribution. Histogram (left) and box plot (right) of the distances between the predicted and empirical class distributions. LQF has the lowest skew.

3 Experiments

Datasets.

We performed experiments on two sources of data: Objectron objectron2021 , and the ImageNetVid and ImageNetVid-Robust datasets described in Sect. 1.1. Objectron objectron2021 contains short video clips of scenes from 9 classes: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. We split it into a training and a validation set, with an equal number of videos of each class in each split. Then, we selected a subset of 5000 videos for the training set, 500 videos for the validation set, and 500 for testing.

Base Discriminant.

In order to compute the empirical posterior and the various forms of the Wellington Posterior, we need a discriminant function $\phi$ from which to build a classifier. We use the backbone of an ImageNet-pretrained ResNet-50 or ResNet-101 resnet , where $z = \phi(x)$ is called the vector of logits, whose maximizer is the selected class and whose normalized exponential is called the softmax vector. With an abuse of notation, when clear from the context, we refer to $\phi(x)$ as both the vector of logits and the softmax. Fine-tuning and validation for ResNet-50 on Objectron, and ResNet-101 on ImageNetVid, used one frame from each scene. The focus is not to achieve the highest possible accuracy, but to provide a meaningful estimate of uncertainty relative to the variability of different images of the same scene. For this reason, we select the most common, not the highest-performing, baseline classifier. We achieve > 95% validation accuracy for Objectron and 75% validation accuracy for ImageNetVid.

Ground Truth and Empirical Baseline.

Validating the Wellington Posterior estimated from a single image requires a ground-truth distribution. This can be obtained empirically by computing different discriminants, and correspondingly the distribution of different outcomes, from an actual sample of different images of the same scene. For this purpose, we use short videos from the Objectron and ImageNetVid-Robust datasets. Such short videos are not a fair sample from the population of all possible images that the scene could have generated, so the resulting statistics cannot be considered proper ground truth. However, the video represents an empirical baseline in line with the definition of the Wellington Posterior, so we adopt it as a paragon for evaluation. For each frame, we compute the discriminant vector of logits, its softmax-normalized version, and the highest-scoring hypothesis. From these, we can compute summary statistics such as the frequency of each label in the categorical distribution, or the mean and covariance of the logits as if coming from a Gaussian distribution in $K$-dimensional space. Our experiments focus on estimating the mean and covariance of the logits and the frequencies of each label in the categorical distribution.
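
For reference, the per-scene summary statistics of the empirical paragon can be computed directly from the per-frame logits; a minimal sketch (our own helper, not the released code) is:

```python
import numpy as np

def empirical_paragon(frame_logits: np.ndarray, num_classes: int):
    """Summary statistics of the logits of one video clip.

    frame_logits has shape (num_frames, num_classes). Returns the mean and
    sample covariance of the logits, and the empirical categorical
    distribution of the per-frame predicted classes."""
    mean = frame_logits.mean(axis=0)
    cov = np.cov(frame_logits, rowvar=False)          # (K, K) sample covariance
    preds = frame_logits.argmax(axis=1)               # per-frame predicted class
    categorical = np.bincount(preds, minlength=num_classes) / len(preds)
    return mean, cov, categorical
```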

                       Objectron (ResNet-50)                                       ImageNetVid (ResNet-101)
                       Categorical Dist.   Logit Mean        Logit Covariance      Categorical Dist.   Logit Mean        Logit Covariance
Data Augmentation      0.0789 ± 0.0038     5.4253 ± 0.1225   15.6404 ± 0.1681      0.2300 ± 0.0123     3.9731 ± 0.1063    9.0627 ± 0.4604
Dropout                0.0512 ± 0.0048     3.7304 ± 0.0743   12.8472 ± 0.6759      0.1746 ± 0.0042     3.7304 ± 0.0743   12.8472 ± 0.6759
Deep Ensembles         0.0575 ± 0.0028     5.6469 ± 0.1106   19.2226 ± 0.2474      0.2160 ± 0.0102     4.4732 ± 0.1331   19.2226 ± 0.2474
LQF                    0.0579 ± 0.0034     3.2624 ± 0.0293    7.3136 ± 0.2869      0.1220 ± 0.0029     3.2624 ± 0.0293    7.3136 ± 0.2869
Monocular Depth Est.   0.0825 ± 0.0075     5.0966 ± 0.0634   16.0126 ± 0.3195      -                   -                 -
Table 1: Means and standard deviations of the distances between each WP and the empirical paragon for ResNet-101 on ImageNetVid-Robust and ResNet-50 on Objectron. Means and standard deviations are over three trials with three sets of neural networks and an ensemble size of 20. The lowest number is boldfaced. The WP computed with LQF yields the lowest logit mean and covariance distances, and also produces the categorical distribution closest to the empirical one for ImageNetVid-Robust, but not for Objectron. Objectron is an easy classification problem where almost all images from all scenes are classified correctly; therefore, all methods perform similarly.

Metrics.

We use three metrics to validate predicted Wellington Posteriors against the empirical distribution computed using multiple frames per video: the distance between the mean logit of the WP and the empirical mean logit for each scene, the Frobenius distance between the sample covariances of the empirical and predicted logits, and the distance between the probability vectors of the empirical and predicted categorical distributions. These metrics illustrate first- and second-order differences between the WP and the empirical distribution of the logits, and a first-order difference in the distribution of the predictions.
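
A minimal sketch of these three comparisons is given below; it assumes the Euclidean norm for the logit means and the sum of absolute differences for the categorical distributions, which are our own choices where the text leaves the norms unspecified:

```python
import numpy as np

def wp_metrics(mean_wp, cov_wp, cat_wp, mean_emp, cov_emp, cat_emp):
    """Distances between WP and empirical statistics for one scene:
    mean-logit distance, Frobenius covariance distance, and distance
    between categorical distributions (norm choices assumed, see text)."""
    d_mean = np.linalg.norm(mean_wp - mean_emp)            # Euclidean (assumed)
    d_cov = np.linalg.norm(cov_wp - cov_emp, ord="fro")    # Frobenius
    d_cat = np.abs(cat_wp - cat_emp).sum()                 # L1 (assumed)
    return d_mean, d_cov, d_cat
```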

Common Pseudo-Uncertainty and Confidence Scores.

As we pointed out in Footnote 1, it is common to use the values of the softmax vector as a proxy for uncertainty. In some cases, the vector is “temperature-scaled” guo_calibration_2017 ; in other cases, the entropy of the softmax vector or the energy liu_energy-based_2020 is used as a proxy for uncertainty. Additional experiments on isotonic calibration and the use of auxiliary networks, detailed in Appendix C, indicate that there is no strong connection between these pseudo-uncertainties and the empirical posterior.
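
For reference, these common pseudo-uncertainties can all be computed from a single logit vector; the sketch below, our own illustration with the temperature as a free parameter, shows the temperature-scaled softmax confidence, the softmax entropy, and the energy score:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def pseudo_uncertainties(logits: np.ndarray, T: float = 1.0):
    """Confidence proxies from one logit vector: max (temperature-scaled)
    softmax probability, softmax entropy, and the energy score."""
    p = softmax(logits / T)
    confidence = p.max()                        # max softmax probability
    entropy = -(p * np.log(p + 1e-12)).sum()    # entropy of the softmax vector
    energy = -T * logsumexp(logits / T)         # energy score
    return confidence, entropy, energy
```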

Data Augmentation.

To compute results for the simple baseline, we use the Albumentations imaging library buslaev_albumentations_2020 to perform data augmentation using the following procedure (a sketch of such a pipeline follows the list):

  • Horizontal flip with probability 0.5

  • Shift by up to 20 percent of image size in both horizontal and vertical directions

  • Scale randomly selected from

  • Safe rotation randomly selected from degrees

  • center crop
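
A minimal Albumentations pipeline along these lines is sketched below; the scale range, rotation range, and crop size are placeholders, since the exact values are not reproduced above:

```python
import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.2,     # up to 20% of the image size
                       scale_limit=0.1,     # placeholder scale range
                       rotate_limit=15,     # placeholder rotation, in degrees
                       p=1.0),
    A.CenterCrop(height=224, width=224),    # placeholder crop size
])

# One augmented sample, i.e., one image the "flat scene" could have produced:
# augmented = augment(image=image)["image"]
```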

Model-based Uncertainty.

Small perturbations of the image are dual to small perturbations of the weights in the first layer, so one could generate a posterior estimate over outcomes for a single image by considering variability of the classifier. Bayesian Neural Networks and VAEs kingma2019introduction operate on this principle, but are not in use as image classifiers due to their lower performance and higher cost. As a representative of this class of methods, we consider ensembles of 20 neural networks, formed by withholding a random 5% of the training data and through random shuffling of the data. The Wellington Posterior generated from an ensemble is the distribution of class predictions among the 20 members’ responses to a single frame, shown in Fig. 2 (c). We also use a (pseudo-)ensemble generated by Monte-Carlo Dropout in Fig. 2 (b).
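
The ensemble-based WP reduces to counting the members' predictions on the single given frame; a minimal PyTorch sketch (our own helper) is:

```python
import torch

@torch.no_grad()
def ensemble_wp(models, image, num_classes):
    """Categorical WP from an ensemble: the distribution of predicted classes
    among the members' responses to a single image of shape (1, C, H, W).
    Models are assumed to already be in the desired mode (eval for a plain
    ensemble; train, so that dropout stays active, for MC-Dropout)."""
    counts = torch.zeros(num_classes)
    for model in models:
        pred = model(image).argmax(dim=1).item()   # this member's predicted class
        counts[pred] += 1
    return counts / len(models)
```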

Single-view 3D reconstruction.

One step beyond data augmentation is to manipulate (a coarse reconstruction of) the scene, rather than the image directly. We use VOICED to generate a disparity map, and thence a depth map, from a single image – based on a training set of images and corresponding ground-truth depth obtained from a range sensor.³ This yields a point estimate of the scene $\hat\xi$, inferred using the dataset on which the single-view reconstruction model was trained. We then apply small spatial deformations, i.e., small rotations and translations of the vantage point, and render images through standard texture mapping, including modulation of the range space with small contrast transformations of the color map. Performance of this approach relative to the baseline is reported in Table 1 for the Objectron dataset. Existing off-the-shelf single-view reconstruction methods did not yield sensible images on ImageNetVid-Robust; example results are shown in Appendix A.3.

³ Objectron is a 3D tracking dataset and VOICED is a depth-completion network, so we leveraged the sparse depth measurements associated with the anchor frame, but no other frames, to produce a depth map. We chose to make this slight deviation from our model hierarchy because it produced realistic images without extensive fine-tuning experiments; this experiment should be considered an upper bound on the performance of single-view 3D reconstruction for Objectron. More details are given in Appendix A.2.

Conditional Generative Adversarial Networks.

State of the art Conditional GANs do not generalize well to out-of-distribution data. Appendix A.4 contains some examples of generated images from a pretrained network in cGAN conditioned on Objectron anchor images. Since extensive experiments are beyond our scope here, we do not investigate GANs further.

Implicit Local Generation.

The viability of the method we propose to impute uncertainty to a deterministic classifier hinges on the ability to produce an estimate without having to sample multiple inference runs at test time. Recent work on model linearization around a pre-trained point achille_lqf_2020 has shown that it is possible to obtain performance comparable to that of full-network non-linear fine-tuning. In this sense, LQF can be used as a baseline classifier instead of the pre-trained network. The main advantage is that LQF allows explicit computation of the covariance of the discriminant without the need to sample multiple input data. The covariance of the discriminant is computed from (5) with $\Sigma_w$ a design parameter, which we choose to match the distribution generated by the ensemble. We can then compute the distribution of $z$ in (5) using $2K$ forward passes and $K$ backward passes, and a categorical distribution using equation (9). The WP generated using LQF is compared to the empirical paragon in Fig. 2 (d). Quantitative results are summarized in Table 1.

4 Discussion

Our method has many limitations: First, there is no “right” model, so evaluating the WP, and even choosing a veritable “ground truth,” presents challenges. Our empirical analysis is limited by the availability of datasets in the public domain, none of which comprises a fair sample of the distribution of images of the same scene. Nonetheless, 20 frames of video provide better support for uncertainty validation than a single image. A better approach would be to have a fully controlled, realistic simulator, in combination with more limited sets of real data.

Second, for sampling-based methods, we have no way of ascertaining whether the sample generated from the (implicit or explicit) scene is sufficiently exciting, in the sense of exciting all the modes of variation of the Wellington Posterior. The only evidence we have thus far is that such variability is better aligned with the empirical distribution than simple image-based data augmentation. More importantly, sampling-based approaches are prohibitive in reality, especially for closed-loop operation, which is where uncertainty estimates are most important.

Third, the only approach we have for non-sampling approximation of the WP is based on linearization. While models beyond linear are unlikely to be tractable, there may still be better models for inferring second-order statistics, either directly from the primary trained network, as done in Variational Autoencoders, or through an auxiliary network.

The weakest aspect of our modeling framework is the limitation of some models to small nuisance perturbations. This is mildly relaxed for single-view 3D reconstruction (scene transformations beyond small ones create gaps and holes in the reconstruction), and less so for CPNs and GANs. Again, the linearized model is constrained to transformations of the images induced by small perturbations in the weights. This essentially guarantees that the Wellington Prior will not cover the domain of the true posterior.

We have provided initial empirical evaluation, but a thorough and rigorous testing is beyond the scope of this paper and will have to be performed in an application-specific manner. For example, evaluating the uncertainty of a label associated to an image to decide whether to feed that image to a human annotator is an entirely different problem than evaluating uncertainty for the purpose of deciding whether to drive through a city street. We have focused our analysis on the extremes: A baseline based on data augmentation, ensembles, and an analytical characterization of the WP. More methods based on explicit generation, such as CPNs, GANs, or single-view 3D reconstruction, can be further explored. However, all these methods would require explicit sampling of images and repeated inference runs to generate the empirical statistics, which is unfeasible in real-world applications.

Inspection of the Markov chain in Sect. 2 suggests that the Wellington Posterior goes against the Data Processing Inequality: $x$ contains no more information than $\xi$, including about the uncertainty of the outcome $\hat y$. This, however, is not the case, as there is inductive bias in the way the scene is constructed, which typically involves additional information manifest in the training set or in the design of the nuisance variability used in the rendering process $h$. In a sense, the Wellington Prior attributes an uncertainty to the current image by associating it to scenes, other than the present one, that might have yielded images similar to the one given, sometime in the past – represented by the training set. Beyond the additional information from the training set, the Wellington Posterior acts as a regularization mechanism by exploiting regularities from different scenes to hypothesize variability of images of the present one.

References

  • [1] Alessandro Achille, Aditya Golatkar, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. LQF: Linear Quadratic Fine-Tuning. arXiv:2012.11140 [cs, stat], December 2020. arXiv: 2012.11140.
  • [2] Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [3] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty Sets for Image Classifiers using Conformal Prediction. September 2020.
  • [4] Javier Antoran, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel Hernández-Lobato. Getting a CLUE: A Method for Explaining Uncertainty Estimates. September 2020.
  • [5] Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, February 2021. Publisher: Institute of Mathematical Statistics.
  • [6] Alexander Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, and Alexandr A. Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, February 2020. arXiv: 1809.06839.
  • [7] Lucy Chai, Jun-Yan Zhu, Eli Shechtman, Phillip Isola, and Richard Zhang. Ensembling with deep generative views. In CVPR, 2021.
  • [8] Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
  • [9] Kui Fu, Jiansheng Peng, Qiwen He, and Hanxiao Zhang. Single image 3d object reconstruction based on deep learning: A review. Multimedia Tools and Applications, 80(1):463–498, 2021.
  • [10] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning, pages 1050–1059, June 2016.
  • [11] Jochen Gast and Stefan Roth. Lightweight Probabilistic Deep Networks. pages 3369–3378, 2018.
  • [12] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 1321–1330. JMLR.org, 2017. event-place: Sydney, NSW, Australia.
  • [13] Fredrik K. Gustafsson, Martin Danelljan, and Thomas B. Schon. Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision. pages 318–319, 2020.
  • [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. pages 770–778, 2016.
  • [15] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. arXiv:2006.16241 [cs, stat], August 2020. arXiv: 2006.16241.
  • [16] Dan Hendrycks and Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. September 2018.
  • [17] Derek Hoiem, Alexei A Efros, and Martial Hebert. Automatic photo pop-up. In ACM SIGGRAPH 2005 Papers, pages 577–584. 2005.
  • [18] Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.
  • [19] Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
  • [20] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
  • [21] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based Out-of-distribution Detection. Advances in Neural Information Processing Systems, 33, 2020.
  • [22] Antonio Loquercio, Mattia Segu, and Davide Scaramuzza. A General Framework for Uncertainty Estimation in Deep Learning. IEEE Robotics and Automation Letters, 5(2):3153–3160, April 2020.
  • [23] Radford M. Neal. Bayesian Learning for Neural Networks. Springer Science & Business Media, December 2012.
  • [24] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring Calibration in Deep Learning. pages 38–41, 2019.
  • [25] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32:13991–14002, 2019.
  • [26] Judea Pearl. Causality. Cambridge university press, 2009.
  • [27] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. February 2018.
  • [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [29] Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. Do Image Classifiers Generalize Across Time? arXiv:1906.02168 [cs, stat], December 2019. arXiv: 1906.02168.
  • [30] Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches. February 2018.
  • [31] Alex Wong, Xiaohan Fei, Stephanie Tsuei, and Stefano Soatto. Unsupervised Depth Completion From Visual Inertial Odometry. IEEE Robotics and Automation Letters, 5(2):1899–1906, April 2020. Conference Name: IEEE Robotics and Automation Letters.
  • [32] Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In The IEEE International Conference on Computer Vision (ICCV), 2019.
  • [33] Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’02, pages 694–699, New York, NY, USA, July 2002. Association for Computing Machinery.

Appendix A Additional Experiment Details and Results

A.1 Network Hyperparameters and Classifier Performance

Hyperparameters, training error, and validation error for training standard image classifiers using the cross-entropy loss are given in Table 2. Hyperparameters, training error, and validation error for training LQF networks are given in Table 3. Pretrained ImageNet networks were fine-tuned for one epoch on ImageNet with ReLUs replaced by LeakyReLUs prior to LQF, with the hyperparameters in Table 4. When training all networks, we saved checkpoints along with their training and validation accuracy every 10 epochs. We used a combination of training and validation errors to select an appropriate checkpoint for the rest of the experiments. Networks trained with dropout for the dropout experiments used the same hyperparameters as the standard networks. Training and validation errors are shown in Table 5.

To train ensembles of standard networks and LQF networks, we used the same hyperparameters and procedure as for the baseline networks. The only difference is that each ensemble member saw a different data shuffling and a different 90% fraction of the data.

Test accuracy on the Objectron test set was 0.9607 ± 0.0015 for all frames and 0.9693 ± 0.0076 for only the anchor frames. Test accuracy on ImageNetVid-Robust was 0.7187 ± 0.0084 for all frames and 0.7207 ± 0.0139 for only the anchor frames. These test errors do not reflect the variation shown in Figures 1 and 7 for ImageNetVid-Robust and Objectron, respectively.

Parameter                              Objectron                 ImageNetVid
Architecture                           ResNet-50                 ResNet-101
Initial Learning Rate                  0.01                      0.001
Momentum                               0.9                       0.9
Weight Decay                           1e-5                      1e-5
Gamma (learning rate decay)            0.1                       0.1
Milestones                             25, 35                    25, 35
Max Epochs                             50                        50
Batch Size                             32                        32
Selected Epoch                         50, 40, 40                30, 50, 50
Training Error at Selected Epoch       0.0326, 0.0366, 0.0318    0.0478, 0.0478, 0.0476
Validation Error at Selected Epoch     0.0320, 0.0220, 0.0320    0.2747, 0.2711, 0.2912
Table 2: Hyperparameters and decision variables used in fine-tuning image classifiers on the Objectron and ImageNetVid datasets using stochastic gradient descent and the cross-entropy loss, for all three trials.

Parameter                              Objectron                 ImageNetVid
Architecture                           ResNet-50                 ResNet-101
Initial Learning Rate                  5e-4                      5e-4
Weight Decay                           1e-5                      1e-5
Max Epochs                             50                        50
                                       15.0                      15.0
Batch Size                             32                        32
Selected Epoch                         50, 40, 50                40, 30, 40
Training Error at Selected Epoch       0.0186, 0.0220, 0.0180    0.0500, 0.0542, 0.0457
Validation Error at Selected Epoch     0.0000, 0.0200, 0.0280    0.2692, 0.2601, 0.2674
Table 3: Hyperparameters and decision variables used in fine-tuning image classifiers on the Objectron and ImageNetVid datasets using LQF. Our LQF procedure used the AdamW optimizer with a one-hot mean squared error loss.
Parameter                              Value (ResNet-50)         Value (ResNet-101)
Batch Size                             32                        32
Learning Rate                          1e-3                      1e-3
ReLU Leak                              0.2                       0.2
Training Error                         0.43                      0.40
Validation Error                       0.41                      0.31
Table 4: Hyperparameters used to fine-tune pretrained ResNet-50 and ResNet-101 networks on ImageNet, with LeakyReLUs replacing ReLUs, prior to LQF, using the cross-entropy loss. Note that the goal of this fine-tuning was to slightly adjust pretrained weights towards values that accommodate LeakyReLUs, not to achieve low training and/or validation error, as these are not the final weights used in our experiments.
Parameter                              Objectron                 ImageNetVid
Max Epochs                             50                        50
Batch Size                             32                        32
Selected Epoch                         40, 50, 50                50, 50, 50
Training Error at Selected Epoch       0.0490, 0.0504, 0.0488    0.1110, 0.1079, 0.1084
Validation Error at Selected Epoch     0.0380, 0.0240, 0.0380    0.2527, 0.2619, 0.2564
Table 5: Training and validation errors for networks trained with dropout. As expected, training errors are higher. Validation errors are lower than for standard networks on the more difficult ImageNetVid dataset, but comparable on the easy Objectron dataset, as is typically observed when training with dropout.

A.2 Explicit 3D Scene Reconstruction for Objectron

Since the Objectron dataset was originally created for 3D object tracking, the dataset contains a sparse 3D point cloud associated with each video frame, along with camera intrinsics. The points in the 3D point cloud are the output of a tracking algorithm, not a range sensor, making it an appropriate use case for VOICED [31], a depth-completion algorithm. Results reported in Table 1 are computed using a pretrained network trained on the VOID dataset distributed with the VOICED source code.

Once a 3D depth map is computed, we rotate and translate all points by a randomly sampled small rotation and translation. Rotations are roll, pitch, and yaw values sampled from a uniform distribution. Translations are sampled from a uniform distribution whose extent scales with the mean estimated depth in meters. Finally, the points are projected back onto the 2D image plane and the result is inpainted using standard techniques to fill in any gaps. Ultimately, we found that fine-tuning was not necessary to create realistic-looking reconstructions and reprojections. Example reprojections are shown in Figure 4.
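
A minimal numpy sketch of this perturb-and-reproject step is given below; it assumes a pinhole camera with known intrinsics and omits the inpainting stage, and the helper name and interface are our own:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def reproject(depth, K, rpy, t):
    """Map each pixel of an (H, W) depth map to its location in a view
    perturbed by a small rotation (roll/pitch/yaw, radians) and translation t.
    K is the 3x3 intrinsics matrix; gaps left by the warp would be inpainted."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x HW

    # Back-project to 3-D camera coordinates, then apply the perturbation.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = Rotation.from_euler("xyz", rpy).as_matrix() @ pts + np.reshape(t, (3, 1))

    # Project back onto the image plane of the perturbed camera.
    proj = K @ pts
    return (proj[:2] / proj[2:]).T.reshape(H, W, 2)                     # (u', v')
```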

3D depth completion, as opposed to the monocular depth estimation described in Section 2, is both easier and more realistic, since the reconstruction depends not only on the prior, but also on the sparse range measurements. The resulting performance should be considered an upper bound on what is achievable with monocular 3D reconstruction alone. The choice is opportunistic, since the Objectron dataset affords it and the reconstruction is more realistic than single-view 3D reconstruction.

Figure 4: Examples of reprojected Objectron images after 3D monocular depth completion. The left column contains anchor frames from the Objectron test set. The middle and right columns contain images generated through monocular depth completion, a random camera shift, and inpainting. We notice that there is no excessive distortion in the reprojected images, showing that 3D monocular depth estimation combined with small sampled camera rotation and translation can be used to impute scenes.

A.3 Explicit 3D Scene Reconstruction for ImageNetVid-Robust

We attempted to use state-of-the-art monocular 3D depth estimation [32] with small camera movements drawn from the same distributions as in the previous section. The depth-estimation network was pretrained on the KITTI dataset of outdoor driving scenes. Since the ImageNetVid-Robust dataset does not contain any camera intrinsics, we assumed that the center of the camera aperture was located in the middle of the frame and that the focal length was equal to 1. The results, shown in Figure 5, led us to conclude that 3D monocular depth estimation on ImageNetVid images is not possible without fine-tuning. Furthermore, the lack of any (sparse or dense) depth information or camera intrinsics would make fine-tuning extremely difficult.

Figure 5: Examples of reprojected ImageNetVid-Robust images after 3D monocular depth estimation. The left column contains anchor frames from the ImageNetVid-Robust dataset. Other images are generated through depth estimation, random small camera shifts, and reprojection. The presence of thick black borders around many ImageNetVid-Robust images, not present in any dataset used to train 3D monocular depth estimation networks, further lowers the quality of estimated depth.

A.4 GAN Ensembling for Objectron

We attempted to use a pretrained network from [7] to impute scenes from anchor frames in the Objectron test set. The results, shown in Figure 6, show that imputing scenes using a conditional GAN will require fine-tuning, which we defer to future work.

Figure 6: Examples of Objectron scenes imputed using a pretrained GAN. Original center crops from the Objectron test set are shown in the left column. Corresponding images generated using a pretrained GAN without any fine-tuning are in the right three columns.

A.5 Computational Resources

All experiments were performed on a workstation with the following specs:

  • CPU (8 cores): Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz

  • RAM: 64 GB

  • GPU 0: NVIDIA TITAN V, 12GB RAM

  • GPU 1: NVIDIA GeForce GTX 1080, 11GB RAM

  • Python 3.9

  • PyTorch 1.8

A.6 Objectron Figures

Analogs to Figures 1, 2, and 3 for the Objectron dataset are shown in Figures 7, 8, and 9. Compared to ImageNetVid, there is less variability in the predictions, but far more variability in the logits.

Figure 7: The predicted classes vary within a scene. A ResNet-50 trained on Objectron should ideally classify each frame of an Objectron video with the same class. Instead (left), the distance between the softmax vector of the anchor frame and those of the remaining frames shows significant spread, measured by the standard deviation as a function of the mean; the mean and standard deviation averaged over the dataset are computed for the exponentially-normalized (softmax) vectors. The “flip probability” [16] (center) measures how often the predicted class changes from frame to frame; the long tail of the histogram captures scene uncertainty. The plot on the right is a histogram of the percentage of frames within each scene that are classified differently from the mode class for the scene. Unlike the corresponding figure for ImageNetVid (Figure 1), these numbers are similar to the training, validation, and test errors, but only because the Objectron dataset is an easy classification problem and all errors are small.
(a) Data Augmentation
(b) Dropout
(c) Ensemble
(d) LQF
Figure 8: Wellington Posteriors vs. Empirical Paragon. For Objectron, for each of the proposed WPs, as well as the empirical paragon, we compute the mean and sample covariance of the discriminants (logits). We then measure the distance from each mean to the empirical one (abscissa), and the Frobenius norm of the difference between each covariance and the empirical paragon (ordinate), using a ResNet-50 trained on Objectron anchor frames (the middle frame of each video). The smaller and farther left the scatter, the closer the match to the empirical statistics. The posterior computed with LQF yields the closest approximation of the empirical posterior.
Figure 9: Distribution of distances between the empirical and predicted categorical distributions for ResNet-50 trained on Objectron anchor frames. For each scene, we compute the empirical class distribution (counting the occurrence of classes in the video) using all the frames in the scene and compare it against the predicted class distribution. Histogram (left) and box plot (right) of distances between the predicted and empirical class distributions. As indicated in Table 1 and compared to the corresponding ImageNetVid figures (Figure 3), all distributions have similarly low skew.

Appendix B Logit Covariance Using LQF

B.1 Computing Logit Covariance from Diagonal Weight Covariance

Calculating the logit covariance from the diagonal weight covariance using standard matrix multiplication requires more memory than is available on a typical workstation. Therefore, we compute the logit covariance column by column using the following procedure, which performs additional forward and backward passes per image. Please see the function getOutCov in our attached source code for more detail:

For each class, corresponding to one column of the logit covariance:

  • In the network object, set the relevant weight values to 0.

  • Compute a dummy output and a dummy loss (first forward pass), where the target is a one-hot encoding of the correct class.

  • Run a backward pass. PyTorch computes and stores the gradient of the dummy loss with respect to the weights; this gradient is the quantity we need, even though the dummy loss itself is of no interest.

  • Multiply the stored gradient values by the values in the diagonal approximation of the weight covariance (we accomplish this step using an implementation of a simple preconditioner that edits the values of stored gradients), and copy the result into the weight variables (we accomplish this using SGD's update step with a learning rate of 1.0).

  • Run a second forward pass to compute the output; this output is the desired column of the logit covariance.

The above procedure is complex and uses PyTorch in ways it was not meant to be used. Moreover, performing this many forward and backward passes per image, although computationally feasible, is still too slow to be practical. The results in Table 1 motivate future work on more practical ways to compute the logit covariance. Using our computational resources, computing it for all anchor images of the ImageNetVid test set requires about 3 hours with one GPU.
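For a Gaussian distribution over the weights of a linearized model, the induced logit covariance is J Σ_w Jᵀ, where J is the Jacobian of the logits with respect to the weights and Σ_w is diagonal. The sketch below computes this quantity directly with torch.autograd.grad for a model small enough to materialize J; it is a simplified stand-in, not the getOutCov implementation, and the names are ours.

```python
import torch

def logit_covariance(model, x, weight_var):
    """Sigma_z = J diag(weight_var) J^T for a single input x, where
    J = d(logits)/d(weights). Only feasible when J fits in memory; the
    column-by-column procedure above avoids materializing J."""
    params = [p for p in model.parameters() if p.requires_grad]
    logits = model(x.unsqueeze(0)).squeeze(0)            # shape (num_classes,)
    rows = []
    for k in range(logits.shape[0]):                     # one backward pass per class
        grads = torch.autograd.grad(logits[k], params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                                # shape (num_classes, num_weights)
    return (J * weight_var) @ J.t()                      # J diag(sigma_w^2) J^T

# Toy example: a linear classifier with a uniform diagonal weight covariance.
model = torch.nn.Linear(16, 5)
x = torch.randn(16)
num_weights = sum(p.numel() for p in model.parameters())
sigma_z = logit_covariance(model, x, 1e-4 * torch.ones(num_weights))
```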

B.2 Finding a Value for Weight Covariance

In our experiments, we set the weight covariance to the sample variance of the linear weights, computed from an ensemble of 20 trained LQF networks. However, it is also admissible to treat the weight covariance as a tuning parameter selected on a validation dataset; the only requirement is that it be a positive semi-definite matrix. For now, though, the computationally slow procedure for obtaining the logit covariance from the weight covariance described in the previous section makes the sample variance the most practical choice: each iteration of tuning requires hours, whereas training an ensemble of 20 LQF networks requires only hours and yields reasonable results.
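As a sketch of how such a diagonal sample variance can be obtained from an ensemble of checkpoints (the layer name is a placeholder, and restricting the covariance to the final linear layer is our assumption):

```python
import torch

def diagonal_weight_variance(state_dicts, layer_name="fc.weight"):
    """Per-weight sample variance across an ensemble of checkpoints.
    layer_name is a placeholder for whichever weights the diagonal
    covariance is defined over."""
    stacked = torch.stack([sd[layer_name].reshape(-1).float() for sd in state_dicts])
    return stacked.var(dim=0, unbiased=True)    # diagonal entries of the weight covariance

# Usage: pass the state dicts of the 20 trained LQF networks.
# weight_var = diagonal_weight_variance([torch.load(p) for p in checkpoint_paths])
```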

B.3 Transforming a Normal Distribution into Class Probabilities

Here, we show the steps to derive equation (9) from (6) when the logit covariance is a diagonal matrix. Without loss of generality, assume that we are computing the probability of the first class. Then, equation (6) is equivalent to the expression:

(10)

Next, introduce a change of variables. The above integral then becomes:

(11)

Using the identity

(12)

and the fact that for , equation (11) then becomes the formula in (9).
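As a sanity check on this closed form, the same quantity — the probability that a given class attains the maximum logit when the logits are Gaussian with diagonal covariance — can be estimated by Monte Carlo. The sketch below estimates that probability directly; it is not the formula in (9), and the names are ours.

```python
import torch

def class_probabilities_mc(mu, var, num_samples=100_000):
    """Monte Carlo estimate of P(argmax_k z_k = c) for z ~ N(mu, diag(var)),
    the quantity given in closed form by equation (9)."""
    z = mu + var.sqrt() * torch.randn(num_samples, mu.shape[0])
    counts = torch.bincount(z.argmax(dim=1), minlength=mu.shape[0])
    return counts.float() / num_samples

# Example: mean logits and per-logit variances for a five-class problem.
probs = class_probabilities_mc(torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0]),
                               torch.tensor([0.3, 0.2, 0.2, 0.1, 0.1]))
```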

Appendix C Pseudo-Uncertainties and Normalized Scene Entropy

This section details preliminary work that ruled out the softmax vector and other pseudo-uncertainties from our model hierarchy, as stated in Section 3. Let $p = (p_1, \dots, p_K)$ be the probability vector of the empirical categorical distribution over the $K$ classes. Then, the entropy of $p$ is

$$H(p) = -\sum_{i=1}^{K} p_i \log p_i. \qquad (13)$$

The maximum possible entropy for any classification problem with $K$ classes, attained by the uniform distribution, is

$$H_{\max} = \log K. \qquad (14)$$

Therefore, the normalized scene entropy

$$\bar{H}(p) = \frac{H(p)}{H_{\max}} = \frac{-\sum_{i=1}^{K} p_i \log p_i}{\log K} \qquad (15)$$

may be taken as one possible measure of uncertainty based on the scene distribution. Note that predicting scene entropy or normalized scene entropy is not the same as creating an accurate Wellington Posterior: two different categorical distributions may have the same entropy, so predicting normalized scene entropy is the easier problem. We evaluate predictions of normalized scene entropy using the scene uncertainty calibration error (SUCE), computed over the anchor frames of a dataset of videos. SUCE compares the normalized scene entropy computed using the empirical paragon against the normalized scene entropy estimated using information from a single anchor frame.

To estimate scene entropy from a single anchor frame, we consider the softmax vector, temperature-scaled softmax vectors, isotonic regression [33], and auxiliary networks. Summary results for both ImageNetVid and Objectron are shown in Table 6. More details on each method are given in the subsections below.
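For concreteness, a minimal sketch of the empirical normalized scene entropy that serves as the prediction target, computed from the per-frame predicted classes of one scene (variable names are ours):

```python
import math
import torch

def normalized_scene_entropy(frame_predictions, num_classes):
    """Normalized entropy (15) of the empirical class distribution (13)
    of one scene. frame_predictions: 1-D tensor of predicted class indices."""
    counts = torch.bincount(frame_predictions, minlength=num_classes).float()
    p = counts / counts.sum()                      # empirical categorical distribution
    nonzero = p[p > 0]                             # 0 log 0 is taken as 0
    entropy = -(nonzero * nonzero.log()).sum()
    return (entropy / math.log(num_classes)).item()

# Example: a 10-frame scene over 30 classes where one frame disagrees with the mode.
preds = torch.tensor([3] * 9 + [7])
print(normalized_scene_entropy(preds, num_classes=30))
```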

(a) Softmax Vector
(b) Temp. Scaling
(c) Isotonic - max softmax
(d) Isotonic - entropy
(e) Isotonic - energy
(f) Aux. Net - logits
(g) Aux. Net - softmax
(h) Aux. Net - embed.
Figure 10: Normalized scene entropy of ImageNetVid-Robust videos cannot be predicted from the anchor frame’s logits. In the plots above, each point represents a scene. The y-axis of each point is normalized scene entropy computed using the empirical paragon while the x-axis of each point is normalized scene entropy estimated from the output of a single forward pass over the anchor frame. If normalized scene entropy were predictable, then points would lie along the black line.
(a) Softmax Vector
(b) Temp. Scaling
(c) Isotonic - max softmax
(d) Isotonic - entropy
(e) Isotonic - energy
(f) Aux. Net - logits
(g) Aux. Net - softmax
(h) Aux. Net - embed.
Figure 11: Normalized scene entropy of Objectron videos cannot be predicted from the anchor frame’s logits. In the plots above, each point represents a scene. The y-axis of each point is normalized scene entropy computed using the empirical paragon while the x-axis of each point is normalized scene entropy estimated from the output of a single forward pass over the anchor frame. Compared to ImageNetVid-Robust (Figure 10), the graphs appear sparse because most frames are in the bottom-right corner. This is due to the fact that Objectron is a much easier dataset, and almost all frames of all videos are correctly classified, which makes a typically overconfident softmax vector close to correct. However, for videos for which empirical normalized scene entropy is nonzero, i.e. there is some variation in prediction, the methods above are not able to predict the amount of variation.
Method                                Objectron SUCE     ImageNetVid SUCE
Softmax Vector                        0.0317 ± 0.0025    0.1148 ± 0.0058
Temperature Scaling                   0.0347 ± 0.0023    0.2043 ± 0.0063
Isotonic Regression (max softmax)     0.0587 ± 0.0020    0.1268 ± 0.0024
Isotonic Regression (entropy)         0.0590 ± 0.0020    0.1266 ± 0.0026
Isotonic Regression (energy)          0.0579 ± 0.0026    0.1240 ± 0.0035
Auxiliary Networks (logit input)      0.0401 ± 0.0017    0.1081 ± 0.0453
Auxiliary Networks (softmax input)    0.0396 ± 0.0003    0.0589 ± 0.0040
Auxiliary Networks (embedding input)  0.1251 ± 0.0123    0.0854 ± 0.0102
Table 6: Normalized scene entropy cannot be predicted from a single image. Entries in the table are scene uncertainty calibration errors (SUCE), reported as mean ± standard deviation over three trials, and accompany Figures 10 and 11. Some methods improve SUCE for ImageNetVid-Robust videos, but the plots in Figure 10 show that this improvement still does not amount to an accurate estimate of scene uncertainty, because correlation coefficients between empirical and estimated normalized scene entropy remain low. The SUCE for Objectron videos appear low compared to those for ImageNetVid-Robust videos only because nearly all frames of most Objectron videos are correctly classified. The fact that no method meaningfully improves upon the typically overconfident softmax vector of a network trained with the cross-entropy loss shows that information about scene uncertainty cannot be obtained without imputation.

C.1 Temperature Scaling

Our temperature scaling procedure consisted of adding a single temperature parameter to each network and fine-tuning that parameter on the anchor frames of a validation dataset for 100 epochs using the cross-entropy loss and the Adam optimizer. For each trial, we kept the temperature from the epoch with the lowest cross-entropy loss. The temperatures selected for the three baseline networks were 1.1761, 1.0909, and 1.1715 on the Objectron dataset, and 1.3043, 1.3205, and 1.3121 on the ImageNetVid dataset.
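A minimal sketch of this fit on precomputed validation logits, assuming full-batch updates (one gradient step per "epoch"); the names and learning rate are ours:

```python
import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    """Single learnable temperature applied to frozen logits."""
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

def fit_temperature(logits, labels, epochs=100, lr=1e-3):
    """Fit the temperature on validation anchor-frame logits with
    cross-entropy and Adam, keeping the temperature with the lowest loss."""
    scaler = TemperatureScaler()
    optimizer = torch.optim.Adam(scaler.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_t, best_loss = 1.0, float("inf")
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(scaler(logits), labels)
        if loss.item() < best_loss:            # record the temperature that produced this loss
            best_loss, best_t = loss.item(), scaler.temperature.item()
        loss.backward()
        optimizer.step()
    return best_t
```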

C.2 Isotonic Regression

We isotonically mapped several simple statistics of the softmax vector (the maximum softmax value, entropy, and energy [21]) of the anchor frames of a validation dataset onto the empirical normalized scene entropy. Isotonic curves that reflect the results of Table 6 are shown in Figure 12.
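A sketch of one such mapping using scikit-learn's IsotonicRegression (which may differ from the implementation of [33]); the arrays below are dummy stand-ins for the per-scene validation statistics:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Dummy data stands in for the validation-set statistics.
rng = np.random.default_rng(0)
max_softmax = rng.uniform(0.2, 1.0, size=500)        # per-scene anchor-frame statistic
scene_entropy = rng.uniform(0.0, 1.0, size=500)      # per-scene empirical normalized entropy

iso = IsotonicRegression(increasing="auto", out_of_bounds="clip")
iso.fit(max_softmax, scene_entropy)
predicted_entropy = iso.predict(max_softmax)         # the learned isotonic map
```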

Figure 12: Isotonic curves correlate only weakly with the data. In the plots above, each blue dot corresponds to a scene. The x-axis is the value of a simple statistic of the anchor frame's softmax vector (maximum value, entropy, or energy) and the y-axis is the normalized scene entropy computed from the empirical paragon for that scene. Red lines are the learned isotonic functions that attempt to map the simple statistics to normalized scene entropy. The top row contains plots for the ImageNetVid validation set and the bottom row contains plots for the Objectron validation set. The red curves do capture some correlation between these simple statistics and empirical scene entropy. However, Table 6 shows that the isotonic curves do not generalize well and yield higher SUCE than estimates computed directly from the softmax vector.

C.3 Auxiliary Networks

We trained simple feedforward neural networks to map the softmax vector, logits, or embedding (we refer to the output of the second-to-last layer of ResNet-50 and ResNet-101 as the embedding) of an anchor frame to the scene entropy. Networks that mapped logits and softmax vectors had hidden layers with widths 512, 512, 256, 128, and 64; those that mapped embeddings had hidden layers with widths 4096, 2048, 1024, 512, and 256. Other hyperparameters, training error, and validation error are shown in Table 7. The loss function was the MSE loss between the output of the auxiliary network and the empirical scene uncertainty:

(16)

where the auxiliary network takes the logits, softmax vector, or embedding as input and a sigmoid is applied to its output. Results listed in Tables 6 and 7 show that auxiliary networks effectively overfit the training data but do not generalize to the validation or test datasets. This lack of generalization indicates that normalized scene entropy cannot be estimated using auxiliary networks.
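A minimal sketch of such an auxiliary network for logit or softmax input, with the hidden widths stated above, a sigmoid output, and the MSE loss; the activation choice and other details are our assumptions:

```python
import torch
import torch.nn as nn

class AuxiliaryEntropyNet(nn.Module):
    """Feedforward network mapping a logit or softmax vector to an estimate
    of normalized scene entropy in [0, 1]."""
    def __init__(self, in_dim, widths=(512, 512, 256, 128, 64)):
        super().__init__()
        layers, prev = [], in_dim
        for w in widths:
            layers += [nn.Linear(prev, w), nn.ReLU()]
            prev = w
        layers += [nn.Linear(prev, 1), nn.Sigmoid()]     # sigmoid keeps the output in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Training-step sketch: MSE between predicted and empirical normalized scene entropy.
aux = AuxiliaryEntropyNet(in_dim=30)                      # e.g. 30 ImageNetVid classes
optimizer = torch.optim.Adam(aux.parameters(), lr=1e-3, weight_decay=1e-4)
logits = torch.randn(64, 30)                              # dummy batch of anchor-frame logits
target = torch.rand(64)                                   # dummy empirical normalized entropies
loss = nn.MSELoss()(aux(logits), target)
loss.backward()
optimizer.step()
```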

                            Objectron                                               ImageNetVid
Parameter                   Logit Input       Softmax Input     Embed. Input        Logit Input       Softmax Input     Embed. Input
Weight Decay                1e-05             1e-05             1e-04               1e-04             1e-04             1e-05
Selected Epoch              50, 50, 50        50, 50, 50        70, 50, 50          30, 20, 20        60, 20, 30        30, 80, 70
Training Loss at Epoch      0.0025 ± 0.0002   0.0025 ± 0.0002   0.0029 ± 0.0002     0.0047 ± 0.0001   0.0086 ± 0.0068   0.0030 ± 0.0003
Validation Loss at Epoch    0.0051 ± 0.0010   0.0049 ± 0.0012   0.0062 ± 0.0015     0.0189 ± 0.0014   0.0123 ± 0.0064   0.0200 ± 0.0005
Table 7: Auxiliary networks do not generalize. Hyperparameters, training error, and validation error for three separate trials are shown above, after up to 100 epochs of Adam with a default initial learning rate of 1e-3; losses are reported as mean ± standard deviation over the three trials, and the selected epoch is listed for each trial. The simple feedforward auxiliary networks are able to fit the training data, but do not generalize well to the validation set, which is also reflected in the test-set results. These results further imply that the information required to predict normalized scene entropy, let alone the Wellington Posterior, is not contained in the output of a single network on a single image.