It appears that Deep Neural Networks (DNNs) can classify images as well as humans, at least as measured by common benchmark photo collections, yet small perturbations of the images can cause changes in the predicted class. Even excluding adversarial perturbations, simply classifying consecutive frames in a video shows variability inconsistent with the reported error rate (Fig. 1). So, how much should we trust image classifiers? How confident should we be of the outcome rendered on a given image? There is a substantial literature on uncertainty quantification, including work characterizing the (epistemic and aleatoric) uncertainty of trained classifiers (Sect. 1.1). Such uncertainty is a property of the classifier, not of the outcome of classifying a particular datum. We are interested in ascertaining how “confident” to be in the response of a particular DNN model to a particular image, not generally how well the classifier performs on images from a given class.
Say we have an image $x$, and a DNN that computes a discriminant vector $f(x)$ with as many components as the number of classes (e.g., logits or softmax), from which it returns the estimated label “cat.” How sure are we that there is a cat in this image? Even if the classifier was wrong most of the time on the class “cat,” so long as it is confident that its answer on this particular image is correct, we would be content. If faced with the question “are you sure?” a human would take a second look, or capture a new image, and either confirm or express doubt. But a DNN classifier would return the same answer, correct or not, since most real-world deep networks in use today are deterministic maps from the input $x$ to the output $y$.¹

¹ It is common, but incorrect, to use the value of the discriminant as a measure of confidence, since the Bayesian posterior is not a minimizer of the empirical risk.
Since the classifier is deterministic, and the image is given, the key question is with respect to what variability should uncertainty be evaluated. To address the key question, we introduce the Wellington Posterior (WP) of a deterministic image classifier, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene that produced the given image.
Simply put, there are no cats in images, only pixels. Cats are in the scene, of which we are given one or more images. The question of whether we are “sure” of the outcome of an image classifier is therefore of a counterfactual nature: Had we been given different images of the same scene, would an image-based classifier have returned the same outcome?
Formally, given an image $x$, if we could compute the posterior probability of the estimated label, $P(y|x)$, from which we can then measure confidence intervals, entropy, and other statistics commonly used to measure uncertainty, what we are after is not $P(y|x)$. Instead, it is $P(y|S)$, where $x \sim P(x|S)$ and $S$ is one of the infinitely many scenes that could have yielded the given image $x$. The Wellington Posterior is based on the distribution of images that could be generated by the unknown scene portrayed in the given image $x$. We call $P(y|S)$ the Wellington Posterior, in reference to the Duke of Wellington’s life endeavor to guess what is on the other side of the hill. Such a guess cannot be based directly on the given data, for one cannot see behind the hill, but on hypotheses induced by having seen behind other hills before. The scene $S$, whether real or virtual, is the vehicle that allows one to guess what is not known ($S$) from what is known ($x$). The crux of the matter in generating the Wellington Posterior, then, is to characterize the scene, which we discuss in Sect. 2. In particular, in Sect. 2.1 we discuss a hierarchy of models to be compared empirically in Sect. 3, with others in the Appendix. Our contributions, in relation to prior work, are discussed next.
1.1 Related Work and Contributions
Our work is related to a vast body of literature on inferring the uncertainty of a classifier, including Uncertainty Quantification (UQ) in Experimental Design and Bootstrapping. We focus specifically on deep neural networks, where uncertainty is typically attributed to the model, and only indirectly to the outcome of inference. For instance, Bayesian Neural Networks BNNBook ; BNN_survey ; bayesian_svi ; ll_svi produce not a single discriminant, but a distribution that captures epistemic uncertainty, from which one can obtain a distribution of outcomes on which to compute confidence or other uncertainty-quantifying statistics. Similarly, Dropout mc_dropout can be used at test time to generate ensemble outcomes from the same datum, and aggregate statistics to generate uncertainty estimates. These methods are seldom used in practice as they multiply the inference cost by the size of the ensemble, which is typically prohibitive.
Other works consider the effect on the output of (heteroscedastic) Gaussian noise in the input; loquercio_general_2020 combines gast_lightweight_2018 and mc_dropout into a single system to measure both epistemic and aleatoric uncertainty. Related in spirit to our approach is CLUE, which adds a layer of counterfactual questioning on top of a Bayesian neural network to quantify the sensitivity of uncertainty estimates to changes in the input.
Calibration approaches ignore any distinction between epistemic and aleatoric uncertainty and focus on quantifying the value $P(\hat{y} = y \mid c)$, where $c$ is the predicted confidence. Common approaches involve scaling the softmax vector to match the empirical distribution guo_calibration_2017 ; deep_ensembles . Bayesian approaches mc_dropout ; ll_svi ; bayesian_svi can also be used to produce confidence values, as in ovadia_can_2019 . The most common metric is the Expected Calibration Error (ECE), the average distance between the predicted confidence and the true accuracy of samples with the same, or similar, value of $c$, averaged over a test dataset of independent and identically distributed (i.i.d.) samples. See nixon_measuring_2019 for a detailed discussion of ECE and several variations. Another approach is based on conformal prediction angelopoulos_uncertainty_2020 ; barber_predictive_2021 , where one does not estimate uncertainty directly, but rather predicts a set of answers $C(x)$ such that $P(y \in C(x)) \geq 1 - \alpha$, where $\alpha$ is a design parameter. The number of elements in the set reflects the uncertainty of the network with respect to that particular input.
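As a concrete reference, binned ECE can be sketched as follows. This is a minimal sketch: the bin count and equal-width binning scheme are illustrative choices, and `confidences`/`correct` are hypothetical per-sample arrays, not quantities from our experiments.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |accuracy - mean confidence| per bin, weighted by bin mass."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n, ece = len(confidences), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # samples in this bin
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap
    return float(ece)
```

A perfectly calibrated batch (confidence matching empirical accuracy in every bin) yields zero ECE.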
Our work also relates to causality pearl2009causality in a very generic sense, due to the counterfactual nature of the Wellington Posterior. However, our work is significantly different both in the methods and in the underlying philosophy, as we focus on inductive inference outside an established system of truth.
Since we consider putative, or hypothetical, variations of the input, uncertainty of the classifier is related to sensitivity, or robustness, to the input. Therefore, our work also relates to efforts to analyze the robustness or sensitivity of classifiers to perturbations in the input, even if they do not explicitly address uncertainty. We are not interested in purposefully designed (e.g., adversarial) perturbations, but rather natural perturbations, for instance due to geometric (viewpoint) or photometric changes (illumination, weather, etc.): hendrycks_benchmarking_2018 introduces the CIFAR-C, ImageNet-C, and ImageNet-P benchmark test sets. The CIFAR-C and ImageNet-C test sets contain CIFAR-10 and ImageNet images corrupted with additive noise and blur. ImageNet-P contains short videos that were created by small amounts of panning and tilting from ImageNet images. ImageNetVid-Robust imagenetvidrobust is a test set of video frames for 30 classes from the 2015 ImageNet Object Tracking Challenge imagenetvid . Frames are selected so that the object of interest remains visible. Although similar in nature, there is more variability in ImageNetVid-Robust than in ImageNet-P. See manyfacesofrobustness for an overview. Proposed metrics for these benchmark datasets are functions of accuracy, including the Flip Probability for the ImageNet-P benchmark, i.e., the probability that the classification of consecutive frames differs. We use it as empirical ground truth to measure the uncertainty of the classifier with respect to real, not hypothetical, different images of the same scene.
Unlike the work described above, ours does not propose a new measure of uncertainty nor a new way to calibrate the discriminant to match empirical statistics: We use standard statistics computed from the “posterior probability,” such as covariance and entropy, to measure uncertainty. The core of our work aims to specify what posterior to use to measure uncertainty. Since deep network image classifiers used in the real world are deterministic, the choice is consequential, and yet seldom addressed explicitly in the existing literature. The Wellington Posterior is introduced to explicitly characterize the variability with respect to which uncertainty is measured. This is our first contribution (Sect. 2).
The Wellington Posterior is not general and does not apply to any data type: it is specific to images of natural scenes. No matter how many images we are given at inference time, there are infinitely many different scenes that are compatible with them, in the sense of possibly having generated them, including real and virtual scenes. Therefore, there are many different ways of constructing the Wellington Posterior. It is not our intention to give an exhaustive account of all possible ways, and it is beyond the scope of this paper to determine which is the “best” method, for that depends on the application (e.g., closed-loop operation vs. batch post-processing), the available data (e.g., a large static dataset vs. a simulator), the run-time constraints, etc. Our second contribution is to illustrate possible ways of computing the Wellington Posterior using a hierarchy of models (Sect. 2.1), including direct manipulation of the given images, with no knowledge of the scene, and perturbations of the classifier around a model pre-trained on different scenes. These are two extremes of the modeling spectrum. In between, one can infer a 3D model from the given image with any of the available single-view reconstruction methods (see fu2021single for a recent survey), and then use that to synthesize new images, or directly synthesize images with a GAN (see clark2019adversarial and references therein).
The third contribution is methodological: we show that the Wellington Posterior can be derived without sampling input images, but rather by looking at the effect of perturbations of the input images on the parameters of the model, and imputing a distribution on the latter using a closed-form linearization of the deep network around a pre-trained point (Sect. 2.2).
Figure 1: The distance between the softmax vectors of the anchor frame and those of the remaining frames shows significant spread, measured by the standard deviation as a function of the mean (left). The “flip probability” hendrycks_benchmarking_2018 (center) measures how often the predicted class changes from frame to frame; the long tail of the histogram captures scene uncertainty. The mean and standard deviation averaged over the dataset are computed for the exponentially-normalized (softmax) vectors. Finally, there is also a spread in the percentage of frames within a scene classified differently from the mode class (right). None of these numbers are reflective of the training error, validation error, or test error recorded in Appendix A.1.
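The flip probability of hendrycks_benchmarking_2018 reduces to counting disagreements between consecutive per-frame predictions; a minimal sketch (the function name is ours):

```python
import numpy as np

def flip_probability(preds):
    """Fraction of consecutive-frame pairs whose predicted class differs."""
    preds = np.asarray(preds)
    if len(preds) < 2:
        return 0.0
    # compare each frame's prediction to the previous frame's
    return float(np.mean(preds[1:] != preds[:-1]))
```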
2 The Wellington Posterior

We start by introducing the nomenclature used throughout the rest of the paper:
– The scene $S$ is an abstraction of the physical world. It can be thought of as a sample from some distribution $P(S)$ that is unknown and arguably unknowable. The scene itself (not just its distribution) is arguably unknowable (the subject of Physics), but for some of its “attributes.”
– An attribute is a characteristic of the scene that belongs to a finite set (e.g., names of objects), $y \in \{1, \dots, K\}$. Note that there can be many scenes that share the same attribute(s) (intrinsic variability). For instance, $y$ can be the label “cat” and $P(S|y)$ is the distribution of scenes that contain a “cat.” Continuous, but finitely-parametrized, attributes are also possible, for instance related to shape or illumination.
– Extrinsic variability is an unknown transformation $g$ of the scene that changes its manifestation (measurements, see next point) but not its attributes. It can be thought of as a sample from some nuisance distribution $P(g)$. For instance, extrinsic variability² could be due to the vantage point of the camera, the illumination, partial occlusion, sensor noise, quantization, etc., none of which depends on whether the scene is labeled “cat”.

² Note that there can be spurious correlations between the attribute and nuisance variability: An indoor scene is more likely to contain a cat than a beach scene. Nevertheless, if there were a cat on the beach, we would want our classifier to say so with confidence. The fact that nuisance variables can correlate with attributes on a given dataset may engender confusion between intrinsic and extrinsic variability. To be clear, phenomena that generate intrinsic variability would not exist in the absence of the attribute of interest. The pose, color, and shape of a cat do not exist without the cat. Conversely, ambient illumination (indoor vs. outdoor) exists regardless of whether there is a cat, even if correlated with its presence. The effect of nuisance variability on confidence, unlike intrinsic variability, is one of correlational, rather than causal, dependence.
– A measurement $x$ is a known function of both the scene (and therefore its attributes) and the nuisances. We will assume that there is a generative model $h$ that, if the scene was known, and if the nuisances were known, would yield a measurement up to some residual (white, zero-mean, homoscedastic Gaussian) noise $n$:
$$x = h(S, g) + n, \quad n \sim \mathcal{N}(0, \sigma^2 I).$$
For example, $h$ can be thought of as a graphics engine where all variables on the right hand-side are given.
– The discriminant $f(x)$ is a deterministic function of the measurement that can be used to infer some attributes of the scene. For instance, $f(x) = P(y|x)$ is the Bayesian discriminant (posterior probability). More generally, $f(x)$ could be any element of a vector (embedding) space.
– The estimated class $\hat{y}$ is the outcome of a classifier, for instance $\hat{y} = \arg\max_k f_k(x)$. Given an image $x$ and a classifier $f$, we reduce questions of confidence and uncertainty to the posterior probability of the estimated class. In the absence of any variability in the estimator $\hat{y}$, defining uncertainty in the estimate requires assuming some kind of variability. The Wellington Posterior hinges on the following assumptions:
The class $y$ is an attribute of the scene and is independent of intrinsic and extrinsic variability, by their definition.
We posit that, when asking “how confident are we about the class $\hat{y}$,” we do not refer to the uncertainty of the class given that image, which is zero. Instead, we refer to the uncertainty of the estimated class with respect to the variability of all possible images of the same scene, which could have been obtained by changing nuisance (extrinsic) variability.
In other words, if in response to an image, a classifier returns the label “cat,” the question is not how sure to be about whether there is a cat in the image. The question is how sure to be that there is a cat in the scene portrayed by this image. For instance, if instead of the given image, one was given a slightly different one, captured slightly earlier or a little later, and the classifier returned “dog,” would one be less confident in the answer than if it had also returned “cat”? Intuitively, yes. Hypothetical repeated trials would involve not running the same image through the classifier over and over, but capturing different images of the same scene, and running each through the classifier.
Intrinsic variability does not figure in the definition of the Wellington Posterior. The fact that we are given one image implies that we are interested in one scene, the one portrayed in the image. Even though there are infinitely many scenes compatible with it, the given image defines an equivalence class of extrinsic variability. So, the question of how sure we are of the answer “cat” given an image is not how frequently the classifier correctly returns the label “cat” on different images of different scenes that contain different cats. It is a question about the particular scene portrayed by the image we are given, with the given cat in it. The goal is not $P(\hat{y} \mid y)$, which would be how frequently we say “cat” when there is one (in some scene). We are interested in this scene, the one portrayed by the image. Written as a Markov chain, we have
$$y \rightarrow S \rightarrow x \rightarrow f(x) \rightarrow \hat{y}$$
where the first arrow includes intrinsic variability (a particular attribute is shared by many scenes) and the second arrow includes nuisance/extrinsic variability (a particular scene can generate infinitely many images). The last two arrows are deterministic. We are interested only in the variability in the second arrow. To compute the Wellington Posterior, we observe that
$$P(\hat{y} = k \mid S) = \mathbb{E}_{g \sim P(g)}\!\left[\mathbb{1}\{\hat{y}(h(S, g)) = k\}\right] \approx \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\{\hat{y}(x_i) = k\}, \quad x_i = h(S, g_i).$$
That is, given samples from the nuisance variability $g_i \sim P(g)$, or sample images generated by changing nuisance variability, $x_i = h(S, g_i)$, we can compute the probability of a particular label by counting the frequency of that label in response to different nuisance variability. We defer the question of whether the samples given are fair or sufficiently exciting. In the expression above, $\hat{y}$ is computed by the given DNN classifier, $g$ is from a chosen class of nuisance transformations, and $h$ is an image formation model that is also chosen by the designer of the experiment. What remains to be determined is how to create a “scene” from the given image $x$.
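The counting estimate above can be sketched as follows, with `render` standing in for the image formation model $h(S, \cdot)$ and `classify` for the deterministic DNN classifier; both are hypothetical placeholders, not our actual implementation.

```python
import numpy as np

def wellington_posterior(render, classify, nuisances, n_classes):
    """Estimate P(yhat = k | S) by counting predicted labels over
    nuisance-perturbed renderings of the scene.

    render(g):   image formation, standing in for h(S, g)
    classify(x): the deterministic DNN classifier, returning a class index
    nuisances:   samples g_i from the chosen nuisance distribution P(g)
    """
    counts = np.zeros(n_classes)
    for g in nuisances:
        counts[classify(render(g))] += 1.0
    return counts / counts.sum()  # label frequencies
```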
As we defined it, the scene is an abstraction of the physical world. Such an abstraction can live inside the memory of a computer. Since a scene is only observed through images of it, if a synthetic scene generates images that are indistinguishable from those captured of a physical scene, the real and synthetic scenes – while entirely different objects – are equivalent for the purpose of computing the Wellington Posterior. Thus a “scene” could be any generative model that can produce images that are indistinguishable from real ones, including the given one. Different images are then obtained by perturbing the scene with nuisance variability.
Depending on how we measure “indistinguishable,” and how sophisticated a class of nuisance variables we can conceive of, we have a hierarchy of increasingly complex models. In addition, depending on how broadly we sample nuisance variables, that is, depending on $P(g)$, we have a more or less representative sample from the Wellington Posterior. Below we outline the modeling choices tested in Sect. 3 and in the appendix.
2.1 Modeling Hierarchy
The simplest model we consider interprets the scene as a flat plane on which the given image is painted. Correspondingly, different images can be generated from $x$ via data augmentation, which consists of typically small group transformations of the image itself: $x_i = g_i(x)$, where $g_i \in G$, the group of diffeomorphisms of the domain of the image, approximated by local affine planar transformations, composed with affine transformations of the range space of the image, also known as contrast transformations. These include small translations, rotations, and rescalings of the image, as well as changes of the colormap. In addition to affine domain and range transformations, one can also add i.i.d. Gaussian noise and paste small objects into the image, to simulate occlusion nuisances. Data augmentation is typically used to train classifiers to be robust, or insensitive, to the class of chosen transformations, which corresponds to a rudimentary model of the scene, so we chose it as the baseline in our experiments in Sect. 3.
Explicit 3D scene reconstruction.
The model of the scene implicit in data augmentation does not take into account parallax, occlusions, and other fundamental phenomena of image formation. One step more general, we could use knowledge of the shape and topology of other scenes, manifest in a training set, to predict (one or more) scenes from a single image. Since there are infinitely many, the one(s) we predict is a function of the inductive biases implicit and explicit in the method. There are hundreds of single-view reconstruction methods, including the so-called “photo pop-up” hoiem2005automatic used for digital refocusing and computational photography, well beyond what we can survey in this paper. Our limited experiments show performance similar to data augmentation. Formally, the reconstruction is a scene compatible with the given image $x$, represented as a distribution that depends on a dataset of scenes and corresponding images. These can be obtained through some ground-truth mechanism, for instance a range sensor or a simulation engine.
Conditional Generative Adversarial Networks.
Finally, Conditional Generative Adversarial Networks, which generate other similar images or videos conditioned on an image, may also be a source of imputation. In particular, cGAN was designed for data augmentation.
2.2 Analytical Posterior through Linearization.
All the models above aim to generate a sample from the distribution $P(x|S)$. We showed a few examples of possible methods for generating a scene and sampling from it, none viable in practice since they would require explicit sampling. Instead, we seek analytical expressions of the Wellington Posterior that do not require multiple inference runs. Recent developments in network linearization suggest that it is possible to perturb the weights of a trained network locally to perform novel tasks essentially as well as non-linear optimization/fine-tuning achille_lqf_2020 . Such linearization, called LQF, is with respect to perturbations of the weights of the model. Therefore, we use the closed-form analytical Jacobian of the linearized network to compute first and second-order statistics (mean and covariance) of the discriminant, without the need for sampling. The first-order approximation of the network around the pre-trained weights $w_0$ is
$$f_w(x) \approx f_{w_0}(x) + \nabla_w f_{w_0}(x) \, (w - w_0). \quad (4)$$
In equation (4), the optimal weight perturbation has a closed-form solution that is a function of the training data. Then, equation (4) can be used to define a stochastic network with weights distributed according to $w \sim \mathcal{N}(w_0, \Sigma_w)$ for any desired covariance $\Sigma_w$. Correspondingly, the discriminant and logits can be assumed to be distributed as
$$f_w(x) \sim \mathcal{N}\!\left(f_{w_0}(x), \; J \Sigma_w J^\top\right), \quad J = \nabla_w f_{w_0}(x). \quad (5)$$
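Under the Gaussian weight perturbation in equation (5), the logit covariance follows by propagating $\Sigma_w$ through the Jacobian; a minimal numpy sketch (the function name is ours, and in practice $J$ would be the per-input Jacobian of the linearized network):

```python
import numpy as np

def logit_distribution(f0, J, Sigma_w):
    """Mean and covariance of the logits under equation (5):
    f_w(x) ~ N(f_{w0}(x), J Sigma_w J^T), with J the Jacobian of the
    linearized network with respect to the weights."""
    J = np.asarray(J, dtype=float)
    mean = np.asarray(f0, dtype=float)
    cov = J @ np.asarray(Sigma_w, dtype=float) @ J.T  # propagate weight covariance
    return mean, cov
```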
Given a diagonal or block-diagonal value of $\Sigma_w$, we compute $J \Sigma_w J^\top$ using forward and backward passes per input image over the neural network. This is not a fast computation, but it requires far fewer resources than a naive implementation. More implementation details are given in Appendix B.
Next, given $f(x) \sim \mathcal{N}(\mu, \Sigma)$, the probability that class $k$ is the predicted class is given by:
$$P(\hat{y} = k) = P\!\left(f_k(x) > f_j(x), \; \forall\, j \neq k\right). \quad (6)$$
The above is a $K$-dimensional integral. The lower limit of integration in every dimension is $-\infty$ while the upper limit is variable. For a diagonal $\Sigma$, with $\mu_j$ and $\sigma_j^2$ the mean and variance of $f_j(x)$, equation (6) is:
$$P(\hat{y} = k) = \int_{-\infty}^{\infty} p_k(t) \prod_{j \neq k} \Phi\!\left(\frac{t - \mu_j}{\sigma_j}\right) dt, \quad (9)$$
where $p_k$ is the density of $f_k(x)$:
$$p_k(t) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{(t - \mu_k)^2}{2\sigma_k^2}\right),$$
where $\sigma_k^2$ is the $k$-th element along the diagonal of $\Sigma$ and $\Phi$ is the standard normal cumulative distribution function. The full derivation is given in Appendix B. Equation (9) has no closed form that we know of, but can be numerically approximated using a quadrature method.
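A one-dimensional integral of this form can be approximated with simple trapezoidal quadrature; the sketch below assumes a diagonal covariance, and the integration window of ±12 standard deviations and the number of nodes are illustrative choices, not the settings used in our experiments.

```python
import numpy as np
from math import erf, exp, pi, sqrt

def phi(t):
    """Standard normal density."""
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)

def Phi(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

def argmax_probability(mu, sigma, k, half_width=12.0, n=4001):
    """Trapezoidal approximation of P(argmax_j f_j = k) for
    independent f_j ~ N(mu_j, sigma_j^2)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # integrate over a wide window centered on component k
    ts = np.linspace(mu[k] - half_width * sigma[k],
                     mu[k] + half_width * sigma[k], n)
    vals = np.empty(n)
    for i, t in enumerate(ts):
        dens = phi((t - mu[k]) / sigma[k]) / sigma[k]  # p_k(t)
        prod = 1.0
        for j in range(len(mu)):
            if j != k:
                prod *= Phi((t - mu[j]) / sigma[j])    # CDF factors
        vals[i] = dens * prod
    dt = ts[1] - ts[0]
    # trapezoidal rule: full weight on interior nodes, half on endpoints
    return float(dt * (vals.sum() - 0.5 * (vals[0] + vals[-1])))
```

For two classes with identical means and variances the result should be 0.5, and the probabilities over all classes should sum to one.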
For values of $\Sigma$ that are not well-approximated by their diagonals, statistics of the distribution in equation (5) can be approximated through sampling and compared to the empirical paragon in Section 3. This is, of course, not ideal, but computationally fast.
Figure: Distances between the predicted and empirical class distributions. LQF has the lowest skew.
3 Experiments

We performed experiments on two datasets: Objectron objectron2021 and the ImageNetVid and ImageNetVid-Robust datasets described in Sect. 1.1. Objectron contains short video clips of scenes with 9 classes: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes. We split it into a training and a validation set, with an equal number of videos of each class in each split. Then, we selected a subset of 5000 videos for the training set, 500 videos for the validation set, and 500 for testing.
In order to compute the empirical posterior and the various forms of Wellington Posterior, we need a discriminant function $f$ from which to build a classifier. We use the backbone of an ImageNet-pretrained ResNet-50 or ResNet-101 resnet , where $f(x)$ is called the vector of logits, whose maximizer is the selected class, and whose normalized exponential is called the softmax vector. With an abuse of notation, when clear from the context, we refer to $f(x)$ as both the vector of logits and the softmax. Fine-tuning and validation for ResNet-50 on Objectron, and ResNet-101 on ImageNetVid, used one frame from each scene. The focus is not to achieve the highest possible accuracy, but to provide a meaningful estimate of uncertainty relative to the variability of different images of the same scene. For this reason, we select the most common, not the highest performing, baseline classifier. We achieve > 95% validation accuracy for Objectron and 75% validation accuracy for ImageNetVid.
Ground Truth and Empirical Baseline.
Validating the Wellington Posterior estimated from a single image requires a ground-truth distribution. This can be obtained empirically by computing different discriminants, and correspondingly the distribution of different outcomes, from an actual sample of different images of the same scene. For this purpose, we use short videos from the Objectron and ImageNetVid-Robust datasets. Such short videos are not a fair sample from the population of all possible images that the scene could have generated, so the resulting statistics cannot be considered proper ground truth. However, the video represents an empirical baseline in line with the definition of the Wellington Posterior, so we adopt it as a paragon for evaluation. For each frame, we compute the discriminant vector of logits, its softmax-normalized version, and the highest-scoring hypothesis. From these, we can compute summary statistics such as the frequency of each label in the categorical distribution, or the mean and covariance of the logits as if coming from a Gaussian distribution in $K$-dimensional space. Our experiments focus on estimating the mean and covariance of the logits and the frequencies of each label in the categorical distribution.
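The per-scene summary statistics of the empirical baseline can be sketched as follows (a hypothetical helper, where `logits` is an $n_{\text{frames}} \times K$ array of per-frame logits):

```python
import numpy as np

def empirical_baseline(logits):
    """Per-scene summary statistics from an (n_frames x K) array of logits:
    mean logit, logit covariance, and top-1 label frequencies."""
    logits = np.asarray(logits, dtype=float)
    n_frames, n_classes = logits.shape
    preds = logits.argmax(axis=1)                             # per-frame top-1
    freq = np.bincount(preds, minlength=n_classes) / n_frames # categorical dist.
    mean = logits.mean(axis=0)
    cov = np.cov(logits, rowvar=False)                        # K x K covariance
    return mean, cov, freq
```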
Table 1: Distances between predicted and empirical statistics. Columns 2–4 refer to Objectron (ResNet-50), columns 5–7 to ImageNetVid (ResNet-101); entries are reported as mean ± deviation.

| Method | Categorical Dist. | Logit Mean | Logit Covariance | Categorical Dist. | Logit Mean | Logit Covariance |
|---|---|---|---|---|---|---|
| Data Augmentation | 0.0789 ± 0.0038 | 5.4253 ± 0.1225 | 15.6404 ± 0.1681 | 0.2300 ± 0.0123 | 3.9731 ± 0.1063 | 9.0627 ± 0.4604 |
| Dropout | 0.0512 ± 0.0048 | 3.7304 ± 0.0743 | 12.8472 ± 0.6759 | 0.1746 ± 0.0042 | 3.7304 ± 0.0743 | 12.8472 ± 0.6759 |
| Deep Ensembles | 0.0575 ± 0.0028 | 5.6469 ± 0.1106 | 19.2226 ± 0.2474 | 0.2160 ± 0.0102 | 4.4732 ± 0.1331 | 19.2226 ± 0.2474 |
| LQF | 0.0579 ± 0.0034 | 3.2624 ± 0.0293 | 7.3136 ± 0.2869 | 0.1220 ± 0.0029 | 3.2624 ± 0.0293 | 7.3136 ± 0.2869 |
| Monocular Depth Est. | 0.0825 ± 0.0075 | 5.0966 ± 0.0634 | 16.0126 ± 0.3195 | - | - | - |
We use three metrics to validate predicted Wellington Posteriors against the empirical distribution computed using multiple frames per video: the distance between the mean logit and the empirical mean logit for each scene, the Frobenius distance between the sample covariances of the empirical and predicted logits, and the distance between the probability vectors of the empirical and predicted categorical distributions. These metrics capture first- and second-order differences between the WP and the empirical distribution of the logits, and a first-order difference in the distribution of the predictions.
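A sketch of the three metrics (the function name is ours; the specific norms shown — Euclidean for the mean logits, Frobenius for the covariances, and an absolute-difference sum for the categorical distributions — are illustrative choices):

```python
import numpy as np

def wp_metrics(mean_p, cov_p, cat_p, mean_e, cov_e, cat_e):
    """Distances between predicted (WP) and empirical statistics:
    mean-logit distance, covariance Frobenius distance, categorical distance."""
    d_mean = float(np.linalg.norm(np.asarray(mean_p) - np.asarray(mean_e)))
    d_cov = float(np.linalg.norm(np.asarray(cov_p) - np.asarray(cov_e), ord='fro'))
    d_cat = float(np.abs(np.asarray(cat_p) - np.asarray(cat_e)).sum())
    return d_mean, d_cov, d_cat
```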
Common Pseudo-Uncertainty and Confidence Scores.
As we pointed out in Footnote 1, it is common to use the values of the softmax vector as a proxy for uncertainty. In some cases, the vector is “temperature-scaled” guo_calibration_2017 ; in other cases, the entropy of the softmax vector or the energy liu_energy-based_2020 is used as a proxy for uncertainty. Additional experiments on isotonic calibration and the use of auxiliary networks, detailed in Appendix C, indicate that there is no strong connection between these pseudo-uncertainties and the empirical posterior.
To compute results for the simple baseline, we use the Albumentations imaging library buslaev_albumentations_2020 to perform data augmentation using the following procedure:
Horizontal flip with probability 0.5
Shift by up to 20 percent of image size in both horizontal and vertical directions
Scale randomly selected from
Safe rotation randomly selected from degrees
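The flip-and-shift portion of such a pipeline can be sketched in plain numpy as follows; this is a simplified stand-in, not the actual Albumentations calls, and zero-padding at the borders is an illustrative choice.

```python
import numpy as np

def augment(image, rng, shift_frac=0.2, p_flip=0.5):
    """One nuisance sample: random horizontal flip, then a random integer
    shift of up to shift_frac of each dimension, zero-padded at the borders."""
    out = image.copy()
    if rng.random() < p_flip:
        out = out[:, ::-1]                      # horizontal flip
    h, w = out.shape[:2]
    dy = int(rng.integers(-int(shift_frac * h), int(shift_frac * h) + 1))
    dx = int(rng.integers(-int(shift_frac * w), int(shift_frac * w) + 1))
    shifted = np.zeros_like(out)
    # copy the overlapping region between the shifted and original frames
    dst_y = slice(max(dy, 0), h + min(dy, 0))
    dst_x = slice(max(dx, 0), w + min(dx, 0))
    src_y = slice(max(-dy, 0), h + min(-dy, 0))
    src_x = slice(max(-dx, 0), w + min(-dx, 0))
    shifted[dst_y, dst_x] = out[src_y, src_x]
    return shifted
```

With the shift range and flip probability set to zero, the function reduces to the identity.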
Small perturbations of the image are dual to small perturbations of the weights in the first layer, so one could generate a posterior estimate over outcomes to a single image by considering variability of the classifier. Bayesian Neural Networks and VAEs kingma2019introduction operate on this principle, but are not in use as image classifiers due to their lower performance and higher cost. As a representative of this class of methods, we consider ensembles of 20 neural networks, formed by withholding a random 5% of the training data and through random shuffling of the data. The Wellington Posterior generated from an ensemble is the distribution of class predictions among the 20 members’ response to a single frame, shown in Fig. 2 (c). We also use a (pseudo)-ensemble generated by Monte-Carlo Dropout in Fig. 2 (b).
Single-view 3D reconstruction.
One step beyond data augmentation is to manipulate (a coarse reconstruction of) the scene, rather than the image directly. We use VOICED to generate a disparity map, and thence a depth map, from a single image, based on a training set of images and corresponding ground-truth depth obtained from a range sensor.³ This yields a point estimate of the scene, which depends on the dataset used for training the single-view reconstruction model. We then apply small spatial deformations of the viewpoint and render images through standard texture-mapping, including modulation of the range space with small contrast transformations of the color map. Performance of this approach relative to the baseline is reported in Table 1 for the Objectron dataset. Existing off-the-shelf single-view reconstruction methods did not yield sensible images on ImageNetVid-Robust; example results are shown in Appendix A.3.

³ Objectron is a 3D tracking dataset and VOICED is a depth completion network, so we leveraged the sparse depth measurements associated with the anchor frame to produce a depth map, but no other frames. We chose to make this slight deviation from our model hierarchy because it produced realistic images without extensive fine-tuning experiments; this experiment should be considered an upper bound on the performance of single-view 3D reconstruction for Objectron. More details are given in Appendix A.2.
Conditional Generative Adversarial Networks.
State-of-the-art conditional GANs do not generalize well to out-of-distribution data. Appendix A.4 contains some examples of images generated by a pretrained network in cGAN conditioned on Objectron anchor images. Since extensive experiments are beyond our scope here, we do not investigate GANs further.
Implicit Local Generation.
The viability of the method we propose to impute uncertainty to a deterministic classifier hinges on the ability to produce an estimate without having to sample multiple inference runs at test time. Recent work on model linearization around a pre-trained point achille_lqf_2020 has shown that it is possible to obtain performance comparable to that of full-network non-linear fine-tuning. In this sense, LQF can be used as a baseline classifier instead of the pre-trained network. The main advantage is that LQF allows explicit computation of the covariance of the discriminant without the need to sample multiple input data. The covariance of the discriminant is computed from (5), with $\Sigma_w$ a design parameter, which we choose to match the distribution generated by the ensemble. We then compute the distribution of $f(x)$ in (5) using forward and backward passes, and a categorical distribution using equation (9). The WP generated using LQF is compared to the empirical paragon in Fig. 2 (d). Quantitative results are summarized in Table 1.
Our method has several limitations. First, there is no "right" model, so evaluating the WP, and even choosing a suitable "ground truth," presents challenges. Our empirical analysis is limited by the availability of datasets in the public domain, none of which comprises a fair sample of the distribution of images of the same scene. Nonetheless, 20 frames of video provide better support for uncertainty validation than a single image. A better approach would be a fully controlled, realistic simulator, in combination with more limited sets of real data.
Second, for sampling-based methods, we have no way of ascertaining whether the sample generated from the (implicit or explicit) scene is sufficiently exciting, in the sense of exciting all the modes of variation of the Wellington Posterior. The only evidence we have thus far is that such variability is better aligned with the empirical distribution than simple image-based data augmentation. More importantly, sampling-based approaches are prohibitive in reality, especially for closed-loop operation, which is where uncertainty estimates are most important.
Third, the only approach for non-sampling approximation of the WP is based on linearization. While models beyond linear ones are unlikely to be tractable, there may still be better models for inferring second-order statistics, either directly from the primary trained network, as done in Variational Autoencoders, or through an auxiliary network.
The weakest aspect of our modeling framework is the limitation of some models to small nuisance perturbations. This is mildly relaxed for single-view 3D reconstruction (scene transformations beyond small ones create gaps and holes in the reconstruction), and less so for CPNs and GANs. Again, the linearized model is constrained to transformations of the images induced by small perturbations in the weights. This essentially guarantees that the Wellington Posterior will not cover the domain of the true posterior.
We have provided an initial empirical evaluation, but thorough and rigorous testing is beyond the scope of this paper and will have to be performed in an application-specific manner. For example, evaluating the uncertainty of a label associated with an image, to decide whether to feed that image to a human annotator, is an entirely different problem than evaluating uncertainty for the purpose of deciding whether to drive through a city street. We have focused our analysis on the extremes: a baseline based on data augmentation, ensembles, and an analytical characterization of the WP. More methods based on explicit generation, such as CPNs, GANs, or single-view 3D reconstruction, can be further explored. However, all these methods require explicit sampling of images and repeated inference runs to generate the empirical statistics, which is unfeasible in real-world applications.
Inspection of the Markov chain in Sect. 1 suggests that the Wellington Posterior goes against the Data Processing Inequality: the reconstructed scene, being a function of the given image, contains no more information than the image itself, including about the uncertainty of the outcome. This, however, is not the case, as there is inductive bias in the way the scene is constructed, which typically involves additional information manifest in the training set or in the design of the nuisance variability used in the rendering process. In a sense, the Wellington Posterior attributes an uncertainty to the current image by associating it with scenes, other than the present one, that might have yielded images similar to the one given, sometime in the past, as represented by the training set. Beyond the additional information from the training set, the Wellington Posterior acts as a regularization mechanism by exploiting regularities from different scenes to hypothesize variability of images of the present one.
-  Alessandro Achille, Aditya Golatkar, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. LQF: Linear Quadratic Fine-Tuning. arXiv:2012.11140 [cs, stat], December 2020. arXiv: 2012.11140.
-  Adel Ahmadyan, Liangkai Zhang, Artsiom Ablavatski, Jianing Wei, and Matthias Grundmann. Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In CVPR, 2021.
-  Anastasios Nikolas Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty Sets for Image Classifiers using Conformal Prediction. September 2020.
-  Javier Antoran, Umang Bhatt, Tameem Adel, Adrian Weller, and José Miguel Hernández-Lobato. Getting a CLUE: A Method for Explaining Uncertainty Estimates. September 2020.
-  Rina Foygel Barber, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. Predictive inference with the jackknife+. The Annals of Statistics, 49(1):486–507, February 2021. Publisher: Institute of Mathematical Statistics.
-  Alexander Buslaev, Alex Parinov, Eugene Khvedchenya, Vladimir I. Iglovikov, and Alexandr A. Kalinin. Albumentations: fast and flexible image augmentations. Information, 11(2):125, February 2020. arXiv: 1809.06839.
-  Lucy Chai, Jun-Yan Zhu, Eli Shechtman, Phillip Isola, and Richard Zhang. Ensembling with deep generative views. In CVPR, 2021.
-  Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
-  Kui Fu, Jiansheng Peng, Qiwen He, and Hanxiao Zhang. Single image 3D object reconstruction based on deep learning: A review. Multimedia Tools and Applications, 80(1):463–498, 2021.
-  Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning, pages 1050–1059, June 2016.
-  Jochen Gast and Stefan Roth. Lightweight Probabilistic Deep Networks. pages 3369–3378, 2018.
-  Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 1321–1330. JMLR.org, 2017. event-place: Sydney, NSW, Australia.
-  Fredrik K. Gustafsson, Martin Danelljan, and Thomas B. Schon. Evaluating Scalable Bayesian Deep Learning Methods for Robust Computer Vision. pages 318–319, 2020.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, pages 770–778, 2016.
-  Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization. arXiv:2006.16241 [cs, stat], August 2020. arXiv: 2006.16241.
-  Dan Hendrycks and Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. September 2018.
-  Derek Hoiem, Alexei A Efros, and Martial Hebert. Automatic photo pop-up. In ACM SIGGRAPH 2005 Papers, pages 577–584. 2005.
-  Alex Kendall and Yarin Gal. What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5574–5584. Curran Associates, Inc., 2017.
-  Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
-  Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6402–6413. Curran Associates, Inc., 2017.
-  Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based Out-of-distribution Detection. Advances in Neural Information Processing Systems, 33, 2020.
-  Antonio Loquercio, Mattia Segu, and Davide Scaramuzza. A General Framework for Uncertainty Estimation in Deep Learning. IEEE Robotics and Automation Letters, 5(2):3153–3160, April 2020.
-  Radford M. Neal. Bayesian Learning for Neural Networks. Springer Science & Business Media, December 2012.
-  Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring Calibration in Deep Learning. pages 38–41, 2019.
-  Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32:13991–14002, 2019.
-  Judea Pearl. Causality. Cambridge university press, 2009.
-  Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling. February 2018.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
-  Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ramanan, Benjamin Recht, and Ludwig Schmidt. Do Image Classifiers Generalize Across Time? arXiv:1906.02168 [cs, stat], December 2019. arXiv: 1906.02168.
-  Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches. February 2018.
-  Alex Wong, Xiaohan Fei, Stephanie Tsuei, and Stefano Soatto. Unsupervised Depth Completion From Visual Inertial Odometry. IEEE Robotics and Automation Letters, 5(2):1899–1906, April 2020. Conference Name: IEEE Robotics and Automation Letters.
-  Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In The IEEE International Conference on Computer Vision (ICCV), 2019.
-  Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’02, pages 694–699, New York, NY, USA, July 2002. Association for Computing Machinery.
Appendix A Additional Experiment Details and Results
A.1 Network Hyperparameters and Classifier Performance
Hyperparameters, training error, and validation error for training standard image classifiers using the cross-entropy loss are given in Table 2. Hyperparameters, training error, and validation error for training LQF networks are given in Table 3. Pretrained ImageNet networks were fine-tuned for one epoch on ImageNet with ReLUs replaced by LeakyReLUs prior to LQF, using the hyperparameters in Table 4. When training all networks, we saved checkpoints, along with their training and validation accuracy, every 10 epochs. We used a combination of training and validation errors to select an appropriate checkpoint for the rest of the experiments. Networks with dropout were trained with the same hyperparameters as the standard networks; their training and validation errors are shown in Table 5.
To train ensembles of standard networks and LQF networks, we used the same hyperparameters and procedure as for the baseline networks. The only difference is that each ensemble saw different data shuffling and a different 90% fraction of the data.
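The per-member data selection described above can be sketched as follows, with hypothetical dataset and ensemble sizes: each member gets its own shuffling and its own 90% fraction of the training data.

```python
import numpy as np

def ensemble_splits(n_examples, n_members, frac=0.9, seed=0):
    """One random index subset per ensemble member; each member sees a
    different shuffling and a different 90% fraction of the data."""
    rng = np.random.default_rng(seed)
    k = int(frac * n_examples)
    return [rng.permutation(n_examples)[:k] for _ in range(n_members)]

splits = ensemble_splits(1000, 20)
assert len(splits) == 20 and all(len(s) == 900 for s in splits)
```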
Test accuracy on the Objectron test set was 0.9607 ± 0.0015 for all frames and 0.9693 ± 0.0076 for only the anchor frames. Test accuracy on ImageNetVid-Robust was 0.7187 ± 0.0084 for all frames and 0.7207 ± 0.0139 for only the anchor frames. These aggregate accuracies do not reflect the variation shown in Figures 1 and 7 for ImageNetVid-Robust and Objectron, respectively.
| Parameter | Objectron | ImageNetVid |
| --- | --- | --- |
| Initial Learning Rate | 0.01 | 0.001 |
| Milestones | 25, 35 | 25, 35 |
| Selected Epoch | 50, 40, 40 | 30, 50, 50 |
| Training Error at Selected Epoch | 0.0326, 0.0366, 0.0318 | 0.0478, 0.0478, 0.0476 |
| Validation Error at Selected Epoch | 0.0320, 0.0220, 0.0320 | 0.2747, 0.2711, 0.2912 |

Hyperparameters and decision variables used in fine-tuning image classifiers on the Objectron and ImageNetVid datasets using stochastic gradient descent and the cross-entropy loss, for all three trials.
| Parameter | Objectron | ImageNetVid |
| --- | --- | --- |
| Initial Learning Rate | 5e-4 | 5e-4 |
| Selected Epoch | 50, 40, 50 | 40, 30, 40 |
| Training Error at Selected Epoch | 0.0186, 0.0220, 0.0180 | 0.0500, 0.0542, 0.0457 |
| Validation Error at Selected Epoch | 0.0000, 0.0200, 0.0280 | 0.2692, 0.2601, 0.2674 |
| Parameter | Value (ResNet-50) | Value (ResNet-101) |
| --- | --- | --- |
| Selected Epoch | 40, 50, 50 | 50, 50, 50 |
| Training Error at Selected Epoch | 0.0490, 0.0504, 0.0488 | 0.1110, 0.1079, 0.1084 |
| Validation Error at Selected Epoch | 0.0380, 0.0240, 0.0380 | 0.2527, 0.2619, 0.2564 |
A.2 Explicit 3D Scene Reconstruction for Objectron
Since the Objectron dataset was originally created for 3D object tracking, it contains a sparse 3D point cloud associated with each video frame, along with camera intrinsics. The points in the 3D point cloud are the output of a tracking algorithm, not a range sensor, making it an appropriate use case for VOICED, a depth completion algorithm. Results reported in Table 1 are computed using a pretrained network trained on the VOID dataset distributed with the VOICED source code.
Once a depth map is computed, we rotate and translate all points by a randomly sampled small rotation and translation. Rotations are roll, pitch, and yaw values sampled from a uniform distribution. Translations are sampled from a uniform distribution scaled by the mean estimated depth in meters. Finally, images were projected back onto the 2D plane and inpainted using standard techniques to fill in any gaps. Ultimately, we found that fine-tuning was not necessary to create realistic-looking reconstructions and reprojections. Example reprojections are shown in Figure 4.
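The perturb-and-reproject step can be sketched as follows; the intrinsics and sampling bounds here are illustrative stand-ins, not the values used in our experiments.

```python
import numpy as np

def euler_to_R(roll, pitch, yaw):
    """Rotation matrix from roll/pitch/yaw angles (radians)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

rng = np.random.default_rng(0)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])  # hypothetical intrinsics

# Back-projected 3D points (e.g., from a depth map), shape (N, 3), depths in meters.
pts = rng.uniform([-1, -1, 2], [1, 1, 5], size=(100, 3))

# Small random camera perturbation: angles up to ~2 degrees; translation
# scaled by the mean estimated depth (bounds are illustrative).
angles = rng.uniform(-0.035, 0.035, size=3)
t = rng.uniform(-0.01, 0.01, size=3) * pts[:, 2].mean()

moved = pts @ euler_to_R(*angles).T + t  # rigid transformation of the scene
proj = moved @ K.T                       # pinhole projection
pix = proj[:, :2] / proj[:, 2:3]         # perspective division -> pixel coordinates
assert pix.shape == (100, 2)
```

Gaps left by the reprojection are then filled by standard inpainting, as described above.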
Depth completion, as opposed to the monocular depth estimation described in Section 2, is both easier and more realistic, since the reconstruction depends not only on the prior but also on the sparse range measurements. The resulting performance should be considered an upper bound on what is achievable with monocular 3D reconstruction alone. The choice is opportunistic, since the Objectron dataset affords it and the reconstruction is more realistic than single-view 3D reconstruction.
A.3 Explicit 3D Scene Reconstruction for ImageNetVid-Robust
We attempted to use state-of-the-art monocular 3D depth estimation with small camera movements drawn from the same distributions as in the previous section. The depth estimation network was pretrained on the KITTI dataset of outdoor driving scenes. Since the ImageNetVid-Robust dataset does not contain any camera intrinsics, we assumed that the center of the camera aperture was located in the middle of the frame and that the focal length was equal to 1. The results, shown in Figure 5, led us to conclude that monocular 3D depth estimation on ImageNetVid images is not possible without fine-tuning. Furthermore, the lack of any (sparse or dense) depth information or camera intrinsics makes fine-tuning extremely difficult.
A.4 GAN Ensembling for Objectron
We attempted to use a pretrained conditional GAN to impute scenes from anchor frames in the Objectron test set. The results, shown in Figure 6, indicate that imputing scenes using a conditional GAN will require fine-tuning, which we defer to future work.
A.5 Computational Resources
All experiments were performed on a workstation with the following specs:
CPU (8 cores): Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
RAM: 64 GB
GPU 0: NVIDIA TITAN V, 12GB RAM
GPU 1: NVIDIA GeForce GTX 1080, 11GB RAM
A.6 Objectron Figures
Analogs to Figures 1, 2, and 3 for the Objectron dataset are shown in Figures 7, 8, and 9. Compared to ImageNetVid, there is less variability in the predictions, but far more variability in the logits.
Appendix B Logit Covariance Using LQF
B.1 Computing Logit Covariance from Diagonal Weight Covariance
Calculating the logit covariance from the diagonal weight covariance using standard matrix multiplication requires more computer memory than is available on a typical workstation. Therefore, we compute the logit covariance column by column using the following procedure, which performs one forward pass and one backward pass per class for each image. Please see the function getOutCov in our attached source code for more detail:
For each class, corresponding to one column of the logit covariance:
1. In the network object, set all stored gradient values to 0.
2. Run a backward pass from the logit of that class. PyTorch will compute and store the gradient of that logit with respect to the weights.
3. Multiply the stored gradient values by the values in the diagonal approximation of the weight covariance (we accomplish this step using an implementation of a simple preconditioner that edits the values of stored gradients), and copy these values into the weight variable (we accomplish this using SGD's update step with a learning rate of 1.0).
4. Run a second forward pass with the modified weights to obtain the corresponding column of the logit covariance.
The above procedure is complex and uses PyTorch in ways it was not meant to be used. Moreover, one forward pass and one backward pass per class for each image, although computationally feasible, is still too slow to be practical. The results in Table 1 motivate future work on more practical ways to compute the logit covariance. Using our computational resources, computing it for all anchor images of the ImageNetVid test set requires 3 hours with one GPU.
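The memory argument behind the column-by-column computation can be illustrated in plain NumPy (shapes hypothetical): with a Jacobian J of the logits with respect to the weights and a diagonal weight covariance, each column c of the logit covariance J diag(sigma) J^T equals J @ (sigma * J[c]), so the P x P weight covariance is never materialized.

```python
import numpy as np

# Hypothetical shapes: a linearized classifier with C classes and P weights.
rng = np.random.default_rng(0)
C, P = 4, 1000
J = rng.normal(size=(C, P))                 # Jacobian of logits w.r.t. weights
sigma = rng.uniform(0.1, 1.0, size=P)       # diagonal of the weight covariance

# Column-by-column: column c of the logit covariance is J @ (sigma * J[c]),
# requiring only O(C * P) memory instead of O(P^2).
cov_z = np.stack([J @ (sigma * J[c]) for c in range(C)], axis=1)

# Sanity check against the direct (memory-hungry) formula J diag(sigma) J^T.
direct = J @ np.diag(sigma) @ J.T
assert np.allclose(cov_z, direct)
```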
B.2 Finding a Value for Weight Covariance
In our experiments, we set the diagonal weight covariance to the sample variance of the linear weights, computed from an ensemble of 20 trained LQF networks. However, it is also admissible to treat the weight covariance as a tuning parameter over a validation dataset; the only requirement is that it be a positive semi-definite matrix. For now, though, the computationally slow procedure for computing the logit covariance described in the previous section makes the sample variance the most practical value: each iteration of tuning requires hours, while training an ensemble of 20 LQF networks is faster and yields reasonable results.
B.3 Transforming a Normal Distribution into Class Probabilities
Introducing a change of variables, the integral over the normal distribution of the logits reduces to a closed-form expression for the class probabilities using a standard Gaussian identity.
Appendix C Pseudo-Uncertainties and Normalized Scene Entropy
This section details preliminary work that ruled out the softmax vector and other pseudo-uncertainties from our model hierarchy, as stated in Section 3. Let $p$ be the probability vector of the empirical categorical distribution, with elements $p_k$. Then, the entropy of $p$ is
$H(p) = -\sum_k p_k \log p_k.$
The maximum possible entropy for any classification problem with $C$ classes is $H_{\max} = \log C$. Therefore, the normalized scene entropy
$\bar{H}(p) = H(p) / \log C$
may be taken as one possible measure of uncertainty based on the scene distribution. Note that predicting normalized scene entropy is an easier problem than creating an accurate Wellington Posterior, as two different categorical distributions may have the same entropy. We evaluate predictions of normalized scene entropy using the scene uncertainty calibration error (SUCE), computed over the anchor frames of a dataset of videos,
where the normalized scene entropy computed using the empirical paragon is compared against the scene entropy estimated using information from a single anchor frame. To estimate scene entropy, we consider the softmax vector, temperature-scaled softmax vectors, isotonic regression, and auxiliary networks. Summary results for both ImageNetVid and Objectron are shown in Table 6. More details on each particular method are given in the subsections below.
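The normalized scene entropy can be computed directly from an empirical categorical distribution; a minimal sketch:

```python
import numpy as np

def normalized_entropy(p):
    """Entropy of a categorical distribution, normalized by log C."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # convention: 0 * log 0 = 0
    h = -np.sum(nz * np.log(nz))
    return h / np.log(len(p))

# Uniform predictions over 4 classes -> maximal normalized entropy, 1.0.
assert abs(normalized_entropy([0.25, 0.25, 0.25, 0.25]) - 1.0) < 1e-12
# A confident one-hot prediction -> entropy 0.
assert normalized_entropy([1.0, 0.0, 0.0, 0.0]) == 0.0
```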
| Method | Objectron SUCE | ImageNetVid SUCE |
| --- | --- | --- |
| Softmax Vector | 0.0317 ± 0.0025 | 0.1148 ± 0.0058 |
| Temperature Scaling | 0.0347 ± 0.0023 | 0.2043 ± 0.0063 |
| Isotonic Regression (max softmax) | 0.0587 ± 0.0020 | 0.1268 ± 0.0024 |
| Isotonic Regression (entropy) | 0.0590 ± 0.0020 | 0.1266 ± 0.0026 |
| Isotonic Regression (energy) | 0.0579 ± 0.0026 | 0.1240 ± 0.0035 |
| Auxiliary Networks (logit input) | 0.0401 ± 0.0017 | 0.1081 ± 0.0453 |
| Auxiliary Networks (softmax input) | 0.0396 ± 0.0003 | 0.0589 ± 0.0040 |
| Auxiliary Networks (embedding input) | 0.1251 ± 0.0123 | 0.0854 ± 0.0102 |
C.1 Temperature Scaling
Our temperature scaling procedure consisted of adding a single temperature parameter to each network and fine-tuning it on the anchor frames of a validation dataset for 100 epochs using the cross-entropy loss and the Adam optimizer. The temperature at the epoch with the lowest cross-entropy loss was chosen for each trial. The temperatures selected for the three baseline networks were 1.1761, 1.0909, and 1.1715 on the Objectron dataset, and 1.3043, 1.3205, and 1.3121 on the ImageNetVid dataset.
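Temperature scaling divides the logits by a single scalar before the softmax. A sketch with hypothetical logits, showing that temperatures greater than 1 (as selected above) soften the predicted distribution:

```python
import numpy as np

def tempered_softmax(logits, T=1.0):
    """Softmax with a single temperature parameter T (T > 1 flattens)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

logits = np.array([4.0, 1.0, 0.5, -2.0])  # hypothetical logits
# T > 1 softens the distribution, raising its entropy.
assert entropy(tempered_softmax(logits, 1.31)) > entropy(tempered_softmax(logits, 1.0))
```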
C.2 Isotonic Regression
We isotonically mapped several simple statistics of the softmax vector (the maximum softmax value, entropy, and energy), computed on the anchor frames of a validation dataset, onto the empirical normalized scene entropy. Isotonic curves reflecting the results of Table 6 are shown in Figure 12.
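The three statistics fed to the isotonic regression can be computed from the logits as follows (logit values hypothetical); energy here is the negative log-sum-exp of the logits, one common definition:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float) - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def statistics(logits):
    """Three scalar statistics that can be isotonically mapped onto
    the empirical normalized scene entropy."""
    p = softmax(logits)
    max_softmax = p.max()
    ent = -np.sum(p[p > 0] * np.log(p[p > 0]))
    energy = -np.log(np.sum(np.exp(logits)))  # negative log-sum-exp
    return max_softmax, ent, energy

m, h, e = statistics(np.array([4.0, 1.0, 0.5, -2.0]))
assert 0 < m <= 1 and h >= 0
```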
C.3 Auxiliary Networks
We trained simple feedforward neural networks to map the softmax vector, logits, or embedding (we refer to the output of the second-to-last layer of ResNet-50 and ResNet-101 as the embedding) of an anchor frame to the scene entropy. Networks that mapped logits and softmax vectors had hidden layers with widths 512, 512, 256, 128, 64, and those that mapped embeddings had hidden layers with widths 4096, 2048, 1024, 512, 256. Other hyperparameters, training error, and validation error are shown in Table 7. The loss function was the MSE loss between the sigmoid of the auxiliary network's output, given its input (logits, softmax vector, or embedding), and the empirical scene uncertainty. Results listed in Tables 6 and 7 show that auxiliary networks effectively overfit the training data, but do not generalize to the validation or test dataset. This lack of generalization indicates that normalized scene entropy cannot be estimated using auxiliary networks.
| Parameter | Logit Input | Softmax Input | Embed. Input | Logit Input | Softmax Input | Embed. Input |
| --- | --- | --- | --- | --- | --- | --- |
| Selected Epoch | 50, 50, 50 | 50, 50, 50 | 70, 50, 50 | 30, 20, 20 | 60, 20, 30 | 30, 80, 70 |
| Training Loss at Epoch | 0.0025 ± 0.0002 | 0.0025 ± 0.0002 | 0.0029 ± 0.0002 | 0.0047 ± 0.0001 | 0.0086 ± 0.0068 | 0.0030 ± 0.0003 |
| Validation Loss at Epoch | 0.0051 ± 0.0010 | 0.0049 ± 0.0012 | 0.0062 ± 0.0015 | 0.0189 ± 0.0014 | 0.0123 ± 0.0064 | 0.0200 ± 0.0005 |