1 Introduction
Representation learning through contrastive objectives has recently broken new ground, matching the performance of fully supervised methods on image classification (Hjelm et al., 2018; He et al., 2020; Misra and Maaten, 2020; Chen et al., 2020a, b; Grill et al., 2020; Chen and He, 2021; Zbontar et al., 2021). While contrastive learning has shown strong practical results, it still lacks some important properties useful for real-world decision-making. One such property is uncertainty, which plays an important role in helping intelligent systems recognize and prevent errors. For example, uncertainty can be leveraged to find anomalies that are out-of-distribution (OOD), on which a model's predictions may be degraded or entirely out of place. However, current contrastive frameworks do not provide any indication of uncertainty, as they learn one-to-one mappings from inputs to embeddings.
Our work uses the temperature parameter to estimate the uncertainty of an input. While almost all contrastive frameworks include temperature in the objective, it has historically remained relatively unexplored compared to work on negative samples (Wu et al., 2020a; Xie et al., 2020), stop gradients (Grill et al., 2020; Chen and He, 2021; Zbontar et al., 2021), and transformation families (Tamkin et al., 2020; Tian et al., 2020b). Recently, it has been shown that a smaller temperature increases the model's penalty on difficult negative examples (Wang and Liu, 2021). With this intuition, we make temperature a learned, input-dependent variable. A high temperature is tantamount to the model declaring that a training input is difficult. Temperature, therefore, can be viewed as a form of uncertainty. We call this simple extension to the contrastive objective "Temperature as Uncertainty," or TaU for short.

On benchmark image datasets, we show that TaU is useful for out-of-distribution detection, outperforming baseline methods for extracting uncertainty such as ensembling or Bayesian posteriors over weights. We also show that one can easily derive uncertainty on top of pretrained representations, making this approach widely applicable to existing model checkpoints and infrastructure.
2 Temperature as Uncertainty
To start, we give a brief overview of contrastive learning to motivate the approach. Suppose we have a dataset of i.i.d. image samples $x_1, \ldots, x_N \sim p(x)$, where $p(x)$ is a distribution over a space of natural images $\mathcal{X}$. Let $\mathcal{T}$ be some family of image transformations, $t : \mathcal{X} \to \mathcal{X}$, equipped with a distribution $p(t)$. The common family of visual transformations includes a random mix of cropping, color jitter, Gaussian blurring, and horizontal flipping (Wu et al., 2018; Tian et al., 2020a; Zhuang et al., 2019; Bachman et al., 2019; He et al., 2020; Chen et al., 2020a).
Define an encoding function $f_\theta : \mathcal{X} \to \mathbb{R}^d$ that maps an image to an L2-normalized representation, where $f_\theta$ is parameterized by a deep neural network. The contrastive objective for the $i$-th example is:

$$\mathcal{L}_i = -\log \frac{\exp\!\big(f_\theta(t(x_i))^\top f_\theta(t'(x_i)) / \tau\big)}{\exp\!\big(f_\theta(t(x_i))^\top f_\theta(t'(x_i)) / \tau\big) + \sum_{j=1}^{K} \exp\!\big(f_\theta(t(x_i))^\top f_\theta(t'(x_j)) / \tau\big)} \quad (1)$$

where $x_1, \ldots, x_K$ represent i.i.d. negative samples, $t, t' \sim p(t)$, and $\tau$ is the temperature. We call transformations of the same image "positive examples" and transformations of different images "negative examples". We chose to present Eq. 1 in a very general form based on the noise-contrastive (Oord et al., 2018; Gutmann and Hyvärinen, 2010) lower bound to mutual information (Hjelm et al., 2018; Poole et al., 2019; Wu et al., 2020b; Tian et al., 2020b), although many popular frameworks like SimCLR (Chen et al., 2020a) and MoCo v2 (Chen et al., 2020b) can be directly derived from Eq. 1.
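To make the objective concrete, the following is a minimal NumPy sketch of the general contrastive loss above. The function names, the batch layout, and the use of explicit negative arrays are our illustration, not the paper's implementation:

```python
import numpy as np

def l2_normalize(z, axis=-1):
    """Project embeddings onto the unit sphere, as Eq. 1 assumes."""
    return z / np.linalg.norm(z, axis=axis, keepdims=True)

def info_nce_loss(anchors, positives, negatives, tau=0.1):
    """Contrastive (InfoNCE-style) loss for a batch.

    anchors:   (B, d) embeddings of t(x_i)
    positives: (B, d) embeddings of t'(x_i)
    negatives: (B, K, d) embeddings of K negative samples per anchor
    """
    anchors = l2_normalize(anchors)
    positives = l2_normalize(positives)
    negatives = l2_normalize(negatives)

    pos = np.sum(anchors * positives, axis=-1) / tau           # (B,)
    neg = np.einsum('bd,bkd->bk', anchors, negatives) / tau    # (B, K)
    logits = np.concatenate([pos[:, None], neg], axis=1)

    # Numerically stable -log softmax of the positive entry.
    m = logits.max(axis=1, keepdims=True)
    lse = np.log(np.exp(logits - m).sum(axis=1)) + m[:, 0]
    return float(np.mean(lse - pos))
```

Because the positive pair also appears in the denominator, the loss is always positive and shrinks as positive pairs align.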
Since Eq. 1 uses a dot product as a similarity function, the role of the temperature $\tau$ is to scale the sensitivity of the loss function (Wang and Liu, 2021). A $\tau$ closer to 0 accentuates when representations are different, resulting in larger gradients. In the same vein, a larger $\tau$ is more forgiving of such differences. In practice, varying $\tau$ has a dramatic impact on embedding quality. Traditionally, $\tau$ is a fixed value that the authors of a contrastive framework have painstakingly tuned.

We decide to learn an input-dependent temperature. In accordance with the observations above, learning an input-dependent temperature amounts to learning an embedding sensitivity for every input: in other words, a measure of representation uncertainty. Inputs with high temperature suggest more uncertainty, as the objective is more invariant to displacements, whereas inputs with low temperature suggest less uncertainty, as the objective is more sensitive to changes in embedding location.
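The sharpening effect of a small $\tau$ can be seen in a two-line numeric example. The similarity values below are our own illustration:

```python
import numpy as np

def positive_prob(pos_sim, neg_sims, tau):
    """Softmax probability assigned to the positive pair at temperature tau."""
    logits = np.array([pos_sim] + list(neg_sims)) / tau
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])

# A positive at cosine similarity 0.8 against one hard negative at 0.7:
# shrinking tau magnifies the 0.1 similarity gap.
probs = {tau: positive_prob(0.8, [0.7], tau) for tau in (1.0, 0.5, 0.1)}
```

As `tau` shrinks, the probability mass on the positive grows, so the loss (and its gradient) reacts more strongly to the hard negative, matching the intuition above.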
Implementing this idea is very straightforward. We replace $\tau$ in Eq. 1 with $\tau(x_i)$, overloading notation to define a mapping $\tau : \mathcal{X} \to \mathbb{R}_{>\tau_0}$ for some lower bound $\tau_0$. We call this new objective TaU, or Temperature as Uncertainty. In practice, we edit the encoder network to return $d + 1$ entries, the first $d$ of which are the embedding of $x_i$, with the last entry being the uncertainty $\tau(x_i)$. A sigmoid is used to bound $\tau$ to be positive, and a fixed constant $\tau_0$ is used to set a lower bound for stability. See Algo. 1 for pseudocode.
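The modification can be sketched in NumPy as follows. The $d+1$ head layout follows the text; `tau_min = 0.1`, the exact sigmoid-plus-offset composition, and the SimCLR-style use of in-batch negatives are assumptions for illustration:

```python
import numpy as np

def split_head(raw, tau_min=0.1):
    """Split raw network outputs (B, d+1) into unit embeddings and
    per-input temperatures. tau_min is an assumed lower bound."""
    z, u = raw[:, :-1], raw[:, -1]
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    tau = tau_min + 1.0 / (1.0 + np.exp(-u))  # sigmoid keeps tau above tau_min
    return z, tau

def tau_contrastive_loss(raw_a, raw_b, tau_min=0.1):
    """Contrastive loss where each anchor is scaled by its own temperature.
    Other in-batch examples serve as negatives."""
    za, ta = split_head(raw_a, tau_min)
    zb, _ = split_head(raw_b, tau_min)
    logits = (za @ zb.T) / ta[:, None]   # (B, B); diagonal entries are positives
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))
```

The only change relative to a fixed-temperature objective is that each row of logits is divided by that anchor's own $\tau(x_i)$.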
3 Related Work
Learned Temperature
Methods which learn temperature can be found in supervised learning (Zhang et al., 2018; Agarwala et al., 2020), model calibration (Guo et al., 2017; Neumann et al., 2018), language supervision (Radford et al., 2021), and few-shot learning (Oreshkin et al., 2018; Rajasegaran et al., 2020). In most of these approaches, temperature, when learned, is treated as a global parameter, not as a function of the input as in TaU. To the best of our knowledge, the only example of temperature learned as a function of the input is Neumann et al. (2018), which uses temperature for calibration. Instead, we use temperature in the context of self-supervised learning and apply it to OOD detection.

Uncertainty in Deep Learning
There is a rich body of work on adding uncertainty to deep learning models (Gawlikowski et al., 2021; Abdar et al., 2021), of which we highlight a few. Most straightforward is ensembling of neural networks (Tao, 2019), where multiple copies are trained with different parameter initializations to find many local minima. Further work attempts to enforce ensemble diversity for more variance (Liu et al., 2019; Abbasi et al., 2020). Another popular approach to uncertainty is through Bayesian neural networks (Goan and Fookes, 2020), of which the most practical formulation is Monte Carlo dropout (Gal and Ghahramani, 2016). This approach frames using dropout layers during both training and test time as equivalent to sampling weights from a posterior distribution over model parameters. Finally, most relevant is "hedged instance embeddings" (Oh et al., 2018), which edits the contrastive encoder to map an image to a Gaussian distribution rather than a point embedding. The primary drawbacks of this approach are (1) computational cost, as it requires multiple samples, and (2) it has not been shown to work in high dimensions. In our experiments, we compare these baselines to TaU.
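The MC dropout baseline can be sketched on a toy network. The architecture, weight scales, and dropout rate below are illustrative assumptions; the key idea is that the dropout mask stays stochastic at test time:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network; MC dropout resamples the dropout mask on every
# forward pass, so repeated passes disagree more on uncertain inputs.
W1 = rng.normal(scale=0.3, size=(8, 16))
W2 = rng.normal(scale=0.3, size=(16, 4))

def forward(x, p_drop=0.5, mc_dropout=True):
    h = np.maximum(x @ W1, 0.0)
    if mc_dropout:                       # left on at test time for MC dropout
        mask = rng.random(h.shape) > p_drop
        h = h * mask / (1.0 - p_drop)    # inverted-dropout rescaling
    return h @ W2

def mc_uncertainty(x, n_samples=50):
    """Mean std. dev. across stochastic passes serves as the uncertainty score."""
    preds = np.stack([forward(x) for _ in range(n_samples)])
    return float(preds.std(axis=0).mean())

x = rng.normal(size=(3, 8))
```

Disabling the mask (`mc_dropout=False`) recovers the ordinary deterministic forward pass.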
OOD Detection
Existing OOD algorithms mostly derive outlier scores on top of predictions made by large supervised neural networks trained on the inlier dataset, such as using the maximum softmax probability (Hendrycks and Gimpel, 2016), sensitivity to parameter perturbations (Liang et al., 2017), or Gram matrices on activation maps (Sastry and Oore, 2019) as the outlier score. While these methods work very well, reaching near-ceiling performance, they require human annotations, which may not be available.

4 Experiments
A primary application of uncertainty is to find abnormal or anomalous inputs. We aim to show that using TaU temperatures as uncertainty is effective for out-of-distribution (OOD) detection (Hendrycks and Gimpel, 2016; Liang et al., 2017; Sastry and Oore, 2019), with the added bonus of sacrificing little to no performance on downstream tasks.
OOD Detection
We study OOD detection, where inputs from an anomalous distribution are fed to a trained model. A well-performing metric should assign high uncertainty to these OOD inputs, thereby making it possible to classify whether an input is OOD.
We train TaU on CIFAR-10 as the inlier dataset, and consider three different OOD sets: CIFAR-100, SVHN, and TinyImageNet. We note that CIFAR-10 and CIFAR-100 are very similar in distribution, whereas SVHN is the most dissimilar. To measure performance, we compute the AUROC of correctly classifying an example as OOD or not. We compare TaU to several baselines. Assuming unrestricted computational power and memory, one expensive procedure to derive uncertainties is to fit a k-nearest neighbors algorithm on the entire training corpus and treat the average distance of a new example to its k nearest neighbors as an uncertainty score. We tried several values of k and report the best. This serves as an upper bound on performance. For other baselines, please refer to Sec. 3.

Method (CIFAR-10) | CIFAR-100 | SVHN | TinyImageNet
TaU + SimCLR | 0.746 | 0.760 | —
TaU + MoCo v2 | — | 0.968 | —
SimCLR + kNN | 0.746 | — | —
MoCo v2 + kNN | — | — | —
SimCLR + MC Dropout (Gal and Ghahramani, 2016) | — | — | —
Supervised + MC Dropout (Gal and Ghahramani, 2016) | — | — | —
Hedged Instance Embedding (Oh et al., 2018) | — | — | —
Ensemble of 5 SimCLRs | — | — | —

Table 1: AUROC on classifying examples as in- or out-of-distribution, with CIFAR-10 as the inlier dataset.
From Table 1, we observe that on CIFAR-100 and TinyImageNet (two image corpora with content similar to CIFAR-10), TaU outperforms or matches all baselines, though it only surpasses SimCLR + kNN by a small margin of 0-2%. However, for SVHN, an image corpus very different in content from CIFAR-10, TaU outperforms all baselines by at least 13%. In fact, we find most baselines do not generalize well to contrastive learning, as many perform at near-chance AUROC. Even prior methods designed specifically for contrastive uncertainty (Oh et al., 2018) do not consistently perform well. The exception is kNN, with the caveat that the OOD set was not too far from the training set. Nearest neighbors fundamentally relies on a good distance function, which is achievable when the OOD input can be properly embedded. But in cases when we are truly OOD, it may not be clear where to embed an anomalous image. In these cases, as with SVHN, kNN approaches struggle.
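The kNN uncertainty baseline and the AUROC metric used throughout this section can be sketched as follows. The value of k, the toy data, and the rank-based AUROC computation (which ignores ties) are our own choices:

```python
import numpy as np

def knn_ood_score(train_emb, query_emb, k=10):
    """Mean distance to the k nearest training embeddings (k assumed;
    the paper swept several values and reported the best)."""
    d = np.linalg.norm(query_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

def auroc(scores_in, scores_out):
    """AUROC via the Mann-Whitney rank statistic: P(score_out > score_in)."""
    s = np.concatenate([scores_in, scores_out])
    labels = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos, n_neg = labels.sum(), (1 - labels).sum()
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

With well-separated inlier and outlier scores, this statistic approaches 1.0; chance performance is 0.5.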
Linear Evaluation
Although we have shown that TaU uncertainties can detect OOD inputs, they would not be of much use if they came at a large cost to performance on downstream tasks. To show this is not the case, we measure performance through both linear evaluation (Chen et al., 2020a) and k-nearest neighbors on the training set (Zhuang et al., 2019), where the predicted label for a test example is the label of the closest example in the training set. Please refer to the appendix for experiment details.
Method | kNN Eval | Linear Eval
TaU + SimCLR | — | —
TaU + MoCo v2 | — | —
SimCLR | — | —
MoCo v2 | — | —

Table 2: mean and standard deviation in accuracy are measured over three runs with different random seeds. The best-performing models are bolded.
From Table 2, we observe that TaU performs only slightly worse than its deterministic counterparts, with a small reduction of 2-3 percentage points on both linear and k-nearest neighbors evaluation. While there is a nonzero cost to adding uncertainty, we believe that trading a few percentage points for a measure of confidence is practically worthwhile.
Uncertainty on Pretrained Models
We next show that TaU can be fine-tuned on top of pretrained models, enabling uncertainties to be generated post hoc on popular off-the-shelf checkpoints. Specifically, we fine-tune on supervised, SimCLR, BYOL, and CLIP embeddings. All models were pretrained using ResNet-50 on ImageNet, with the exception of CLIP, which uses a ViT (Dosovitskiy et al., 2020). We fine-tune TaU uncertainties for 40 epochs, and all images are reshaped to 224 by 224 pixels.
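This fine-tuning setup (frozen encoder, trainable temperature head) can be sketched on stand-in features. The linear head form, step size, and finite-difference gradient are illustrative assumptions rather than the paper's training recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen pretrained features of two views of the same batch
# (real SimCLR/BYOL/CLIP features would be used; the encoder stays frozen).
emb_a = rng.normal(size=(16, 32))
emb_b = emb_a + 0.1 * rng.normal(size=(16, 32))

w = np.zeros(32)  # parameters of the temperature head; the only thing trained

def tau_of(emb, w, tau_min=0.1):
    """Linear temperature head with a sigmoid bound (form assumed)."""
    return tau_min + 1.0 / (1.0 + np.exp(-(emb @ w)))

def loss(w):
    za = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    zb = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = (za @ zb.T) / tau_of(emb_a, w)[:, None]
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))

# One finite-difference gradient step on the head; the embeddings never change.
eps, lr = 1e-4, 0.01
grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                 for e in np.eye(32)])
w_new = w - lr * grad
```

Because only the head's parameters receive gradients, existing checkpoints can be reused without retraining the encoder.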
Method (ImageNet) | CIFAR-10 | CIFAR-100 | SVHN | TinyImageNet | LSUN | COCO | CelebA
TaU + Supervised | 0.913 | 0.874 | 0.978 | 0.771 | 0.657 | 0.458 | 0.657
TaU + SimCLR (Chen et al., 2020a) | 0.823 | 0.870 | 0.968 | 0.747 | 0.552 | 0.554 | 0.717
TaU + BYOL (Grill et al., 2020) | 0.763 | 0.808 | 0.955 | 0.686 | 0.471 | 0.497 | 0.840
TaU + CLIP (Radford et al., 2021) | 0.056 | 0.044 | 0.071 | 0.154 | 0.779 | 0.579 | 0.883

Table 3: AUROC for OOD detection with ImageNet as the inlier dataset, across a wide survey of outlier datasets.
Table 3 reports AUROC for OOD detection across a wide survey of outlier datasets. We find that for supervised, SimCLR, and BYOL embeddings, the learned TaU uncertainties are largely able to identify OOD inputs. The exception is COCO, likely due to its close similarity with ImageNet data points. CLIP, however, surprisingly faces the opposite problem, with low OOD scores on most datasets but strong performance on COCO and LSUN. Further work could explore whether CLIP's behavior is due to differences in objective, architecture, or training.
5 Limitations and Future Work
We presented TaU, a simple method for adding uncertainty to contrastive learning objectives by repurposing temperature as uncertainty. In our experiments, we compared TaU to existing benchmark algorithms and found competitive downstream performance, in addition to TaU uncertainties being useful for out-of-distribution detection. We then demonstrated how uncertainty can be added to already trained model checkpoints, enabling practitioners to reuse computation.
We discuss an important limitation: our approach is restricted to contrastive algorithms built on NCE. Other approaches, such as SimSiam (Chen and He, 2021), BYOL (Grill et al., 2020), and Barlow Twins (Zbontar et al., 2021), replace negative examples entirely with stop gradients; with these, we find limited success for TaU. Future work can also explore TaU-like techniques for detecting corrupted or adversarial examples.
References

[1] (2020) Toward adversarial robustness by diversity in an ensemble of specialized deep neural networks. In Canadian Conference on Artificial Intelligence, pp. 1–14.
[2] (2021) A review of uncertainty quantification in deep learning: techniques, applications and challenges. Information Fusion.
[3] (2020) Temperature check: theory and practice for training models with softmax-cross-entropy losses. arXiv preprint arXiv:2010.07344.
[4] (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910.
[5] (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607.
[6] (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
[7] (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.
[8] (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
[9] (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
[10] (2021) A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342.
[11] (2020) Bayesian neural networks: an introduction and survey. In Case Studies in Applied Bayesian Data Science, pp. 45–87.
[12] (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
[13] (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
[14] (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304.
[15] (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
[16] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[17] (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
[18] (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[19] (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
[20] (2019) Deep neural network ensembles against deception: ensemble diversity, accuracy and robustness. In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), pp. 274–282.
[21] (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717.
[22] (2018) Relaxed softmax: efficient confidence auto-calibration for safe pedestrian detection.
[23] (2018) Modeling uncertainty with hedged instance embedding. arXiv preprint arXiv:1810.00319.
[24] (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[25] (2018) TADAM: task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123.
[26] (2019) On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180.
[27] (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
[28] (2020) Self-supervised knowledge distillation for few-shot learning. arXiv preprint arXiv:2006.09785.
[29] (2019) Detecting out-of-distribution examples with in-distribution examples and Gram matrices. arXiv preprint arXiv:1912.12510.
[30] (2020) Viewmaker networks: learning views for unsupervised representation learning. arXiv preprint arXiv:2010.07432.
[31] (2019) Deep neural network ensembles. In International Conference on Machine Learning, Optimization, and Data Science, pp. 1–12.
[32] (2020) Contrastive multiview coding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI, pp. 776–794.
[33] (2020) What makes for good views for contrastive learning? arXiv preprint arXiv:2005.10243.
[34] (2021) Understanding the behaviour of contrastive loss. arXiv preprint arXiv:2012.09740v2.
[35] (2020) Conditional negative sampling for contrastive learning of visual representations. arXiv preprint arXiv:2010.02037.
[36] (2020) On mutual information in contrastive learning for visual representations. arXiv preprint arXiv:2005.13149.
[37] (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742.
[38] (2020) Delving into inter-image invariance for unsupervised visual representations. arXiv preprint arXiv:2008.11702.
[39] (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
[40] (2021) Barlow Twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.
[41] (2018) Heated-up softmax embedding. arXiv preprint arXiv:1809.04157.
[42] (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012.
Appendix A Training Hyperparameters
Pretraining
For all models, we use a representation dimensionality of 128. We use the LARS optimizer [39] with learning rate 1e-4, weight decay 1e-6, and batch size 128 for 200 epochs, as described in [5]. For baseline models (no uncertainty), we use a fixed temperature. For MoCo v2, we use a memory queue of negative samples with a momentum of 0.999 for updating the queue; we use the same optimizer as described above but with learning rate 1e-3. For CIFAR-10, all images are resized to 32x32 pixels; for ImageNet, all images are resized to 256x256 pixels (with a 224x224 crop size). During pretraining, we use random resized cropping, color jitter, random grayscale, random Gaussian blur, horizontal flipping, and pixel normalization (with ImageNet statistics). During testing, we only center crop and apply pixel normalization. For encoders trained from scratch, we use ResNet-18 [16]. We adapted the PyTorch Lightning Bolts implementations of SimCLR and MoCo v2, found at https://github.com/PyTorchLightning/lightning-bolts. For the larger models, we downloaded existing SimCLR (ResNet-50) checkpoints trained on ImageNet from https://github.com/google-research/simclr and converted them to PyTorch checkpoints using https://github.com/Separius/SimCLRv2-Pytorch.
Downstream Classification
We freeze encoder parameters, remove the final L2 normalization, and append a 2-layer MLP with a hidden dimension of 128 and ReLU nonlinearity. For optimization, we use SGD with batch size 256, learning rate 1e-4, weight decay 1e-6, and a cosine learning rate schedule that drops at epochs 60 and 80, for a total of 100 epochs.
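The probe described above can be sketched as a forward pass in NumPy. The weight scales are arbitrary and the SGD training loop is omitted; only the shapes and the hidden-layer structure mirror the setup in the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-layer MLP probe: frozen 128-d features in, 10 class logits out,
# hidden width 128 with ReLU, mirroring the evaluation head described above.
W1 = rng.normal(scale=0.05, size=(128, 128)); b1 = np.zeros(128)
W2 = rng.normal(scale=0.05, size=(128, 10));  b2 = np.zeros(10)

def probe(features):
    h = np.maximum(features @ W1 + b1, 0.0)   # hidden layer + ReLU
    return h @ W2 + b2                        # class logits

features = rng.normal(size=(4, 128))          # stand-in frozen embeddings
logits = probe(features)
```

Only `W1`, `b1`, `W2`, and `b2` would be updated during linear evaluation; the encoder producing `features` stays frozen throughout.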
Optimization Stability
When optimizing the TaU objective, we found optimization instability, where the temperature would either collapse to 0 or grow without bound if left unconstrained. We found it crucial to employ a few tricks for optimization stability. First, we follow Neumann et al. and have our network predict an unconstrained score rather than the temperature directly [22]; this changes the training dynamics but does not change the underlying equation. Second, we bound the temperature to between 0 and 1 using a sigmoid function. Finally, we divide the pre-sigmoid score by a fixed constant, which helps initialize the temperature to be in the same range as fixed-temperature models. When using the uncertainty for out-of-distribution detection, we found that using the pre-sigmoid score worked much better than the post-sigmoid temperature, as the differences between post-sigmoid values became indistinguishable in float32.
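These tricks, and the float32 saturation issue, can be demonstrated in a few lines. The `tau_min` and `scale` values are illustrative assumptions; the paper's constants are not reproduced here:

```python
import numpy as np

def temperature_head(u_raw, tau_min=0.1, scale=10.0):
    """Bounded temperature from a raw network output.
    tau_min and scale are illustrative values, not the paper's."""
    u = u_raw / scale                          # division trick: start near sigmoid(0)
    tau = tau_min + 1.0 / (1.0 + np.exp(-u))   # sigmoid bound, then lower-bound offset
    return tau, u                              # keep pre-sigmoid u as the OOD score

u_raw = np.array([200.0, 250.0], dtype=np.float32)
tau, u = temperature_head(u_raw)
# In float32 both sigmoids saturate to the same tau, but the
# pre-sigmoid scores u = [20, 25] still rank the two inputs.
```

This is exactly why the pre-sigmoid score is the better OOD statistic: once the sigmoid saturates, post-sigmoid values become indistinguishable at float32 precision while the raw scores remain distinct.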