Temperature as Uncertainty in Contrastive Learning

10/08/2021 ∙ by Oliver Zhang, et al. ∙ Stanford University

Contrastive learning has demonstrated great capability to learn representations without annotations, even outperforming supervised baselines. However, it still lacks important properties useful for real-world application, one of which is uncertainty. In this paper, we propose a simple way to generate uncertainty scores for many contrastive methods by re-purposing temperature, a mysterious hyperparameter used for scaling. By observing that temperature controls how sensitive the objective is to specific embedding locations, we aim to learn temperature as an input-dependent variable, treating it as a measure of embedding confidence. We call this approach "Temperature as Uncertainty", or TaU. Through experiments, we demonstrate that TaU is useful for out-of-distribution detection, while remaining competitive with benchmarks on linear evaluation. Moreover, we show that TaU can be learned on top of pretrained models, enabling uncertainty scores to be generated post-hoc with popular off-the-shelf models. In summary, TaU is a simple yet versatile method for generating uncertainties for contrastive learning. Open source code can be found at: https://github.com/mhw32/temperature-as-uncertainty-public.

1 Introduction

Figure 1: CIFAR10 images: those on the left have high TaU certainty, while those on the right have low TaU certainty.

Representation learning through contrastive objectives has recently broken new ground, matching the performance of fully supervised methods on image classification Hjelm et al. (2018); He et al. (2020); Misra and Maaten (2020); Chen et al. (2020a, b); Grill et al. (2020); Chen and He (2021); Zbontar et al. (2021). While contrastive learning has shown strong practical results, it still lacks some important properties useful for real-world decision-making. One such property is uncertainty, which plays an important role in intelligent systems recognizing and preventing errors. For example, uncertainty can be leveraged to find anomalies that are out-of-distribution (OOD), on which a model’s predictions may be degraded or entirely out-of-place. However, current contrastive frameworks do not provide any indication of uncertainty as they learn one-to-one mappings from inputs to embeddings.

Our work uses the temperature parameter to estimate the uncertainty of an input. While almost all contrastive frameworks include temperature in the objective, it historically has remained relatively unexplored compared to work on negative samples Wu et al. (2020a); Xie et al. (2020), stop gradients Grill et al. (2020); Chen and He (2021); Zbontar et al. (2021), and transformation families Tamkin et al. (2020); Tian et al. (2020b). Recently, it has been shown that a smaller temperature increases the model’s penalty on difficult negative examples Wang and Liu (2021). With this intuition, we make temperature a learned, input-dependent variable. A high temperature is tantamount to the model declaring that a training input is difficult. Temperature, therefore, can be viewed as a form of uncertainty. We call this simple extension to the contrastive objective “Temperature as Uncertainty”, or TaU for short.

On benchmark image datasets, we show that TaU is useful for out-of-distribution detection, outperforming baseline methods for extracting uncertainty such as ensembling or Bayesian posteriors over weights. We also show that one can easily derive uncertainty on top of pretrained representations, making this approach widely applicable to existing model checkpoints and infrastructure.

2 Temperature as Uncertainty

To start, we give a brief overview of contrastive learning to motivate the approach. Suppose we have a dataset of $K$ i.i.d. image samples $x_1, \dots, x_K$ drawn from $p(x)$, a distribution over a space of natural images $\mathcal{X}$. Let $\mathcal{T}$ be some family of image transformations, $t : \mathcal{X} \to \mathcal{X}$, equipped with a distribution $p(t)$. The common family of visual transformations includes a random mix of cropping, color jitter, gaussian blurring, and horizontal flipping Wu et al. (2018); Tian et al. (2020a); Zhuang et al. (2019); Bachman et al. (2019); He et al. (2020); Chen et al. (2020a).
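
For concreteness, one typical instantiation of this transformation family, in the style of the cited frameworks (all parameters below are illustrative, not the paper's exact settings):

from torchvision import transforms

# An illustrative SimCLR-style view transformation for 32x32 images.
transform = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(kernel_size=3),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])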

Define an encoding function $f_\theta : \mathcal{X} \to \mathbb{R}^d$ that maps an image to an $L_2$-normalized representation. Let $f_\theta$ be parameterized by a deep neural network. The contrastive objective for the $i$-th example is:

$$ \mathcal{L}_i = -\log \frac{\exp\left( f_\theta(t(x_i)) \cdot f_\theta(t'(x_i)) / \tau \right)}{\sum_{j=1}^{K} \exp\left( f_\theta(t(x_i)) \cdot f_\theta(t'(x_j)) / \tau \right)} \quad (1)$$

where $x_1, \dots, x_K$ represent i.i.d. samples, $t, t' \sim p(t)$ are sampled transformations, and $\tau$ is the temperature. We call transformations of the same image “positive examples” and transformations of different images “negative examples”. We chose to present Eq. 1 in a very general form based on the noise contrastive estimation Oord et al. (2018); Gutmann and Hyvärinen (2010) lower bound to mutual information Hjelm et al. (2018); Poole et al. (2019); Wu et al. (2020b); Tian et al. (2020b), although many popular frameworks like SimCLR Chen et al. (2020a) and MoCo-v2 Chen et al. (2020b) can be directly derived from Eq. 1.

Algorithm 1 TaU + SimCLR
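
Below is a minimal PyTorch-style sketch of one TaU + SimCLR training step; the two-view batch layout, the self-similarity masking, and the value of the lower bound tau_min are assumptions based on the description in this section:

import torch
import torch.nn.functional as F

def tau_simclr_loss(z1, z2, tau1, tau2):
    # z1, z2: L2-normalized embeddings of two views, each of shape (N, d).
    # tau1, tau2: per-example temperatures, each of shape (N,).
    z = torch.cat([z1, z2])                      # (2N, d)
    tau = torch.cat([tau1, tau2])                # (2N,)
    logits = (z @ z.t()) / tau.unsqueeze(1)      # anchor i scaled by its own tau_i
    n = logits.shape[0]
    # Mask self-similarity with a large negative value.
    logits = logits.masked_fill(torch.eye(n, dtype=torch.bool), -1e9)
    # The positive for view i is the other view of the same image.
    targets = torch.arange(n).roll(n // 2)
    return F.cross_entropy(logits, targets)

# Toy usage: random tensors stand in for encoder outputs with d + 1 entries,
# the last entry being the raw (unbounded) uncertainty.
N, d, tau_min = 8, 128, 0.07
out1, out2 = torch.randn(N, d + 1), torch.randn(N, d + 1)
z1, z2 = F.normalize(out1[:, :d], dim=1), F.normalize(out2[:, :d], dim=1)
tau1 = tau_min + torch.sigmoid(out1[:, d])       # sigmoid bound plus lower bound
tau2 = tau_min + torch.sigmoid(out2[:, d])
loss = tau_simclr_loss(z1, z2, tau1, tau2)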

Since Eq. 1 uses a dot product as a distance function, the role of temperature $\tau$ is to scale the sensitivity of the loss function Wang and Liu (2021). A $\tau$ closer to 0 accentuates when representations are different, resulting in larger gradients. In the same vein, a larger $\tau$ is more forgiving of such differences. In practice, varying $\tau$ has a dramatic impact on embedding quality. Traditionally, $\tau$ is fixed to values that the authors of a contrastive framework have painstakingly tuned.
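
For a concrete illustration (the similarity values are arbitrary), consider the softmax weights that Eq. 1 places on one positive and three negatives at two temperatures; a small $\tau$ concentrates the loss, and hence the gradient, on the hardest negative:

import numpy as np

# Similarities of an anchor to one positive (index 0) and three negatives.
sims = np.array([0.9, 0.8, 0.3, 0.1])
for tau in (1.0, 0.1):
    w = np.exp(sims / tau)
    w /= w.sum()
    print(f"tau={tau}: softmax weights {w.round(3)}")
# tau=1.0 spreads weight broadly; tau=0.1 concentrates nearly all of it
# on the positive and the hardest negative (similarity 0.8).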

We instead learn an input-dependent temperature. In accordance with the observations above, an input-dependent temperature amounts to an embedding sensitivity for every input: in other words, a measure of representation uncertainty. Inputs with high temperature suggest more uncertainty, as the objective is more invariant to displacements, whereas inputs with low temperature suggest less uncertainty, as the objective is more sensitive to changes in embedding location.

Implementing this idea is straightforward. We replace $\tau$ in Eq. 1 with $\tau_\theta(x_i)$, overloading notation to define a mapping $\tau_\theta : \mathcal{X} \to \mathbb{R}_{> \tau_{\min}}$ for some lower bound $\tau_{\min} > 0$. We call this new objective TaU, or Temperature as Uncertainty. In practice, we edit the encoder network to return $d + 1$ entries, the first $d$ of which are the embedding of $x_i$, and the last entry being the uncertainty for $x_i$. A sigmoid is used to bound the temperature to be positive, and a fixed constant is used to set a lower bound for stability. See Algo. 1 for pseudo-code.

3 Related Work

Learned Temperature

Methods which learn temperature can be found in supervised learning Zhang et al. (2018); Agarwala et al. (2020), model calibration Guo et al. (2017); Neumann et al. (2018), language supervision Radford et al. (2021), and few-shot learning Oreshkin et al. (2018); Rajasegaran et al. (2020). In most of these approaches, temperature is treated as a global parameter when it is learned, not as a function of the input as in TaU. The only example, to the best of our knowledge, of learned temperature as a function of the input is Neumann et al. (2018), which uses temperature for calibration. Instead, we use temperature in the context of self-supervised learning and apply it to OOD detection.

Uncertainty in Deep Learning

There is a rich body of work on adding uncertainty to deep learning models Gawlikowski et al. (2021); Abdar et al. (2021), of which we highlight a few. Most straightforward is ensembling of neural networks Tao (2019), where multiple copies are trained with different parameter initializations to find many local minima. Further work attempts to enforce ensemble diversity for more variance Liu et al. (2019); Abbasi et al. (2020). Another popular approach to uncertainty is through Bayesian neural networks Goan and Fookes (2020), of which the most practical formulation is Monte Carlo dropout Gal and Ghahramani (2016). This approach frames using dropout layers during training and test time as equivalent to sampling weights from a posterior distribution over model parameters. Finally, most relevant is “hedged instance embeddings” Oh et al. (2018), which edits the contrastive encoder to map an image to a Gaussian distribution rather than a point embedding. The primary drawbacks of this approach are (1) its computational cost, as it requires multiple samples, and (2) it is not proven to work in high dimensions. In our experiments, we compare these baselines to TaU.

OOD Detection

Existing OOD algorithms mostly derive outlier scores on top of predictions made by large supervised neural networks trained on the inlier dataset, such as using the maximum softmax probability Hendrycks and Gimpel (2016), sensitivity to parameter perturbations Liang et al. (2017), or Gram matrices on activation maps Sastry and Oore (2019) as the outlier score. While these methods work very well, reaching near-ceiling performance, they require human annotations, which may not be available.

4 Experiments

A primary application of uncertainty is finding abnormal or anomalous inputs. We aim to show that using TaU temperatures as uncertainty is effective for out-of-distribution (OOD) detection Hendrycks and Gimpel (2016); Liang et al. (2017); Sastry and Oore (2019), while sacrificing little to no performance on downstream tasks.

OOD Detection

We study OOD detection, where inputs from an anomalous distribution are fed to a trained model. A well-performing metric should assign high uncertainty to these OOD inputs, thereby making it possible to classify whether an input is OOD.

We train TaU on CIFAR10 as the inlier dataset, and consider three different OOD sets: CIFAR100, SVHN, and TinyImageNet. We note that CIFAR10 and CIFAR100 are very similar in distribution, whereas SVHN is the most dissimilar. To measure performance, we compute AUROC on correctly classifying an example as OOD or not. We compare TaU to several baselines. Assuming unrestricted computational power and memory, one expensive procedure to derive uncertainties is to fit a k-nearest neighbors algorithm on the entire training corpus, and treat the average distance of a new example to its $k$ nearest neighbors as an uncertainty score. We try several values of $k$ and report the best-performing one. This serves as an upper bound on performance. For other baselines, please refer to Sec. 3.
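
A minimal sketch of this kNN baseline, with random arrays standing in for encoder embeddings ($k$ and the array shapes are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 128))          # stand-in inlier training embeddings
test_in = rng.normal(size=(200, 128))             # stand-in inlier test embeddings
test_ood = rng.normal(loc=1.0, size=(200, 128))   # stand-in OOD embeddings

k = 10  # illustrative; the paper tunes k
knn = NearestNeighbors(n_neighbors=k).fit(train_emb)

def knn_uncertainty(emb):
    dists, _ = knn.kneighbors(emb)    # distances to the k nearest training points
    return dists.mean(axis=1)         # average distance as the uncertainty score

scores = np.concatenate([knn_uncertainty(test_in), knn_uncertainty(test_ood)])
labels = np.concatenate([np.zeros(len(test_in)), np.ones(len(test_ood))])
print("OOD AUROC:", roc_auc_score(labels, scores))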

Method (CIFAR10)                                   CIFAR100   SVHN    TinyImageNet
TaU + SimCLR                                       0.746      0.760   –
TaU + MoCo-v2                                      –          0.968   –
SimCLR + kNN                                       0.746      –       –
MoCo-v2 + kNN                                      –          –       –
SimCLR + MC Dropout Gal and Ghahramani (2016)      –          –       –
Supervised + MC Dropout Gal and Ghahramani (2016)  –          –       –
Hedged Instance Embedding Oh et al. (2018)         –          –       –
Ensemble of 5 SimCLRs                              –          –       –
Table 1: Downstream Out-of-Distribution Detection: comparison of TaU to several popular baselines for uncertainty on deep neural networks. Out-of-distribution AUROC is reported.

From Table 1, we observe that on CIFAR100 and TinyImageNet – two image corpora similar in content to CIFAR10 – TaU outperforms (or matches) all baselines, though it only surpasses SimCLR + kNN by a small margin of 0-2%. However, for SVHN – an image corpus very different in content from CIFAR10 – TaU outperforms all baselines by at least 13%. In fact, we find most baselines do not generalize well to contrastive learning, as many perform near chance AUROC. Even prior methods designed specifically for contrastive uncertainty Oh et al. (2018) do not consistently perform well. The exception is kNN, with the caveat that the OOD set must not be too far from the training set. Nearest neighbors fundamentally relies on a good distance function, which is achievable when the OOD input can be properly embedded. But in cases that are truly OOD, it may not be clear where to embed an anomalous image. In these cases, as with SVHN, kNN approaches struggle.

Linear Evaluation Although we have shown that TaU uncertainties can detect OOD inputs, they would not be of much use if they came at a large cost to performance on downstream tasks. To show this is not the case, we measure performance through both linear evaluation Chen et al. (2020a) and k-nearest neighbors on the training set Zhuang et al. (2019) (where the predicted label for a test example is the label of the closest example in the training set); see the sketch below. Please refer to the appendix for experiment details.
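
A small sketch of the k-nearest neighbors evaluation with stand-in embeddings and labels (one neighbor, as described above):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 128))     # stand-in training embeddings
train_labels = rng.integers(0, 10, size=1000)
test_emb = rng.normal(size=(200, 128))       # stand-in test embeddings
test_labels = rng.integers(0, 10, size=200)

# Each test embedding receives the label of its closest training embedding.
clf = KNeighborsClassifier(n_neighbors=1).fit(train_emb, train_labels)
print("kNN eval accuracy:", clf.score(test_emb, test_labels))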

Method           kNN Eval   Linear Eval
TaU + SimCLR     –          –
TaU + MoCo-v2    –          –
SimCLR           –          –
MoCo-v2          –          –
Table 2: Downstream Image Classification: mean and standard deviation in accuracy are measured over three runs with different random seeds. The best performing models are bolded.

From Table 2, we observe that TaU models perform only slightly worse than their deterministic counterparts, with a small reduction of 2-3 percentage points on both linear and k-nearest neighbors evaluation. While there is a non-zero cost to adding uncertainty, we believe that trading a few percentage points for a measure of confidence is practically worthwhile.

Uncertainty on Pretrained Models

We next show that TaU can be finetuned on top of pretrained models, enabling uncertainties to be generated post-hoc on popular off-the-shelf checkpoints. Specifically, we finetune on supervised, SimCLR, BYOL, and CLIP embeddings. All models were pretrained using ResNet50 on ImageNet, with the exception of CLIP, which uses a ViT Dosovitskiy et al. (2020). We finetune TaU uncertainties for 40 epochs, and all images are reshaped to 224 by 224 pixels.
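
A sketch of this post-hoc setup, reusing the tau_simclr_loss sketch from Sec. 2; the encoder handle, the 2048-d feature size, and the data loader are hypothetical names, and only the small temperature head is trained:

import torch
import torch.nn.functional as F

# 'encoder' is a frozen pretrained trunk (e.g., a ResNet50 with 2048-d
# features); 'loader' yields two augmented views per image. Both are
# assumed to exist.
encoder.requires_grad_(False)
tau_head = torch.nn.Linear(2048, 1)          # predicts a raw per-image score
opt = torch.optim.Adam(tau_head.parameters(), lr=1e-4)
tau_min = 0.07                               # illustrative lower bound

for x1, x2 in loader:
    with torch.no_grad():
        h1, h2 = encoder(x1), encoder(x2)    # embeddings stay fixed
    tau1 = tau_min + torch.sigmoid(tau_head(h1)).squeeze(-1)
    tau2 = tau_min + torch.sigmoid(tau_head(h2)).squeeze(-1)
    z1, z2 = F.normalize(h1, dim=1), F.normalize(h2, dim=1)
    loss = tau_simclr_loss(z1, z2, tau1, tau2)   # loss sketch from Sec. 2
    opt.zero_grad(); loss.backward(); opt.step()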

Method (ImageNet)                   CIFAR10  CIFAR100  SVHN   TinyImgNet  LSUN   COCO   CelebA
TaU + Supervised                    0.913    0.874     0.978  0.771       0.657  0.458  0.657
TaU + SimCLR Chen et al. (2020a)    0.823    0.870     0.968  0.747       0.552  0.554  0.717
TaU + BYOL Grill et al. (2020)      0.763    0.808     0.955  0.686       0.471  0.497  0.840
TaU + CLIP Radford et al. (2021)    0.056    0.044     0.071  0.154       0.779  0.579  0.883
Table 3: Out-of-Distribution Detection using Pretrained Embeddings: using TaU to generate uncertainties for several pretrained models. Out-of-distribution AUROC is reported.

Table 3 reports AUROC for OOD detection across a wide survey of outlier datasets. We find that for supervised, SimCLR, and BYOL embeddings, the learned TaU uncertainties are largely able to classify OOD inputs. The exception is COCO, likely due to its close similarity with ImageNet data points. CLIP, however, surprisingly faces the opposite problem, with low AUROC on most datasets but the strongest results on LSUN and COCO. Further work could explore whether CLIP’s behavior is due to differences in objective, architecture, or training.

5 Limitations and Future Work

We presented TaU, a simple method for adding uncertainty into contrastive learning objectives by repurposing temperature as uncertainty. In our experiments, we compared TaU to existing benchmark algorithms and found competitive downstream performance, in addition to TaU uncertainties being useful for out-of-distribution detection. We then demonstrated how uncertainty can be added to already trained model checkpoints, enabling practitioners to reuse computation.

We discuss an important limitation: our approach is restricted to contrastive algorithms built on NCE. Other approaches, such as SimSiam Chen and He (2021), BYOL Grill et al. (2020), and Barlow Twins Zbontar et al. (2021), replace negative examples entirely with stop gradients, and with these we find only limited success for TaU. Future work can also explore TaU-like techniques for detecting corrupted or adversarial examples.

References

  • [1] M. Abbasi, A. Rajabi, C. Gagné, and R. B. Bobba (2020) Toward adversarial robustness by diversity in an ensemble of specialized deep neural networks. In Canadian Conference on Artificial Intelligence, pp. 1–14. Cited by: §3.
  • [2] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, et al. (2021) A review of uncertainty quantification in deep learning: techniques, applications and challenges. Information Fusion. Cited by: §3.
  • [3] A. Agarwala, J. Pennington, Y. Dauphin, and S. Schoenholz (2020) Temperature check: theory and practice for training models with softmax-cross-entropy losses. arXiv preprint arXiv:2010.07344. Cited by: §3.
  • [4] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §2.
  • [5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. Cited by: Appendix A, §1, §2, §2, Table 3, §4.
  • [6] X. Chen, H. Fan, R. Girshick, and K. He (2020) Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297. Cited by: §1, §2.
  • [7] X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758. Cited by: §1, §1, §5.
  • [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §4.
  • [9] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §3, Table 1.
  • [10] J. Gawlikowski, C. R. N. Tassi, M. Ali, J. Lee, M. Humt, J. Feng, A. Kruspe, R. Triebel, P. Jung, R. Roscher, et al. (2021) A survey of uncertainty in deep neural networks. arXiv preprint arXiv:2107.03342. Cited by: §3.
  • [11] E. Goan and C. Fookes (2020) Bayesian neural networks: an introduction and survey. In Case Studies in Applied Bayesian Data Science, pp. 45–87. Cited by: §3.
  • [12] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §1, §1, Table 3, §5.
  • [13] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §3.
  • [14] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 297–304. Cited by: §2.
  • [15] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §1, §2.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Appendix A.
  • [17] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §3, §4.
  • [18] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §1, §2.
  • [19] S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §3, §4.
  • [20] L. Liu, W. Wei, K. Chow, M. Loper, E. Gursoy, S. Truex, and Y. Wu (2019) Deep neural network ensembles against deception: ensemble diversity, accuracy and robustness. In 2019 IEEE 16th international conference on mobile ad hoc and sensor systems (MASS), pp. 274–282. Cited by: §3.
  • [21] I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §1.
  • [22] L. Neumann, A. Zisserman, and A. Vedaldi (2018) Relaxed softmax: efficient confidence auto-calibration for safe pedestrian detection. Cited by: Appendix A, §3.
  • [23] S. J. Oh, K. Murphy, J. Pan, J. Roth, F. Schroff, and A. Gallagher (2018) Modeling uncertainty with hedged instance embedding. arXiv preprint arXiv:1810.00319. Cited by: §3, Table 1, §4.
  • [24] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.
  • [25] B. N. Oreshkin, P. Rodriguez, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. arXiv preprint arXiv:1805.10123. Cited by: §3.
  • [26] B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. Cited by: §2.
  • [27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: §3, Table 3.
  • [28] J. Rajasegaran, S. Khan, M. Hayat, F. S. Khan, and M. Shah (2020) Self-supervised knowledge distillation for few-shot learning. arXiv preprint arXiv:2006.09785. Cited by: §3.
  • [29] C. S. Sastry and S. Oore (2019) Detecting out-of-distribution examples with in-distribution examples and gram matrices. arXiv preprint arXiv:1912.12510. Cited by: §3, §4.
  • [30] A. Tamkin, M. Wu, and N. Goodman (2020) Viewmaker networks: learning views for unsupervised representation learning. arXiv preprint arXiv:2010.07432. Cited by: §1.
  • [31] S. Tao (2019) Deep neural network ensembles. In International Conference on Machine Learning, Optimization, and Data Science, pp. 1–12. Cited by: §3.
  • [32] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pp. 776–794. Cited by: §2.
  • [33] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning?. arXiv preprint arXiv:2005.10243. Cited by: §1, §2.
  • [34] F. Wang and H. Liu (2021) Understanding the behaviour of contrastive loss. arXiv preprint arXiv:2012.09740v2. Cited by: §1, §2.
  • [35] M. Wu, M. Mosse, C. Zhuang, D. Yamins, and N. Goodman (2020) Conditional negative sampling for contrastive learning of visual representations. arXiv preprint arXiv:2010.02037. Cited by: §1.
  • [36] M. Wu, C. Zhuang, M. Mosse, D. Yamins, and N. Goodman (2020) On mutual information in contrastive learning for visual representations. arXiv preprint arXiv:2005.13149. Cited by: §2.
  • [37] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733–3742. Cited by: §2.
  • [38] J. Xie, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy (2020) Delving into inter-image invariance for unsupervised visual representations. arXiv preprint arXiv:2008.11702. Cited by: §1.
  • [39] Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: Appendix A.
  • [40] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230. Cited by: §1, §1, §5.
  • [41] X. Zhang, F. X. Yu, S. Karaman, W. Zhang, and S. Chang (2018) Heated-up softmax embedding. arXiv preprint arXiv:1809.04157. Cited by: §3.
  • [42] C. Zhuang, A. L. Zhai, and D. Yamins (2019) Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6002–6012. Cited by: §2, §4.

Appendix A Training Hyperparameters

Pretraining

For all models, we use a representation dimensionality of 128. We use the LARS optimizer [39] with learning rate 1e-4, weight decay 1e-6, and batch size 128 for 200 epochs, as described in [5]. For baseline models (no uncertainty), we use a fixed temperature $\tau$. For MoCo-v2, we use a queue of negative samples with a momentum of 0.999 for updating the memory queue. We use the same optimizer as described above but with learning rate 1e-3. For CIFAR10, all images are resized to 32x32 pixels; for ImageNet, all images are resized to 256x256 pixels (with 224x224 cropping size). During pretraining, we use random resized crop, color jitter, random grayscale, random gaussian blur, horizontal flipping, and pixel normalization (with ImageNet statistics). During testing, we only center crop and apply pixel normalization. For encoders trained from scratch, we use ResNet18 [16]. We adapted the PyTorch Lightning Bolts implementations of SimCLR and MoCo-v2, found here: https://github.com/PyTorchLightning/lightning-bolts.

For the larger models, we downloaded existing SimCLR (ResNet50) checkpoints trained on ImageNet from https://github.com/google-research/simclr and converted them to PyTorch checkpoints using https://github.com/Separius/SimCLRv2-Pytorch.

Downstream Classification

We freeze encoder parameters, remove the final $L_2$ normalization, and append a 2-layer MLP with a hidden dimension of 128 and a ReLU nonlinearity. For optimization, we use SGD with batch size 256, learning rate 1e-4, weight decay 1e-6, and a cosine learning rate schedule that drops at epochs 60 and 80, with a total of 100 epochs.
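
A sketch of this evaluation head; the feature dimension and class count are illustrative:

import torch

feat_dim, num_classes = 512, 10   # e.g., ResNet18 features and CIFAR10 classes
head = torch.nn.Sequential(
    torch.nn.Linear(feat_dim, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, num_classes),
)
opt = torch.optim.SGD(head.parameters(), lr=1e-4, weight_decay=1e-6)
# Cosine schedule over 100 epochs; the drop epochs quoted above suggest
# the exact schedule may instead be step-wise.
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)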

Optimization Stability

When optimizing the TaU objective, we found optimization instability: $\tau$ would either collapse to 0 or diverge if left unbounded. We found it crucial to employ some tricks for optimization stability. First, we follow Neumann et al. and have our network predict a transformation of $\tau$ instead of $\tau$ directly [22]. This changes the training dynamics but does not change the underlying equation. Second, we bound the prediction to between 0 and 1 using a sigmoid function. Finally, we divide by a fixed constant, which helps initialize the temperature to be in the same range as fixed-temperature models. When using the uncertainty for out-of-distribution detection, we found that using the pre-sigmoid value worked much better than the post-sigmoid value, as the differences between post-sigmoid values became indistinguishable using float32.
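
Putting these tricks together, a sketch of a temperature head is below; the rescaling constant, and reading the sigmoid output as $\tau$ rather than a transformation of it, are simplifying assumptions:

import torch

class TemperatureHead(torch.nn.Module):
    # Predict a raw score, squash it to (0, 1) with a sigmoid, then rescale
    # so that training starts near a typical fixed temperature.
    def __init__(self, dim, rescale=7.0):
        super().__init__()
        self.fc = torch.nn.Linear(dim, 1)
        self.rescale = rescale           # sigmoid(0) / 7.0 ~= 0.07 at init

    def forward(self, h):
        raw = self.fc(h).squeeze(-1)     # pre-sigmoid score
        tau = torch.sigmoid(raw) / self.rescale
        # For OOD detection, rank by 'raw': post-sigmoid values saturate
        # and become indistinguishable in float32, as noted above.
        return tau, raw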