Pre-Training Buys Better Robustness and Uncertainty
Tuning a pre-trained network is commonly thought to improve data efficiency. However, Kaiming He et al. have called the utility of pre-training into question by showing that training from scratch can often yield similar performance, should the model train long enough. We show that although pre-training may not improve performance on traditional classification metrics, it provides large benefits to model robustness and uncertainty. Through extensive experiments on label corruption, class imbalance, adversarial examples, out-of-distribution detection, and confidence calibration, we demonstrate large gains from pre-training and complementary effects with task-specific methods. We show approximately a 30% relative improvement in label noise robustness and a 10% absolute improvement in adversarial robustness on CIFAR-10 and CIFAR-100. In some cases, using pre-training without task-specific methods surpasses the state-of-the-art, highlighting the importance of using pre-training when evaluating future methods on robustness and uncertainty tasks.
Pre-training is a central technique in the research and applications of deep convolutional neural networks (AlexNet). In research settings, pre-training is ubiquitously applied in state-of-the-art object detection and segmentation (maskrcnn). Moreover, some researchers aim to use pre-training to create “universal representations” that transfer to multiple domains (universal). In applications, the “pre-train then tune” paradigm is commonplace, especially when data for a target task is acutely scarce (zeilerpretrain). This broadly applicable technique enables faster model convergence and state-of-the-art performance.
However, hepretrain argue that model convergence is merely faster with pre-training, so that its benefit on modern research datasets is only improved wall-clock time. Surprisingly, they find that pre-training provides no performance benefit over training from scratch on various tasks and architectures, provided the model trains for long enough. Even models trained from scratch on only 10% of the COCO dataset (coco) attain the same performance as pre-trained models. This casts doubt on our understanding of pre-training and raises the important question of whether pre-training has any use beyond tuning for extremely small datasets. They conclude that, for modern research datasets, ImageNet pre-training is not necessary.
In this work, we demonstrate that pre-training is not needless. While hepretrain are correct that models for traditional tasks such as classification perform well without pre-training, pre-training substantially improves the quality of various complementary model components. For example, we show that while accuracy may not noticeably change with pre-training, the model’s adversarial robustness improves tremendously. Furthermore, even though training for longer on clean datasets allows models without pre-training to catch up, training for longer on a corrupted dataset leads to model deterioration. And the claim that “pre-training does not necessarily help reduce overfitting” (hepretrain) is valid when measuring only model accuracy, but it becomes apparent that pre-training does reduce overfitting when also measuring model calibration. We bring clarity to the doubts raised about pre-training by showing that pre-training can improve model robustness to label corruption (Sukhbaatar), class imbalance (japkowicz2000class), and adversarial attacks (adversarial); it additionally improves uncertainty estimates for out-of-distribution detection (hendrycks17baseline) and calibration (oconnor), though not necessarily traditional accuracy metrics.
Pre-training yields improvements so significant that on many robustness and uncertainty tasks we surpass state-of-the-art performance. We even find that pre-training alone improves over techniques devised for a specific task. Note that experiments on these tasks typically overlook pre-training, even though pre-training is ubiquitous elsewhere. This is problematic since we find there are techniques which do not comport well with pre-training; thus some evaluations of robustness are less representative of real-world performance than previously thought. Consequently, researchers for these tasks would do well to adopt the “pre-train then tune” paradigm for greater realism and increased performance.
It is well-known that pre-training improves generalization when the dataset for the target task is extremely small. Prior work on transfer learning has analyzed the properties of this effect, such as when fine-tuning should stop (objectdetectionanalysis) and which layers should be fine-tuned (yosinski). In a series of ablation studies, Huh2016 show that the benefits of pre-training are robust to significant variation in the dataset used for pre-training, including the removal of classes related to the target task. In our work, we observe similar robustness to changes in the dataset used for pre-training.
Pre-training has also been used when the dataset for the target task is large, such as Microsoft COCO (coco) for object detection and segmentation. However, in a recent work hepretrain show that pre-training merely speeds convergence on these tasks, and real gains in performance vanish if one trains from scratch for long enough, even with only 10% of the data for the target task. They conclude that pre-training is not necessary for these tasks. Moreover, Sun_2017_ICCV show that the accuracy gains from more data are exponentially diminishing, severely limiting the utility of pre-training for improving performance metrics for traditional tasks. In contrast, we show that pre-training does markedly improve model robustness and uncertainty.
Learning in the presence of corrupted labels has been well studied. In the context of deep learning, Sukhbaatar investigate using a stochastic matrix encoding the label noise, though they note that this matrix is difficult to estimate. Patrini propose a two-step training procedure to estimate this stochastic matrix and train a corrected classifier. These approaches are extended by hendrycks2018glc, who consider having access to a small dataset of cleanly labeled examples and leverage these trusted data to improve performance.
zhang2018 show that networks overfit to the incorrect labels when trained for too long (Figure 1). This observation suggests pre-training as a potential fix, since one need only fine-tune for a short period to attain good performance. We show that pre-training not only improves performance with no label noise correction, but also complements methods proposed in prior work. Also note that most prior works (goldberger2016training; ma2018dimensionality; han2018co) only experiment with small-scale images since label corruption demonstrations can require training hundreds of models (hendrycks2018glc). Since pre-training is typically reserved for large-scale datasets, such works do not explore the impact of pre-training.
To handle class imbalance, many training strategies have been investigated in the literature. One direction is rebalancing an imbalanced training dataset. To this end, he2008learning propose to remove samples from the majority classes, while huang2016learning replicate samples from the minority classes. Generating synthetic samples through linear interpolation between data samples belonging to the same minority class has been studied in chawla2002smote. An alternative approach is to modify the supervised loss function. Cost-sensitive learning (japkowicz2000class) balances the loss function by re-weighting each sample by the inverse frequency of its class. huang2016learning and dong2018imbalanced demonstrate that enlarging the margin of a classifier helps mitigate the class imbalance problem. However, adopting such training methods often incurs various time and memory costs.
The susceptibility of neural networks to small, adversarially chosen input perturbations has received much attention. Over the years, many methods have been proposed as defenses against adversarial examples (defense; defense2), but these are often circumvented in short order (bypass). In fact, the only defense widely regarded as having stood the test of time is the adversarial training procedure of madry. In this algorithm, white-box adversarial examples are created at each step of training and substituted in place of normal examples. This does provide some amount of adversarial robustness, but it requires substantially longer training times. In a later work, madrydata argue further progress on this problem may require significantly more task-specific data. However, given that data from a different distribution can be beneficial for a given task (Huh2016), it is conceivable that the need for task-specific data could be obviated with pre-training.
Even though deep networks have achieved high accuracy on many classification tasks, measuring the uncertainty in their predictions remains a challenging problem. Obtaining well-calibrated predictive uncertainty could be useful in many machine learning applications such as medicine or autonomous vehicles. Uncertainty estimates also need to be useful for detecting out-of-distribution samples. hendrycks17baseline propose out-of-distribution detection tasks and use the maximum value of a classifier’s softmax distribution as a baseline method. mahal propose Mahalanobis distance-based scores which characterize out-of-distribution samples using hidden features. kimin propose using a GAN (gans) to generate out-of-distribution samples; the network is taught to assign low confidence to these GAN-generated samples. hendrycks2019oe demonstrate that using non-specific, real, and diverse outlier images or text in place of GAN-generated samples allows classifiers and density estimators to improve their out-of-distribution detection performance and calibration. kilian show that contemporary networks can easily become miscalibrated without additional regularization, and we show pre-training can provide useful regularization.
Datasets. For the following robustness experiments, we evaluate on CIFAR-10 and CIFAR-100 (krizhevsky2009learning). Both datasets contain 60,000 color images, split into 50,000 for training and 10,000 for testing; CIFAR-10 and CIFAR-100 have 10 and 100 classes, respectively. For pre-training, we use Downsampled ImageNet (DownsampledImageNet), which is the 1,000-class ImageNet dataset (imagenet) resized to 32×32 resolution. For ablation experiments, we remove 153 CIFAR-10-related classes from the Downsampled ImageNet dataset. In this paper we tune the entire network. Code is available at github.com/hendrycks/pre-training.
| Method | CIFAR-10 Normal Training | CIFAR-10 Pre-Training | CIFAR-100 Normal Training | CIFAR-100 Pre-Training |
|---|---|---|---|---|
| GLC (5% Trusted) | 14.0 | 7.2 | 46.8 | 33.7 |
| GLC (10% Trusted) | 11.5 | 6.4 | 38.9 | 28.4 |
Setup. In the task of classification under label corruption, the goal is to learn as good a classifier as possible on a dataset with corrupted labels. In accordance with prior work (Sukhbaatar), we focus on multi-class classification. Let $x$, $y$, and $\tilde{y}$ be an input, clean label, and potentially corrupted label, respectively. The labels take values from $1$ to $K$. Given a dataset $\mathcal{D}$ of $(x, \tilde{y})$ pairs with $x$ drawn from $p(x)$ and $\tilde{y}$ drawn from $p(\tilde{y} \mid x)$, the task is to predict $\arg\max_y p(y \mid x)$.
To experiment with a variety of corruption severities, we corrupt the true label with a given probability to a uniformly random class. Formally, we generate corrupted labels with a ground truth matrix of corruption probabilities $C$, where $C_{ij}$ is the probability of corrupting an example with label $i$ to label $j$. Given a corruption strength $s$, we construct $C = sU + (1-s)I$, where $U$ is the matrix with uniform entries $1/K$ and $I$ is the identity matrix. To measure performance, we use the area under the curve plotting test error against corruption strength. This is generated via linear interpolation between test errors at corruption strengths from $0$ to $1$ in increments of $0.1$, summarizing a total of 11 experiments.
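As a concrete illustration of this corruption model, the following sketch builds the mixture matrix and resamples labels from it. This is our own minimal reconstruction of the setup described above (function names are ours, and we assume uniform corruption over all K classes), not the paper's released code.

```python
import numpy as np

def corruption_matrix(num_classes, strength):
    """Mixture of uniform corruption and identity: C = s*U + (1 - s)*I.

    C[i, j] is the probability that a true label i is observed as label j.
    """
    uniform = np.full((num_classes, num_classes), 1.0 / num_classes)
    identity = np.eye(num_classes)
    return strength * uniform + (1.0 - strength) * identity

def corrupt_labels(labels, strength, num_classes, rng):
    """Resample each label from the row of C indexed by its true class."""
    C = corruption_matrix(num_classes, strength)
    return np.array([rng.choice(num_classes, p=C[y]) for y in labels])

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)
noisy = corrupt_labels(clean, strength=0.5, num_classes=10, rng=rng)
# With s = 0.5 and K = 10, roughly 45% of labels flip in expectation
# (the uniform component lands back on the true class 1/K of the time).
```

Sweeping `strength` from 0 to 1 in steps of 0.1 then yields the 11 training runs whose test errors define the AUC measure above.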
Methods. We first consider the baseline of training from scratch, denoted Normal Training in Table 1. We also consider state-of-the-art methods for classification under label noise. The Forward method of Patrini uses a two-stage training procedure: the first stage estimates the corruption matrix $C$ describing the expected label noise, and the second stage trains a corrected classifier to predict the clean label distribution. We also consider the Gold Loss Correction (GLC) method of hendrycks2018glc, which assumes access to a small, trusted dataset of cleanly labeled (gold standard) examples, also known as a semi-verified setting (semiverified). This method likewise attempts to estimate $C$. For this method, we specify the “trusted fraction,” which is the fraction of the available training data that is trusted, i.e. known to be cleanly labeled.
In all experiments, we use 40-2 Wide Residual Networks, SGD with Nesterov momentum, and a cosine learning rate schedule (sgdr). The “Normal” experiments train for 100 epochs with dropout, as in wideresnet. The experiments with pre-training train for 10 epochs without dropout, and use a smaller learning rate in the “No Correction” experiment than in the experiments with label noise corrections. We found the latter experiments required a larger learning rate because of variance introduced by the stochastic matrix corrections. Most parameter and architecture choices recur in later sections of this paper. Results are in Table 1.
Analysis. In all experiments, pre-training gives large performance gains over the models trained from scratch. With no correction, we see a 45% relative reduction in the area under the error curve on CIFAR-10 and a 29% reduction on CIFAR-100. These improvements exceed those of the task-specific Forward method. Therefore in the setting without trusted data, pre-training attains new state-of-the-art AUCs of 15.9% and 39.1% on CIFAR-10 and CIFAR-100 respectively. These results are stable, since pre-training on Downsampled ImageNet with CIFAR-10-related classes removed yields a similar AUC on CIFAR-10 of 14.5%. Moreover, we found that these gains could not be bought by simply training for longer. As shown in Figure 1, training for a long time with corrupted labels actually harms performance as the network destructively memorizes the misinformation in the incorrect labels.
We also observe complementary gains of combining pre-training with previously proposed label noise correction methods. In particular, using pre-training together with the GLC on CIFAR-10 at a trusted fraction of 5% cuts the area under the error curve in half. Moreover, using pre-training with the same amount of trusted data provides larger performance boosts than doubling the amount of trusted data, effectively allowing one to reach a target performance level with half as much trusted data. Qualitatively, Figure 2 shows that pre-training softens the performance degradation as the corruption strength increases.
Importantly, although pre-training does have significant additive effects on performance with the Forward Correction method, we find that pre-training with no correction yields superior performance. That is, when evaluating methods with pre-training enabled, the Forward Correction performs worse than the baseline of no correction. This observation is significant, because it implies that future research on this problem should evaluate with pre-trained networks or else researchers may develop methods that are suboptimal.
Total test error rate / minority test error rate (%) under seven degrees of class imbalance (columns ordered from most to least imbalanced).

| Dataset | Method | 1 (most imbalanced) | 2 | 3 | 4 | 5 | 6 | 7 (least imbalanced) |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Normal Training | 23.7 / 26.0 | 21.8 / 26.5 | 21.1 / 25.8 | 20.3 / 24.7 | 20.0 / 24.5 | 18.3 / 23.1 | 15.8 / 20.2 |
| CIFAR-10 | Cost Sensitive | 22.6 / 24.9 | 21.8 / 26.2 | 21.1 / 25.7 | 20.2 / 24.3 | 20.2 / 24.6 | 18.1 / 22.9 | 16.0 / 20.1 |
| CIFAR-10 | Oversampling | 21.0 / 23.1 | 19.4 / 23.6 | 19.0 / 23.2 | 18.2 / 22.2 | 18.3 / 22.4 | 17.3 / 22.2 | 15.3 / 19.8 |
| CIFAR-10 | SMOTE | 19.7 / 21.7 | 19.7 / 24.0 | 19.2 / 23.4 | 19.2 / 23.4 | 18.1 / 22.1 | 17.2 / 22.1 | 15.7 / 20.4 |
| CIFAR-10 | Pre-Training | 8.0 / 8.8 | 7.9 / 9.5 | 7.6 / 9.2 | 8.0 / 9.7 | 7.4 / 9.1 | 7.4 / 9.5 | 7.2 / 9.4 |
| CIFAR-100 | Normal Training | 69.7 / 72.0 | 66.6 / 70.5 | 63.2 / 69.2 | 58.7 / 65.1 | 57.2 / 64.4 | 50.2 / 59.7 | 47.0 / 57.1 |
| CIFAR-100 | Cost Sensitive | 67.6 / 70.6 | 66.5 / 70.4 | 62.2 / 68.1 | 60.5 / 66.9 | 57.1 / 64.0 | 50.6 / 59.6 | 46.5 / 56.7 |
| CIFAR-100 | Oversampling | 62.4 / 66.2 | 59.7 / 63.8 | 59.2 / 65.5 | 55.3 / 61.7 | 54.6 / 62.2 | 49.4 / 59.0 | 46.6 / 56.9 |
| CIFAR-100 | SMOTE | 57.4 / 61.0 | 56.2 / 60.3 | 54.4 / 60.2 | 52.8 / 59.7 | 51.3 / 58.4 | 48.5 / 57.9 | 45.8 / 56.3 |
| CIFAR-100 | Pre-Training | 37.8 / 41.8 | 36.9 / 41.3 | 36.2 / 41.7 | 36.4 / 42.3 | 34.9 / 41.5 | 34.0 / 41.9 | 33.5 / 42.2 |
In most real-world classification problems, some classes are more abundant than others, which naturally results in class imbalance (van2018inaturalist). Unfortunately, deep networks tend to model prevalent classes at the expense of minority classes. This need not be the case. Deep networks are capable of learning both the prevalent and minority classes, but to accomplish this, task-specific approaches have been necessary. In this section, we show that pre-training can also be useful for handling such imbalanced scenarios better than approaches specifically created for this task (japkowicz2000class; chawla2002smote; huang2016learning; dong2018imbalanced).
Setup. Similar to dong2018imbalanced, we simulate class imbalance with a power law model. Specifically, the number of training samples for each class follows a power law governed by an imbalance ratio together with the largest and smallest class sizes, so that there are fewer samples for classes with greater class indices. The training data approaches a power law class distribution as the imbalance ratio decreases. We test 7 different degrees of imbalance by varying these parameters for CIFAR-10 and CIFAR-100. A class is defined as a minority class if its size is smaller than the average class size. For evaluation, we measure the average test set error rates over all classes and the error rates on minority classes.
Methods. The class imbalance baseline methods are as follows. Normal Training is the conventional approach of training from scratch with cross-entropy loss. Oversampling (japkowicz2000class) is a re-sampling method that builds a balanced training set before learning by augmenting the samples of minority classes with random replication. SMOTE (chawla2002smote) is an oversampling method that generates synthetic samples by linearly interpolating between neighbors. Cost Sensitive (huang2016learning) introduces additional weights in the loss function for each class, proportional to inverse class frequency.
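The two simplest baselines above can be sketched in a few lines. The helpers below are our own illustration (names and the normalization choice are ours): inverse-frequency weights for cost-sensitive learning, and index replication for oversampling.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes):
    """Per-class loss weights proportional to inverse class frequency,
    normalized so the weights average to 1 and the loss scale is unchanged."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = 1.0 / np.maximum(counts, 1.0)
    return weights * num_classes / weights.sum()

def oversample_indices(labels, rng):
    """Random-replication oversampling: resample every class (with
    replacement where needed) up to the largest class size."""
    labels = np.asarray(labels)
    target = np.bincount(labels).max()
    out = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        out.append(rng.choice(idx, size=target, replace=True))
    return np.concatenate(out)
```

A framework's weighted cross-entropy loss would consume the per-class weights, while the oversampled indices define a rebalanced training set; both incur the extra time and memory costs noted earlier.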
Here we use 40-2 Wide Residual Networks, SGD with Nesterov momentum, and a cosine learning rate schedule. The experiments with pre-training train for 50 epochs without dropout, while the baseline experiments train for 100 epochs with dropout.
Analysis. Table 2 shows that pre-training alone significantly improves test set error rates compared to task-specific methods, which can incur expensive costs in additional training time and memory. We remark that much of the gain from pre-training comes from low test error rates on minority classes (i.e., those with greater class indices), as shown in Figure 3. Furthermore, if we tune a network on CIFAR-10 that is pre-trained on Downsampled ImageNet with CIFAR-10-related classes removed, the total error rate increases by only 2.1% compared to pre-training on all classes. By contrast, the difference between pre-training and SMOTE is 12.6%. This implies that pre-training is indeed useful for improving robustness against class imbalance.
| Method | CIFAR-10 Clean | CIFAR-10 Adversarial | CIFAR-100 Clean | CIFAR-100 Adversarial |
|---|---|---|---|---|
| Adv. Pre-Training and Tuning | 87.1 | 57.4 | 59.2 | 33.5 |
Setup. Deep networks are notably unstable and less robust than the human visual system (geirhos; hendrycks2019robustness). For example, a network may produce a correct prediction for a clean image, but should the image be perturbed carefully, its verdict may change entirely (adversarial). This has led researchers to defend networks against “adversarial” noise with a small norm, so that networks correctly generalize to images with a worst-case perturbation applied.
Nearly all adversarial defenses have been broken (bypass), and adversarial robustness for large-scale image classifiers remains elusive (alpbroken). The exception is that adversarial training has been partially successful for defending small-scale image classifiers against perturbations. After approximately 1.5 years, the adversarial training procedure of madry remains the state-of-the-art defense. Following their work and using their adversarial training procedure, we experiment with CIFAR images and assume the adversary can corrupt images with perturbations of ℓ∞ norm at most 8/255. The learning rate anneals following a cosine schedule. We adversarially train the model against a 10-step adversary for 100 epochs and test against 20-step untargeted adversaries. Unless otherwise specified, we use 28-10 Wide Residual Networks, as adversarially trained high-capacity networks exhibit greater adversarial robustness (kurakin; madry).
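To make the attack in this procedure concrete, here is a toy sketch of untargeted ℓ∞ PGD against a binary logistic model in place of a deep network. Everything here, including the function name, is our illustrative stand-in (the real attack backpropagates through the full network), but it mirrors the structure of madry's method: random start in the ε-ball, signed-gradient ascent on the loss, projection after each step.

```python
import numpy as np

def pgd_linf(x, y, w, b, eps, step, num_steps, rng):
    """Untargeted L-infinity PGD on a logistic model p = sigmoid(w.x + b)."""
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)   # random start
    for _ in range(num_steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))     # model confidence
        grad = (p - y) * w                             # d(cross-entropy)/dx
        x_adv = x_adv + step * np.sign(grad)           # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)       # project to eps-ball
    return x_adv
```

Adversarial training then substitutes `x_adv` for `x` at every minibatch step, which is exactly why it costs several forward-backward passes per update compared to standard training.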
Analysis. One might reasonably expect that pre-training would not improve adversarial robustness. First, nearly all adversarial defenses fail, and even some adversarial training methods can fail (alpbroken). Current adversarial defenses also result in networks with large generalization gaps, even when the train and test distributions are similar. For instance, CIFAR-10 Wide ResNets are made so wide that their adversarial train accuracies reach 100%, yet their adversarial test accuracies are only 45.8%. madrydata speculate that a significant increase in task-specific data is necessary to close this gap, which swells under slight changes to the problem setup (l1madry). We attempt to reduce this gap and make pre-trained representations transfer across data distributions, but doing so requires an unconventional choice. Using targeted adversaries or no adversaries during pre-training does not provide substantial robustness. Instead, we choose to adversarially pre-train a Downsampled ImageNet model against an untargeted adversary, contra kurakin; alp; adverkaiming.
We find that an adversarially pre-trained network can surpass the long-standing state-of-the-art model by a significant margin. By pre-training a Downsampled ImageNet classifier against an untargeted adversary, then adversarially fine-tuning on CIFAR-10 or CIFAR-100 for 5 epochs, we obtain networks that improve adversarial robustness by 11.6% and 9.2% in absolute accuracy, respectively.
As in the other tasks we consider, a Downsampled ImageNet model with CIFAR-10-related classes removed sees similar robustness gains. As a quick check, we pre-trained and tuned two 40-2 Wide ResNets, one pre-trained typically and one pre-trained with CIFAR-10-related classes excluded from Downsampled ImageNet. We observed only a 1.04% decrease in adversarial accuracy compared to the typically pre-trained model, which demonstrates that the pre-trained models do not rely on seeing CIFAR-10-related images, and that simply training on more natural images increases adversarial robustness. Notice that in Table 3 the clean accuracy is approximately the same while the adversarial accuracy is far larger. This indicates again that pre-training may have a limited effect on accuracy for traditional tasks, but it has a strong effect on robustness.
It is even the case that the pre-trained representations can transfer to a new task without adversarially tuning the entire network. In fact, if we adversarially tune only the last affine classification layer, and no other parameters, we obtain adversarial accuracies of 46.6% on CIFAR-10 and 26.1% on CIFAR-100. Thus adversarially tuning only the last affine layer also surpasses the previous state-of-the-art adversarial accuracy. This further demonstrates that adversarial features can robustly transfer across data distributions. In addition to robustness gains, adversarial pre-training could save much wall-clock time, since pre-training speeds up convergence; compared to typical training routines, adversarial training prohibitively requires many times the usual amount of training time. By surpassing the previous state-of-the-art, we have shown that pre-training enhances adversarial robustness.
To demonstrate that pre-training improves model uncertainty estimates, we use the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets (tiny_imagenet). We did not use Tiny ImageNet in the robustness section because adversarial training is not known to work on images of this size, and using Tiny ImageNet is computationally prohibitive for the label corruption experiments. Tiny ImageNet consists of 200 ImageNet classes at 64×64 resolution, so we use a 64×64 version of Downsampled ImageNet for pre-training. We also remove the 200 overlapping Tiny ImageNet classes from Downsampled ImageNet for all experiments on Tiny ImageNet.
In all experiments, we use 40-2 Wide ResNets trained using SGD with Nesterov momentum and a cosine learning rate schedule. Pre-trained networks train on Downsampled ImageNet for 100 epochs and are fine-tuned for 10 epochs on CIFAR and 20 epochs on Tiny ImageNet, without dropout. Baseline networks train from scratch for 100 epochs with dropout. When performing temperature tuning in Section 4.2, we hold out 10% of the training data to estimate the optimal temperature.
Setup. In the problem of out-of-distribution detection (hendrycks17baseline; hendrycks2019oe; kimin; mahal; pacanomaly), models are tasked with assigning anomaly scores to indicate whether a sample is in- or out-of-distribution. hendrycks17baseline show that the discriminative features learned by a classifier are well-suited for this task. They use the maximum softmax probability for each sample as a way to rank in- and out-of-distribution (OOD) samples. OOD samples tend to have lower maximum softmax probabilities. Improving over this baseline is a difficult challenge without assuming knowledge of the test distribution of anomalies (oodnotestknowledge). Without assuming such knowledge, we use the maximum softmax probabilities to score anomalies and show that models which are pre-trained then tuned provide superior anomaly scores.
To measure the quality of out-of-distribution detection, we employ two standard metrics. The first is the AUROC, or the Area Under the Receiver Operating Characteristic curve. This is the probability that an OOD example is assigned a higher anomaly score than an in-distribution example; thus a higher AUROC is better. A similar measure is the AUPR, or the Area Under the Precision-Recall curve; as before, a higher AUPR is better. For in-distribution data we use the test dataset. For out-of-distribution data we use the various anomalous distributions from hendrycks2019oe, including Gaussian noise, textures, Places365 scene images (zhou2017places), etc. None of the OOD datasets contain samples from Downsampled ImageNet. Further evaluation details are in the Supplementary Materials.
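Both the maximum softmax probability score and the AUROC it is evaluated with are short to state in code. The following is our own minimal sketch (a pairwise rank formulation of the AUROC; real evaluations would typically call sklearn.metrics.roc_auc_score instead).

```python
import numpy as np

def msp_anomaly_score(logits):
    """Negative maximum softmax probability: higher means more anomalous."""
    z = logits - logits.max(axis=1, keepdims=True)           # stabilize
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)     # softmax
    return -p.max(axis=1)

def auroc(in_scores, out_scores):
    """P(random OOD score > random in-distribution score), ties counted
    as 1/2 -- the Mann-Whitney U formulation of the AUROC."""
    in_s = np.asarray(in_scores, dtype=float)[:, None]
    out_s = np.asarray(out_scores, dtype=float)[None, :]
    return (out_s > in_s).mean() + 0.5 * (out_s == in_s).mean()
```

A perfect detector scores 1.0 and a chance-level detector 0.5; the pairwise form above is O(nm) and is only meant to make the definition concrete.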
Analysis. By using pre-training, both the AUROC and AUPR consistently improve over the baseline, as shown in Table 4. Note that results are an average of the AUROC and AUPR values from detecting samples from various OOD datasets. Observe that with pre-training, CIFAR-100 OOD detection significantly improves. Consequently pre-training can directly improve uncertainty estimates.
Setup. A central component of uncertainty estimation in classification problems is confidence calibration. From a classification system that produces probabilistic confidence estimates of its predictions being correct, we would like trustworthy estimates. That is, when a classifier predicts a class with eighty percent confidence, we would like it to be correct eighty percent of the time. oconnor; hendrycks17baseline found that deep neural network classifiers display severe overconfidence in their predictions, and that the problem becomes worse with increased representational capacity (kilian). Integrating uncalibrated classifiers into decision-making processes could result in egregious assessments, motivating the task of confidence calibration.
To measure the calibration of a classifier, we adopt two measures from the literature. The Root Mean Square Calibration Error (RMS) is the square root of the expected squared difference between the classifier’s confidence and its accuracy at said confidence level, $\sqrt{\mathbb{E}_C\left[(\mathbb{P}(Y=\hat{Y} \mid C=c) - c)^2\right]}$. The Mean Absolute Value Calibration Error (MAD) uses the expected absolute difference rather than the squared difference between the same quantities. The MAD Calibration Error has the same form as the Expected Calibration Error used by kilian, but it employs adaptive binning of confidences for improved estimation. In our experiments, we use a bin size of 100. We refer the reader to hendrycks2019oe for further details on these measures.
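These two errors with equal-count (adaptive) bins can be computed as follows. This is our own sketch of the estimator described above, with names of our choosing: sort predictions by confidence, bin 100 at a time, and compare each bin's mean confidence to its empirical accuracy.

```python
import numpy as np

def calibration_errors(confidences, correct, bin_size=100):
    """RMS and MAD calibration error with adaptive (equal-count) binning."""
    order = np.argsort(confidences)
    conf = np.asarray(confidences, dtype=float)[order]
    corr = np.asarray(correct, dtype=float)[order]
    sq_err, abs_err, total = 0.0, 0.0, len(conf)
    for start in range(0, total, bin_size):
        c = conf[start:start + bin_size]
        a = corr[start:start + bin_size]
        gap = c.mean() - a.mean()          # confidence minus accuracy
        sq_err += len(c) * gap ** 2
        abs_err += len(c) * abs(gap)
    return np.sqrt(sq_err / total), abs_err / total
```

A perfectly calibrated classifier yields zero for both measures; a classifier that is always fully confident but only half right yields 0.5 for both, matching the intuition that its stated confidence is off by fifty points.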
Analysis. Figure 4 shows that as the classifier trains for longer periods of time, the network becomes overconfident and less calibrated, yet pre-training enables fast convergence and can side-step this problem. In all experiments, we observe large improvements in calibration from using pre-training. In Figure 5 and Table 5, we can see that RMS Calibration Error is at least halved on all datasets through the use of pre-training, with CIFAR-100 seeing the largest improvement. The same is true of the MAD error. In fact, the MAD error on CIFAR-100 is reduced by a factor of 4.1 with pre-training, which can be interpreted as the stated confidence being four times closer to the true frequency of occurrence.
We find that these calibration gains are complementary with the temperature tuning method of kilian, which further reduces RMS Calibration Error from 4.15 to 3.55 for Tiny ImageNet when combined with pre-training. However, temperature tuning is computationally expensive and requires additional data, whereas pre-training does not require collecting extra data and can naturally and directly make the model more calibrated.
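For reference, temperature tuning amounts to choosing a single scalar T that minimizes negative log-likelihood on held-out data. The sketch below is our own (a grid search standing in for the LBFGS fit usually used with kilian's method).

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of temperature-scaled softmax outputs."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)                    # stabilize
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def tune_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Grid-search the temperature minimizing held-out NLL; T > 1 softens
    an overconfident model, T < 1 sharpens an underconfident one."""
    return min(grid, key=lambda T: nll(logits, labels, T))
```

This makes the contrast above concrete: temperature tuning needs a held-out split and an extra fitting step, whereas pre-training improves calibration with no additional machinery.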
Although hepretrain assert that pre-training does not improve performance on traditional tasks, for other tasks this is not so. On robustness and uncertainty tasks, pre-training results in models that surpass the previous state-of-the-art. For uncertainty tasks, we find pre-trained representations directly translate to improvements in predictive uncertainty estimates. hepretrain argue that both pre-training and training from scratch result in models of similar accuracy, but we show this only holds for unperturbed data. In fact, pre-training with an untargeted adversary surpasses the long-standing state-of-the-art in adversarial accuracy by a significant margin. Robustness to label corruption is similarly improved by wide margins, such that pre-training alone outperforms certain task-specific methods, sometimes even after combining these methods with pre-training. This suggests future work on model robustness should evaluate proposed methods with pre-training in order to correctly gauge their utility, and some work could specialize pre-training for these downstream tasks. In sum, the benefits of pre-training extend beyond merely quick convergence, as previously thought, since pre-training can improve model robustness and uncertainty.