Self-Supervised Pretraining Improves Self-Supervised Pretraining

03/23/2021 ∙ by Colorado J. Reed, et al.

While self-supervised pretraining has proven beneficial for many computer vision tasks, it requires expensive and lengthy computation and large amounts of data, and it is sensitive to data augmentation. Prior work demonstrates that models pretrained on datasets dissimilar to their target data, such as chest X-ray models trained on ImageNet, underperform models trained from scratch. Users who lack the resources to pretrain must use existing models with lower performance. This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model. Through experimentation on 16 diverse vision datasets, we show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data. Taken together, HPT provides a simple framework for obtaining better pretrained representations with fewer computational resources.




1 Introduction

Recently, self-supervised pretraining – an unsupervised pretraining method that self-labels data to learn salient feature representations – has outperformed supervised pretraining in an increasing number of computer vision applications [6, 8, 5]. These advances come from instance contrastive learning, where a model is trained to identify visually augmented images that originated from the same image from a set [15, 64]. Typically, self-supervised pretraining uses unlabeled source data to pretrain a network that will be transferred to a supervised training process on a target dataset. Self-supervised pretraining is particularly useful when labeling is costly, such as in medical and satellite imaging [56, 9].

Figure 1: Methods of using self-supervision. The top row shows the two common prior approaches to using self-supervised (SS) pretraining. In Generalist Pretraining, a large, general, base dataset is used for pretraining, e.g. ImageNet. In Specialist Pretraining, a large, specialized source dataset is collected and used for pretraining, e.g. aerial images. In this paper, we explore Hierarchical Pre-Training (HPT), which sequentially pretrains on datasets that are similar to the target data, thus providing the improved performance of specialist pretraining while leveraging existing generalist models.

However, self-supervised pretraining requires long training time on large datasets, e.g. SimCLR [6] showed improved performance out to epochs on ImageNet’s 1.2 million images [54]. In addition, instance contrastive learning is sensitive to the choice of data augmentation policies and many trials are often required to determine good augmentations [51, 65].

The computational intensity and sensitivity of self-supervised pretraining may lead researchers to seek self-supervised models from model zoos and research repositories. However, models pretrained on domain-specific datasets are not commonly available. In turn, many practitioners do not use a model pretrained on data similar to their target data, but instead use a publicly available model pretrained on a large, general dataset, such as ImageNet. We refer to this process as generalist pretraining. A growing body of research indicates that pretraining on domain-specific datasets, which we refer to as specialist pretraining, leads to improved transfer performance [49, 38, 42].

Figure 1 formalizes this categorization of self-supervised pretraining methods. Generalist and specialist pretraining are as described above, with one round of self-supervised pretraining on a domain-general and domain-specific dataset, respectively. Hierarchical Pretraining refers to models pretrained on datasets that are progressively more similar to the target data. HPT first pretrains on a domain-general dataset (referred to as the base pretrain), then optionally pretrains on domain-specific datasets (referred to as the source pretrain), before finally pretraining on the target dataset (referred to as the target pretrain). In all cases, pretraining is followed by supervised finetuning on the target task.

Specialist pretraining presents the same core challenge that transfer learning helps alleviate: a sensitive training process that requires large datasets and significant computational resources [30]. While transfer learning has been carefully investigated in supervised and semi-supervised settings for computer vision [58], it has not been formally studied for self-supervised pretraining itself. Furthermore, several recent papers that apply self-supervised learning to domain-specific problems did not apply transfer learning to the pretraining process itself, which motivated our work [2, 60, 1, 32].

In this paper, we investigate the HPT framework with a diverse set of pretraining procedures and downstream tasks. We test 16 datasets spanning visual domains, such as medical, aerial, driving, and simulated images. In our empirical study, we observe that HPT shows the following benefits compared to self-supervised pretraining from scratch:

  • HPT reduces self-supervised pretraining convergence time by up to 80x.

  • HPT consistently converges to better performing representations than generalist or specialist pretraining for 15 of the 16 studied datasets on image classification, object detection, and semantic segmentation tasks.

  • HPT is significantly more resilient to the set of image augmentations and amount of data used during self-supervised pretraining.

In the following sections, we discuss the relevant background for our investigation, formalize our experimental settings, present the results and ablations, and include a discussion of the results and their implications and impact on future work. Based on the presented analyses, we provide a set of guidelines for practitioners to successfully apply self-supervised pretraining to new datasets and downstream applications. Finally, in the appendix, we provide many additional experiments that generalize our results to include supervised pretraining models. In summary, across datasets, metrics, and methods, self-supervised pretraining improves self-supervised pretraining.

2 Background and Related Work

Transfer learning studies how a larger, more general, or more specialized source dataset can be leveraged to improve performance on target downstream datasets/tasks [50, 47, 3, 11, 25, 22, 14, 16, 71, 19, 34, 48]. This paper focuses on a common type of transfer learning in which model weights trained on source data are used to initialize training on the target task [69]. Model performance generally scales with source dataset size and the similarity between the source and target data [49, 38, 42].

A fundamental challenge for transfer learning is to improve the performance on target data when it is not similar to source data. Many papers have tried to increase performance when the target and source datasets are not similar. Recently, [46] proposed first training on the base dataset and then training with subsets of the base dataset to create specialist models, and finally using the target data to select the best specialist model. Similarly, [40] used target data to reweight the importance of base data. Unlike these works, we do not revisit the base data, modify the pretrained architecture, or require expert model selection or a reweighting strategy.

Self-supervised pretraining is a form of unsupervised training that captures the intrinsic patterns and properties of the data without using human-provided labels to learn discriminative representations for the downstream tasks [12, 13, 73, 18, 62]. In this work we focus on a type of self-supervised pretraining called instance contrastive learning [15, 64, 22], which trains a network by determining which visually augmented images originated from the same image, when contrasted with augmented images originating from different images. Instance contrastive learning has recently outperformed supervised pretraining on a variety of transfer tasks [22, 7], which has led to increased adoption in many applications. Specifically, we use the MoCo algorithm [8] due to its popularity, available code base, reproducible results without multi-TPU core systems, and similarity to other self-supervised algorithms [33]. We also explore additional self-supervised methods in the appendix.

Our focus is on self-supervised learning for vision tasks. Progressive self-supervised pretraining on multiple datasets has also been explored for NLP tasks, e.g. see [21, 45] and the citations within. In [21], the authors compare NLP generalist models and NLP models trained on additional source and task-specific data. While our work is similar in spirit to the language work of [21], our work focuses on computer vision, includes a greater variation of pretraining pipelines and methods, and allows for adaptation with fewer parameter updates.

Label-efficient learning includes weak supervision methods [37] that assume access to imperfect but related labels, and semi-supervised methods that assume labels are only available for a subset of available examples [7, 29, 66]. While some of the evaluations of the learned representations are done in a semi-supervised manner, HPT is complementary to these approaches and the representations learned from HPT can be used in conjunction with them.

3 Hierarchical pretraining

HPT sequentially performs a small amount of self-supervised pretraining on data that is increasingly similar to the target dataset. In this section, we formalize each of the HPT components as depicted in Figure 1.

Base pretraining: We use the term base pretraining to describe the initial pretraining step where a large, general vision dataset (base dataset) is used to pretrain a model from scratch. Practically, few users will need to perform base pretraining, and instead, can use publicly available pretrained models, such as ImageNet models. Because base pretraining, like many prior transfer learning approaches, is domain agnostic, most practitioners will select the highest performing model on a task with a large domain [28].

Source pretraining: Given a base trained model, we select a source dataset that is both larger than the target dataset and more similar to the target dataset than the base dataset. Many existing works have explored techniques to select a model or dataset that is ideal for transfer learning with a target task [52]. Here, we adopt an approach studied by [30, 52] in a supervised context called a task-aware search strategy: each potential source dataset is used to perform self-supervised pretraining on top of the base model for a very short amount of pretraining, e.g. 5k pretraining steps as discussed in Section 4. The supervised target data is then used to train a linear evaluator on the frozen pretrained source model. The source model is then taken to be the model that produces the highest linear evaluation score on the target data, and is then used for additional target pretraining.
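The task-aware search described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `select_source` and the `linear_eval_score` callback are hypothetical names, the callback stands in for the short pretraining run plus linear evaluation, and the scores are made-up placeholders rather than results from the paper.

```python
from typing import Callable, Iterable


def select_source(
    candidates: Iterable[str],
    linear_eval_score: Callable[[str], float],
) -> str:
    """Return the candidate whose briefly pretrained model scores highest.

    `linear_eval_score(name)` is a hypothetical callback standing in for:
    (1) a short self-supervised pretraining run (e.g. 5k steps) on `name`,
        initialized from the base model, followed by
    (2) a linear evaluation of the frozen model on the labeled target data.
    """
    scores = {name: linear_eval_score(name) for name in candidates}
    return max(scores, key=scores.get)


# Made-up scores for illustration only; they are not results from the paper.
toy_scores = {"clipart": 0.61, "painting": 0.58, "quickdraw": 0.42}
best = select_source(toy_scores, toy_scores.__getitem__)
print(best)  # clipart
```

Because every candidate only receives a short pretraining run, the cost of the search grows linearly with the number of candidate source datasets.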

Experimentally, we have found that using a single, similar, and relatively large (e.g.  images) source dataset consistently improves representations for the target task. Furthermore, we view source pretraining as an optional step, and as shown in Section 4, HPT still leads to improved results when directly performing self-supervised pretraining on the target dataset following the base pretraining. We further discuss source model selection in the appendix.

Target pretraining: Finally, we perform self-supervised pretraining with the target dataset, initialized with the final source model, or the base model in the case when no source model was used. This is also the stage where layers of the model can be frozen to prevent overfitting to the target data and enable faster convergence speed. Experimentally, we have found that freezing all parameters except the modulation parameters of the batch norm layers leads to consistently strong performance for downstream tasks, particularly when the target dataset is relatively small ( images).

Supervised finetune: Given the self-supervised pretrained model on the target dataset, we transfer the final model to the downstream target task, e.g. image classification or object detection.

Figure 2: Linear separability evaluation. For each of the 16 datasets, we train a generalist model for 800 epochs on ImageNet (Base). We either train the whole model from 50-50k iters (HPT Base-Target) or just the batch norm parameters for 5k iters (HPT Base-Target (BN)). We compare HPT to a Specialist model trained from a random initialization (Target). For each, we train a linear layer on top of the final representation. HPT obtains the best results on 15 out of 16 datasets without hyperparameter tuning.

4 Experiments

Through the following experiments, we investigate the quality, convergence, and robustness of self-supervised pretraining using the HPT framework.

4.1 Datasets

We explored self-supervised pretraining on the following datasets that span several visual domains (see the appendix for all details). Dataset splits are listed with a train/val/test format in square brackets after the dataset description.

Aerial: xView [31] is a 36-class object-centric, multi-label aerial imagery dataset []. RESISC [9] is a 45-class scene classification dataset for remote sensing []. UC-Merced [68] is a 21-class aerial imagery dataset [].

Autonomous Driving: BDD [70] is a high resolution driving dataset with object detection labels and weather classification labels. We evaluate HPT performance on the object detection task, as well as on the weather classification task []. VIPER [53] is a 23-class simulated driving dataset for which we perform multi-label classification of each object in the image [].

Medical: Chexpert [26] is a large, multi-label X-ray dataset, where we determine whether each image has any of 5 conditions []. Chest-X-ray-kids [27] provides pediatric X-rays used for 4-way pneumonia classification [].

Natural, Multi-object: COCO-2014 [35] is an 81-class object detection benchmark. We perform multi-label classification for each object, and we further use the split to perform object detection and segmentation []. Pascal VOC 2007+2012 [17] is a standard 21-class object detection benchmark we use for multi-label classification to predict whether each object is in each image. We also use the object detection labels for an object detection transfer task [].

Assorted: DomainNet [44] contains six distinct datasets, where each contains the same 345 categories. The domains consist of real images similar to ImageNet, sketch images of greyscale sketches, painting images, clipart images, quickdraw images of binary black-and-white drawings from internet users, and infograph illustrations. We use the original train/test splits with 20% of the training data used for validation. Oxford Flowers [41]: we use the standard split to classify 102 fine-grained flower categories [].



Figure 3: Semi-supervised evaluation. We compared the best semi-supervised finetuning performance from the (B)ase model, (T)arget pretrained model, HPT pretrained model, and HPT-BN pretrained model using a 1k labeled subset of each dataset. Despite performing 10x-80x less pretraining, HPT consistently outperformed the Base and Target. HPT-BN generally showed improvement over Base model transfer, but did not surpass HPT’s performance.

4.2 Evaluations

The features of self-supervised pretrained models are typically evaluated using one of the following criteria:

  • Separability: Tests whether a linear model can distinguish different classes in a dataset using learned features. Good representations should be linearly separable [43, 10].

  • Transferability: Tests the performance of the model when finetuned on new datasets and tasks. Better representations will generalize to more downstream datasets and tasks [22].

  • Semi-supervised: Tests performance with limited labels. Better representations will suffer less performance degradation [25, 6].

We explored these evaluation methods with each of the above datasets. For all evaluations, unless otherwise noted, we used a single, centered crop of the test data with no test-time augmentations. For classification tasks, we used top-1 accuracy and for multi-label classification tasks we used the Area Under the ROC (AUROC) [4].
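For reference, the two metrics named above can be written out directly. The following is a small, dependency-free sketch (the function names are ours, not from the paper's code): AUROC is computed for a single binary label using the rank formulation, i.e. the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. How per-label AUROC values are aggregated for multi-label datasets is not specified here, so averaging them would be an assumption.

```python
def top1_accuracy(predictions, targets):
    """Fraction of examples whose predicted class equals the target class."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def auroc(labels, scores):
    """AUROC for one binary label via pairwise comparison of scores.

    Counts, over all (positive, negative) pairs, how often the positive
    example has the higher score; ties contribute one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(top1_accuracy([0, 2, 1, 1], [0, 2, 2, 1]))      # 0.75
print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))      # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a sorted-rank implementation would be preferable at dataset scale.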

In our experiments, we used MoCo-V2 [8] as the self-supervised training algorithm. We selected MoCo-V2 as it has state-of-the-art or comparable performance for many transfer tasks, and because it uses the InfoNCE loss function [43], which is at the core of many recent contrastive pretraining algorithms [36]. Unless otherwise noted, all training is performed with a standard ResNet-50 backbone [57] on 4 GPUs, using default training parameters from [22]. We also explored additional self-supervised pretraining algorithms and hyperparameters in the appendix.
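To make the InfoNCE objective concrete, here is a minimal single-query sketch in plain Python. It is an illustration rather than the MoCo-V2 implementation: it assumes L2-normalized embeddings and dot-product similarity, and the temperature value of 0.2 is an illustrative assumption. The loss is the negative log-probability of picking the positive key out of the set containing the positive and all negative keys.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.2):
    """InfoNCE loss for one query: -log softmax probability of the positive key.

    The temperature default is an assumption made for this sketch.
    """
    q = l2_normalize(query)
    # Index 0 holds the positive logit; the rest are negatives.
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]


# The positive matches the query and the negative is orthogonal: low loss.
easy = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# The roles are swapped, so the "positive" looks nothing like the query: high loss.
hard = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(easy < hard)  # True
```

In MoCo-style training the negatives come from a queue of keys encoded by a momentum encoder; that machinery is omitted here.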

In the following experiments, we compare implementations of the following self-supervised pretraining strategies:

  • Base: transfers the 800-epoch MoCo-V2 ImageNet model from [8] and also updates the batch norm’s non-trainable mean and variance parameters using the target dataset (this uniformly led to slightly improved performance for Base transfer).

  • Target: performs MoCo-V2 on the target dataset from scratch.

  • HPT: initializes MoCo-V2 pretraining with the 800-epoch MoCo-V2 ImageNet model from [8], then optionally performs pretraining on a source dataset before pretraining on the target dataset. The batch norm variant (HPT-BN) only trains the batch norm parameters (), e.g. a ResNet-50 has 25.6M parameters, where only are BN parameters.
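To give a sense of how few parameters the batch norm variant trains, the sketch below counts the parameters of a standard ResNet-50 in plain Python. It assumes the usual bottleneck layout (stages of 3, 4, 6, and 3 blocks, bias-free convolutions, and a 1000-class linear head, as in common reference implementations); the function name is ours. Only the trainable scale and shift of each batch norm layer are counted as BN parameters, since the running mean and variance are not trained by gradient descent.

```python
def resnet50_param_counts(num_classes=1000):
    """Return (total, batch_norm) trainable parameter counts for ResNet-50."""
    total = 0
    bn = 0

    def conv(c_in, c_out, kernel):
        nonlocal total
        total += c_in * c_out * kernel * kernel  # convolutions carry no bias

    def batch_norm(channels):
        nonlocal total, bn
        total += 2 * channels  # per-channel scale and shift
        bn += 2 * channels

    # Stem: 7x7 convolution followed by batch norm.
    conv(3, 64, 7)
    batch_norm(64)

    c_in = 64
    for planes, blocks in [(64, 3), (128, 4), (256, 6), (512, 3)]:
        for block in range(blocks):
            conv(c_in, planes, 1)
            batch_norm(planes)
            conv(planes, planes, 3)
            batch_norm(planes)
            conv(planes, 4 * planes, 1)
            batch_norm(4 * planes)
            if block == 0:
                # Projection shortcut on the first block of each stage.
                conv(c_in, 4 * planes, 1)
                batch_norm(4 * planes)
            c_in = 4 * planes

    total += c_in * num_classes + num_classes  # final linear layer with bias
    return total, bn


total, bn = resnet50_param_counts()
print(total, bn, f"{100 * bn / total:.2f}%")  # 25557032 53120 0.21%
```

Under these assumptions the total matches the 25.6M figure quoted above, and the batch norm scale and shift parameters make up roughly 0.2% of the model.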

Existing work largely relies on supervised evaluations to tune the pretraining hyperparameters [6], but in practice, it is not possible to use supervised evaluations of unlabeled data to tune the hyperparameters. Therefore, to emphasize the practicality of HPT, we used the default pretraining hyperparameters from [8] with a batch size of 256 (see the appendix for full details).

4.3 Pretraining Quality Analysis

Figure 4: Full finetuning evaluations. Finetuning performance on target datasets. For these datasets, we evaluated the performance increase on the target dataset by pretraining on sequences of (B)ase (ImageNet), (S)ource (left) dataset, and (T)arget (right) dataset. All HPT variants beat all baselines in all cases, with HPT-BN getting slightly better performance on UC Merced and B+S+T having the best performance elsewhere.

Separability analysis: We first analyzed the quality of the learned representations through a linear separability evaluation [6]. We trained the linear model with a batch size of 512 and the highest performing learning rate of . Similar to [30], we used steps rather than epochs to allow for direct computational comparison across datasets. For Target pretraining, we pretrained for {5k, 50k, 100k, 200k, 400k} steps, where we only performed 400k steps if there was an improvement between 100k and 200k steps. For reference, one NVIDIA P100 GPU-Day is 25k steps. We pretrained HPT for much shorter schedules of {50, 500, 5k, 50k} steps, and HPT-BN for 5k steps – we observed little change in performance for HPT-BN after 5k steps.
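Since schedules are reported in steps, it can help to convert them into compute and epochs. The sketch below uses only two figures from the text (25k steps per NVIDIA P100 GPU-day and the pretraining batch size of 256); the helper names and the 10,000-image dataset in the example are hypothetical.

```python
STEPS_PER_P100_GPU_DAY = 25_000  # reference figure quoted in the text
BATCH_SIZE = 256                 # pretraining batch size from Section 4.2


def gpu_days(steps):
    """Approximate P100 GPU-days needed for a given number of steps."""
    return steps / STEPS_PER_P100_GPU_DAY


def epochs(steps, dataset_size, batch_size=BATCH_SIZE):
    """Number of passes over a dataset implied by a step count."""
    return steps * batch_size / dataset_size


print(gpu_days(400_000))       # 16.0 (longest Target schedule)
print(gpu_days(5_000))         # 0.2  (typical HPT schedule)
print(400_000 // 5_000)        # 80   (ratio between the two schedules)
print(epochs(5_000, 10_000))   # 128.0 epochs on a hypothetical 10k-image dataset
```

The last line shows why a fixed step budget corresponds to many epochs on small datasets and few on large ones.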

Key observations: From Figure 2, we observe that HPT typically converges by 5k steps of pretraining regardless of the target dataset size, and that for 15 out of 16 datasets, HPT and HPT-BN converged to models that performed as well as or better than the Base transfer or Target pretraining at 400k steps (80x longer). The only dataset on which Target pretraining outperformed HPT was quickdraw – a large, binary image dataset of crowd-sourced drawings. We note that quickdraw is the only dataset for which Target pretraining at 5k steps outperformed directly transferring the Base model, indicating that the direct transfer performance from ImageNet is quite poor due to a large domain gap – an observation further supported by its relatively poor domain adaptation in [44].

HPT improved performance on RESISC, VIPER, BDD, Flowers, xView, and clipart, infograph, and sketch: a diverse range of image domains and types. HPT had similar performance as Base transfer for the datasets that were most similar to ImageNet: real, COCO-2014, and Pascal, as well as for UC-Merced, which had 98.2% accuracy for Base transfer and 99.0% accuracy for HPT and HPT-BN. The two medical datasets, Chexpert and Chest-X-ray-kids had comparable performance with HPT and Target pretraining, yet HPT reached equivalent performance in 5k steps compared to 200k and 100k, respectively. Finally, HPT exhibited overfitting characteristics after 5k steps, where the overfitting was more pronounced on the smaller datasets (UC-Merced, Flowers, Chest-X-ray-kids, Pascal), leading us to recommend a very short HPT pretraining schedule, e.g. 5k iterations, regardless of dataset size. We further investigate these overfitting characteristics in the appendix.

Semi-supervised transferability: Next, we conducted a semi-supervised transferability evaluation of the pretrained models. This experiment tested whether the benefit from the additional pretraining is nullified when finetuning all model parameters. Specifically, we selected the top performing models from the linear analysis for each pretraining strategy and fully finetuned the pretrained models using 1000 randomly selected labels without class balance but such that each class occurred at least once. We finetuned using a combination of two learning rates (0.01, 0.001) and two finetuning schedules (2500 steps, 90 epochs) with a batch size of 512 and report the top result for each dataset and model – see the appendix for all details.
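One plausible way to draw such a labeled subset, and to enumerate the four finetuning configurations, is sketched below. The exact sampling procedure is not given here, so this is an assumption: it first picks one random example per class to guarantee coverage, then fills the rest uniformly at random without balancing. It also assumes one class label per example; `sample_labeled_subset` is a hypothetical helper name.

```python
import random
from itertools import product


def sample_labeled_subset(labels, n, seed=0):
    """Pick n example indices at random so every class appears at least once."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    if n < len(by_class) or n > len(labels):
        raise ValueError("n must be between the class count and dataset size")
    # One random example per class guarantees coverage.
    chosen = {rng.choice(indices) for indices in by_class.values()}
    # Fill the remainder uniformly at random, with no class balancing.
    remaining = [i for i in range(len(labels)) if i not in chosen]
    chosen.update(rng.sample(remaining, n - len(chosen)))
    return sorted(chosen)


# Toy labels: 5,000 examples over 10 classes (illustrative only).
toy_labels = [i % 10 for i in range(5000)]
subset = sample_labeled_subset(toy_labels, 1000)
print(len(subset), len({toy_labels[i] for i in subset}))  # 1000 10

# The finetuning grid: two learning rates crossed with two schedules.
grid = list(product((0.01, 0.001), ("2500 steps", "90 epochs")))
print(len(grid))  # 4
```

The top result over the four grid entries is what would be reported for each dataset and model.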

Key observations: Figure 3 shows the top finetuning performance for each pretraining strategy. The striped bars show the HPT pretraining variants, and we observe that similar to the linear analysis, HPT has the best performing pretrained models on 15 out of 16 datasets, with quickdraw being the exception. One key observation from this experiment is that HPT is beneficial in semi-supervised settings and that the representations learned by HPT and the Base model differ enough that full model finetuning cannot account for the change. We further note that while HPT-BN outperformed HPT in several linear analyses, HPT-BN never outperformed HPT when finetuning all parameters. This result indicates that some of the benefit from pretraining only the batch norm parameters is redundant with supervised finetuning. We also note that whether Base or Target pretraining performed better is highly dependent on the dataset, while HPT had uniformly strong performance.


Figure 5: Augmentation robustness. We compare the accuracy change of sequentially removing data augmentation policies (Grayscale, ColorJitter, RandomHorizontalFlip, GaussianBlur) on linear evaluation performance. HPT performs better with only cropping than any other policy does with any incomplete combination.

Sequential pretraining transferability: Here, we explore HPT’s performance when pretraining on a source dataset before pretraining on the target dataset and finally transferring to the target task. We examined three diverse target datasets: Chest-X-ray-kids, sketch, and UC-Merced. For each target dataset, we selected the source dataset that yielded the highest linear evaluation accuracy on the target dataset after 5k pretraining steps on top of the base model. This selection yielded the following HPT instantiations: ImageNet then Chexpert then Chest-X-ray-kids, ImageNet then clipart then sketch, and ImageNet then RESISC then UC-Merced.

Key observations: Figure 4 compares finetuning the 1000-label subset of the target data after the following pretraining strategies: directly using the Base model (B), Target pretraining (T), Base then Source pretraining (B+S), Base then Target pretraining (B+T), Base then Source pretraining then Target pretraining (B+S+T), and Base then Source pretraining then Target pretraining on the batch norm parameters (B+S+T-BN). The full HPT pipeline (B+S+T) leads to the top results on all three target datasets. In the appendix, we further show that the impact of an intermediate source model decreases with the size of the target dataset.

Object detection and segmentation transferability: For Pascal and BDD, we transferred HPT pretrained models to a Faster R-CNN R50-C4 model and finetuned the full model; for COCO, we used a Mask-RCNN-C4. Over three runs, we report the median results using the COCO AP metric as well as AP50/AP75. For Pascal, we performed finetuning on the train2007+2012 set and performed evaluation on the test2007 set. For BDD we used the provided train/test split, with 10k random images in the train split used for validation. For COCO, we used the 2017 splits and trained with the 1x schedule (see appendix for all details).

Key observations: Tables 1-2 show the object detection and segmentation results. For Pascal, we tested HPT instantiations of Base-Target, Base-Target (BN), and Base-Source-Target, where COCO-2014 was selected as the source model using the top-linear-analysis selection criteria. For the larger BDD and COCO datasets, we tested Base-Target and Base-Target (BN). Overall, the results are consistent across all datasets for image classification, object detection, and segmentation: with HPT, both Base-Target and Base-Target (BN) lead to improvements over directly transferring the Base model to the target task.

Dataset        Pretraining strategy             AP     AP50   AP75
Pascal VOC07   Target                           48.4   75.9   51.9
Pascal VOC07   Base                             57.0   82.5   63.6
Pascal VOC07   HPT: Base-Target                 57.1   82.7   63.7
Pascal VOC07   HPT: Base-Target (BN)            57.5   82.8   64.0
Pascal VOC07   HPT: Base-Source-Target          57.5   82.7   64.4
Pascal VOC07   HPT: Base-Source-Target (BN)     57.6   82.9   64.2
BDD            Target                           24.3   46.9   24.0
BDD            Base                             27.1   48.7   25.4
BDD            HPT: Base-Target                 28.1   50.0   26.3
BDD            HPT: Base-Target (BN)            28.0   49.6   26.3

Table 1: Transfer Result: This table reports the median AP, AP50, and AP75 over three runs of finetuning a Faster-RCNN C4 detector. For Pascal, the Source dataset is COCO-2014. A bold result indicates a improvement over all other pretraining strategies.

The Base-Source-Target Pascal results show an improvement when pretraining all model parameters, but remain consistent when only pretraining the batch norm parameters. This indicates that while the batch norm parameters can find a better pretraining model, sequentially pretraining from the source to the target on these values does not always yield an improved result. Across datasets, the overall gains are relatively modest, but we view these results as an indication that HPT is not directly learning redundant information with either the MoCo pretraining on ImageNet or the finetuning task on the target dataset. Furthermore, it is surprising that only tuning the batch norm parameters on the target dataset leads to an improvement in object detection. From this result, we note that pretraining specific subsets of object detector backbone parameters may provide a promising direction for future work.

Target           36.0   54.7   38.6   19.3   40.6   49.1
Base             38.0   57.4   41.3   20.7   43.3   51.4
HPT: B-T         38.4   58.0   41.3   21.6   43.5   52.2
HPT: B-T (BN)    38.2   57.4   40.9   20.6   43.4   52.2

Table 2: Transfer Result: This table reports the median AP, , over three runs of finetuning a Mask-RCNN-C4 detector on COCO-2017. A bold result indicates at least a improvement over all other pretraining strategies.

Figure 6: HPT performance as the amount of pretraining data decreases. We show the linear evaluation performance as the amount of pretraining data varies. Top axis is the number of images, and the bottom is the percentage of pretraining data. HPT outperforms Base model transfer or Target pretraining with limited data. Notably, HPT-BN consistently outperforms Target pretraining with only 1% of images.

4.4 HPT Robustness

Here, we investigate the robustness of HPT to common factors that impact the effectiveness of self-supervised pretraining such as the augmentation policy [6, 51] and pretraining dataset size [39]. For these robustness experiments, we used the BDD, RESISC, and Chexpert datasets as they provided a diversity in data domain and size. We measured separability, with the same hyperparameters as in §4.2.

Augmentation robustness: MoCo-V2 sequentially applies the following image augmentations: RandomResizedCrop, ColorJitter, Grayscale, GaussianBlur, RandomHorizontalFlip. We studied the robustness of HPT by systematically removing these augmentations and evaluating the change in the linear evaluation for HPT and Target pretraining.
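The ablation can be thought of as a sequence of progressively smaller policies. The sketch below builds that sequence from augmentation names alone, using the MoCo-V2 list above and the removal order given in the Figure 5 caption. Reading the removal as cumulative, ending at crop-only, is our interpretation, and `ablation_policies` is a hypothetical helper.

```python
FULL_POLICY = [
    "RandomResizedCrop",
    "ColorJitter",
    "Grayscale",
    "GaussianBlur",
    "RandomHorizontalFlip",
]

# Removal order as listed in the Figure 5 caption.
REMOVAL_ORDER = ["Grayscale", "ColorJitter", "RandomHorizontalFlip", "GaussianBlur"]


def ablation_policies(full_policy, removal_order):
    """Return the full policy followed by each cumulatively reduced policy."""
    policies = [list(full_policy)]
    current = list(full_policy)
    for name in removal_order:
        current = [aug for aug in current if aug != name]
        policies.append(current)
    return policies


policies = ablation_policies(FULL_POLICY, REMOVAL_ORDER)
for policy in policies:
    print(policy)
# The final policy printed is ['RandomResizedCrop'], i.e. only cropping.
```

Each policy in the list would then be used for a separate pretraining run, followed by the same linear evaluation.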

Key observations: Figure 5 shows separability results across datasets after sequentially removing augmentations. In all three data domains, HPT maintained strong performance compared to Target pretraining. Unlike BDD and RESISC, the Chexpert performance decreased as the augmentation policy changed. This illustrates that changes to the augmentation policy can still impact performance when using HPT, but that the overall performance is more robust. In turn, as a practitioner explores a new data domain or application, they can either use default augmentations directly or choose a conservative set, e.g. only cropping.

Pretraining data robustness: We pretrained with {1%, 10%, 25%, 100%} of the target dataset. For HPT we used 5k pretraining steps. For other methods with 25% or 100% of the data, we used the same number of steps as the top performing result in Figure 2. With 1% or 10% of the data, we use of the steps.

Key observations: Figure 6 shows separability results. Chexpert has 3x more training data than BDD, which in turn has 3x more training data than RESISC. While more data always performed better, the accuracy improvements of HPT increased as the amount of pretraining data decreased. HPT-BN, while not achieving as high performance as HPT in all cases, had minimal accuracy degradation in low data regimes. It consistently outperformed other methods with 5k samples.

4.5 Domain Adaptation Case Study

In this section, we explore the utility of HPT through a realistic case study experiment in which we apply HPT in a domain adaptation context. Specifically, in this experiment, the goal was to perform image classification on an unseen target domain given a labeled set of data in the source domain. We assume the target labels are scarcely provided, with as few as 1 and as many as 68 per class (see Table 3). We use a modern semi-supervised domain adaptation method called Minimax Entropy (MME) [55], which consists of a feature encoder backbone, followed by a cosine similarity based classification layer that computes the features’ similarity with respect to a set of prototypes estimated for each class. Adaptation is achieved by adversarially maximizing the conditional entropy of the unlabeled target data with respect to the classifier and minimizing it with respect to the feature encoder.
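The two ingredients named above, a cosine-similarity classifier over class prototypes and the entropy of its predictions, can be sketched in plain Python. This is an illustration of those pieces only, not the MME training loop: the adversarial min-max update is omitted, and the temperature of 0.05 and all function names are assumptions made for this example.

```python
import math


def cosine_logits(feature, prototypes, temperature=0.05):
    """Cosine similarity of a feature to each class prototype, over a temperature."""

    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    f = unit(feature)
    return [
        sum(a * b for a, b in zip(f, unit(p))) / temperature
        for p in prototypes
    ]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


prototypes = [[1.0, 0.0], [0.0, 1.0]]  # one toy prototype per class
confident = entropy(softmax(cosine_logits([1.0, 0.0], prototypes)))
ambiguous = entropy(softmax(cosine_logits([1.0, 1.0], prototypes)))
print(confident < ambiguous)  # True
```

In the adversarial scheme described above, the classifier would be updated to increase this entropy on unlabeled target features while the feature encoder is updated to decrease it.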

The training procedure is as follows: we performed HPT to train a model using both source and target datasets on top of the standard MSRA ImageNet model [24]. We used this model to initialize the feature encoder in MME. At the end of each budget level we evaluated accuracy on the entire test set from the target domain. We perform two experiments on DomainNet datasets [44] with 345 classes in 7 budget levels with increasing amount of target labels: (i) from real to clip and (ii) from real to sketch. We use EfficientNet_B2 [59] as the backbone architecture.

Table 3 shows our results for both domain adaptation experiments using MME with and without HPT. HPT consistently outperforms the baseline on both domains, achieving higher accuracy at every budget level. In the extreme low-data regime of one shot per class in the target domain, HPT achieves nearly 8% better accuracy in both the clipart and sketch domains. This gap shrinks to about 2% as we increase the number of labeled target samples to 68 shots per class, which is equivalent to 23,603 samples. These results demonstrate the effectiveness of HPT when applied as a single component in a realistic, end-to-end inference system.

| Budget levels in target domains | | | | | | | |
|---|---|---|---|---|---|---|---|
| # of shots per class | 1 | 11 | 16 | 22 | 32 | 46 | 68 |
| # of samples | 345 | 3795 | 5470 | 7883 | 11362 | 16376 | 23603 |
| Test accuracy (%) for real→clip | | | | | | | |
| MME | 49.74 | 61.11 | 63.87 | 66.68 | 68.01 | 69.99 | 71.09 |
| MME+HPT | 57.15 | 64.36 | 66.67 | 68.20 | 69.66 | 71.47 | 72.35 |
| Test accuracy (%) for real→sketch | | | | | | | |
| MME | 41.35 | 51.78 | 54.90 | 57.51 | 59.70 | 61.36 | 62.45 |
| MME+HPT | 50.17 | 56.43 | 58.77 | 60.72 | 62.80 | 63.91 | 64.90 |

Table 3: Budget levels and test accuracy in target domain for semi-supervised domain adaptation at 7 budget levels using MME with and without HPT between real→clip and real→sketch. At the single shot/class budget level, HPT achieves nearly 8% better accuracy in both clipart and sketch domains. This gap shrinks to 2% as we increase the number of labeled target samples to 68 shots per class which is equivalent to 23,603 samples.
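As a quick arithmetic check on the gaps quoted for Table 3, the short sketch below recomputes the per-budget accuracy difference between MME+HPT and MME. The accuracy values are copied from Table 3; the variable and function names are illustrative choices, not part of the original work.

```python
# Accuracy values copied from Table 3; list positions follow the shots-per-class order.
shots = [1, 11, 16, 22, 32, 46, 68]
results = {
    "real->clip": {
        "MME": [49.74, 61.11, 63.87, 66.68, 68.01, 69.99, 71.09],
        "MME+HPT": [57.15, 64.36, 66.67, 68.20, 69.66, 71.47, 72.35],
    },
    "real->sketch": {
        "MME": [41.35, 51.78, 54.90, 57.51, 59.70, 61.36, 62.45],
        "MME+HPT": [50.17, 56.43, 58.77, 60.72, 62.80, 63.91, 64.90],
    },
}

def accuracy_gaps(table):
    """Return the MME+HPT minus MME gap (percentage points) per budget level."""
    return {
        task: [round(hpt - base, 2) for base, hpt in zip(rows["MME"], rows["MME+HPT"])]
        for task, rows in table.items()
    }

gaps = accuracy_gaps(results)
for task, values in gaps.items():
    for k, gap in zip(shots, values):
        print(f"{task}: {k} shots/class -> +{gap:.2f} points")
```

Averaged over the two tasks, the gap is roughly 8 points at one shot per class and roughly 2 points at 68 shots per class, which matches the summary in the text.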

5 Discussion

We have shown that HPT achieves faster convergence, improved performance, and increased robustness, and that these results hold across data domains. Here, we further reflect on the utility of HPT.

What is novel about HPT? The methodology underlying HPT is well established in transfer learning. That is, transfer learning tends to work in many situations, and our work could be perceived as a natural extension of this general observation. However, our work provides the first thorough empirical analysis of transfer learning applied to self-supervised pretraining in computer vision. We hope this analysis encourages practitioners to include an HPT baseline in their investigations – a baseline that is surprisingly absent from current works.

How should I use HPT in practice? We provide our code, documentation, and models to use HPT and reproduce our results (code and pretrained models are available in our online code repository). For existing codebases, using HPT is usually as simple as downloading an existing model and updating a configuration. If working with a smaller dataset (e.g. k images), our analysis indicates that using HPT-BN is ideal.

Does this work for supervised learning? Yes. In the appendix, we reproduce many of these analyses using supervised ImageNet base models and show that HPT further improves performance across datasets and tasks.

6 Conclusion and Implications

Our work provides the first empirical analysis of transfer learning applied to self-supervised pretraining for computer vision tasks. In our experiments, we observed that HPT resulted in up to 80x faster convergence, improved accuracy, and increased robustness for the pretraining process. These results hold across data domains, including aerial, medical, autonomous driving, and simulation. Critically, HPT requires less data and fewer computational resources than prior methods, enabling wider adoption of self-supervised pretraining for real-world applications. Pragmatically, our results are easy to implement and use: we achieved strong results without optimizing hyperparameters or augmentation policies for each dataset. Taken together, HPT is a simple framework that improves self-supervised pretraining while decreasing resource requirements.

Funding Acknowledgements

Prof. Darrell’s group was supported in part by DoD, NSF as well as BAIR and BDD at Berkeley, and Prof. Keutzer’s group was supported in part by Alibaba, Amazon, Google, Facebook, Intel, and Samsung as well as BAIR and BDD at Berkeley.


7 Implementation details

Table 4 lists the parameters used in the various training stages of the HPT pipeline. When possible, we followed existing settings from [8]. For the finetuning parameter sweeps, we followed a similar setting as the “lightweight sweep” setting from [72]. We performed pretraining with the train and val splits. For evaluation, we used only the train split for training the evaluation task and then used the val split evaluation performance to select the top hyperparameter, training schedule, and evaluation point during the training. We then reported the performance on the test split evaluated with the best settings found with the val split.

For the linear analysis and finetuning experiments, we used RandomResizedCrop to 224 pixels and RandomHorizontalFlip augmentations (for more on these augmentations, see [8]) during training. During evaluation, we resized the long edge of the image to 256 pixels and used a center crop on the image. All images were normalized by their individual dataset’s channel-wise mean and variance. For classification tasks, we used top-1 accuracy and for multi-label classification tasks we used the Area Under the ROC (AUROC) [4].

For the 1000-label semi-supervised finetuning experiments, we randomly selected 1000 examples from the training set to use for end-to-end finetuning of all layers, where each class occurred at least once, but the classes were not balanced. Similar to [72], we used the original validation and test splits to improve evaluation consistency.
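The exact sampling procedure is not spelled out, so the following is only one plausible way to draw a 1000-example subset in which every class appears at least once without balancing the classes. The function name, the seed handling, and the two-stage strategy (one guaranteed example per class, then a uniform fill) are assumptions made for illustration.

```python
import random
from collections import defaultdict

def sample_with_class_coverage(labels, num_examples=1000, seed=0):
    """Pick `num_examples` indices so that each class occurs at least once.

    Stage 1 draws one random example per class; stage 2 fills the rest
    uniformly at random from the remaining examples, so the classes are
    covered but not balanced.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    if not len(by_class) <= num_examples <= len(labels):
        raise ValueError("num_examples must cover every class and fit the dataset")

    chosen = {rng.choice(indices) for indices in by_class.values()}
    remaining = [i for i in range(len(labels)) if i not in chosen]
    chosen.update(rng.sample(remaining, num_examples - len(chosen)))
    return sorted(chosen)

# Hypothetical label list: 5000 training examples spread over 45 classes.
toy_labels = [i % 45 for i in range(5000)]
subset = sample_with_class_coverage(toy_labels)
print(len(subset), len({toy_labels[i] for i in subset}))
```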

For all object detection experiments, we used the R50-C4 available in Detectron2 [63], where following [22], the backbone ends at conv4 and the box prediction head consists of conv5 using global pooling followed by an additional batchnorm layer. For PASCAL object detection experiments, we used the train 2007+2012 split for training and the val2012 split for evaluation. We used 24K training steps with a batch size of 16 and all hyperparameters the same as [22]. For BDD, we used the 70K BDD train split for training and 10K val split for evaluation. We used 90K training steps with a batch size of 8 on 4 GPUs. For CoCo object detection and segmentation, we used the 2017 splits, with the 112 epochs training schedule with a training batch size of 8 images over 180K iterations on 4 GPUs with half of the default learning rate (note: many results in the literature, e.g. [22], use a batch size of 16 images over 90K iterations on 8 GPUs with the full default learning rate, which leads to slightly improved results of 0.1-0.5 AP). For semantic segmentation, we used Mask-RCNN [23] with a C4 backbone setting as in [22].

| Parameter | MoCo-V2 Value | Linear Value | Finetune Value |
|---|---|---|---|
| batch size | 256 | 512 | 256 |
| num gpus | 4 | 4 | 4 |
| lr | 0.03 | {0.3, 3, 30} | {0.001, 0.01} |
| schedule | cosine | 10x decay at | 10x decay at |
| optimizer | SGD | SGD | SGD |
| optimizer momentum | 0.9 | 0.9 | 0.9 |
| weight decay | 1e-4 | 0.0 | 0.0 |
| duration | 800 epochs | 5000 steps | {2500 steps, 90 epochs} |
| moco-dim | 128 | - | - |
| moco-k | 65536 | - | - |
| moco-m | 0.999 | - | - |
| moco-t | 0.2 | - | - |

Table 4: This table provides the parameters that were used for pretraining, linear, and finetuning analyses carried out in this paper (unless otherwise noted). Multiple values in curly braces indicate that all combinations of values were tested, i.e. in order to find an appropriate evaluation setting. 10x decay at and corresponds to decaying the learning rate by a factor of 10 after and of training steps have occurred, respectively.
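As an illustration of how the curly-brace entries in Table 4 expand into individual runs, the sketch below enumerates every combination for the linear and finetuning sweeps and shows a common half-cosine learning-rate schedule for the pretraining stage. The helper names are hypothetical, and the specific cosine formula (decaying to zero with no warmup) is an assumption, since Table 4 only states "cosine".

```python
import itertools
import math

# Sweep values copied from Table 4.
linear_grid = {"lr": [0.3, 3, 30]}
finetune_grid = {"lr": [0.001, 0.01], "duration": ["2500 steps", "90 epochs"]}

def expand_grid(grid):
    """Return one config dict per combination of the listed values."""
    keys = list(grid)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[key] for key in keys))
    ]

def cosine_lr(step, total_steps, base_lr=0.03):
    """Half-cosine decay from base_lr to 0 (assumed form of the cosine schedule)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

for config in expand_grid(finetune_grid):
    print(config)
print(cosine_lr(0, 5000), cosine_lr(2500, 5000), cosine_lr(5000, 5000))
```

Under this reading, the linear evaluation sweep has 3 runs per dataset and the finetuning sweep has 4.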

8 Datasets

Table 5 lists the datasets used throughout our experiments. For all evaluations, unless otherwise noted, we used top-1 accuracy for the single classification datasets and used the Area Under the ROC (AUROC) [4] for multi-label classification tasks.

| Dataset | Train/Validation/Test Size | Labels | Classification Type |
|---|---|---|---|
| BDD [70] | 60K/10K/10K | 6 classes | singular |
| Chest-X-ray-kids [27] | 4186/1046/624 | 4 classes | singular |
| Chexpert [26] | 178.7K/44.6K/234 | 5 classes | multi-class |
| Coco-2014 [35] | 82.7K/20.2K/20.2K | 80 classes | multi-class |
| DomainNet [44]: Clipart | 27.2K/6.8K/14.8K | 345 classes | singular |
| DomainNet [44]: Infograph | 29.6K/7.4K/16.1K | 345 classes | singular |
| DomainNet [44]: Painting | 42.2K/10.5K/22.8K | 345 classes | singular |
| DomainNet [44]: Quickdraw | 96.6K/24.1K/51.7K | 345 classes | singular |
| DomainNet [44]: Real | 98K/24.5K/52.7K | 345 classes | singular |
| DomainNet [44]: Sketch | 39.2K/9.8K/21.2K | 345 classes | singular |
| RESISC [9] | 18.9K/6.3K/6.3K | 45 classes | singular |
| VIPER [53] | 13.3K/2.8K/4.9K | 5 classes | singular |
| UC Merced [68] | 1.2K/420/420 | 21 classes | singular |
| Pascal VOC [17] | 13.2K/3.3K/4.9K | 20 classes | multi-class |
| Flowers [41] | 1K/1K/6.1K | 103 classes | singular |
| xView [31] | 39K/2.8K/2.8K | 36 classes | multi-class |

Table 5: Dataset descriptions. Sizes are listed as train/val/test splits for each dataset.

9 Additional Experiments and Ablations

9.1 HPT Learning Rate

We investigated the choice of the learning rate used during HPT and its effect on linear evaluation performance (see Table 6). Specifically, we tested initial learning rates of {0.1, 0.03, 0.001} on three datasets: RESISC, BDD, and Chexpert. Table 6 shows that the default learning rate of 0.03 for batch size 256 from [8] outperformed the other configurations. Based on this experiment, we used the default 0.03 learning rate for all HPT pretraining runs.

| Dataset | Learning rate 0.1 | Learning rate 0.03 | Learning rate 0.001 |
|---|---|---|---|
| RESISC | 92.5 | 93.7 | 91.6 |
| BDD | 81.9 | 83.2 | 82.4 |
| Chexpert | 79.5 | 85.8 | 83.9 |

Table 6: Linear evaluation performance with HPT pretraining learning rates of {0.1, 0.03, 0.001}. Based on this experiment, we continued to use the default 0.03 learning rate from [8].

9.2 HPT with Supervised Base Model

We explored using a supervised ImageNet base model from [24] instead of the self-supervised MoCo model from [8]. Similar to the experiments shown in Figure 2 in the main paper, Figure 7 shows the same results using a supervised base ImageNet model. We observe similar behavior as with the self-supervised base model: HPT with the supervised base model tends to lead to improved results compared to directly transferring the base model or pretraining entirely on the target data. Unlike the self-supervised base model, HPT with a supervised base model often shows improved performance after 5K iterations, e.g. at 50K iterations (RESISC, BDD, DomainNet Sketch, DomainNet Quickdraw, xView, Chexpert, CoCo-2014), indicating that the supervised base model needs longer training to obtain comparable linear evaluation results. Also unlike the self-supervised base model, these results show clearly better Target model results for BDD and Chexpert, and a larger gap with DomainNet Quickdraw. This indicates that the supervised pretraining is less beneficial as a base model when the domain gap is large – an observation further supported by the experiments in [67].

Figure 8 shows results analogous to the finetuning experiments displayed in Figure 4 in the main paper: we investigated the finetuning performance on the same set of target datasets, except using a supervised ImageNet base model [24]. Overall, HPT again leads to improved performance over finetuning on the Target pretrained models for all three datasets. Different from the self-supervised base model used in Figure 4, all results for the DomainNet framework are considerably worse, and incorporating training on the source dataset (DomainNet Clipart) does not demonstrate an improvement in this case. Overall, however, HPT with the supervised base model leads to improved finetuning performance with and without the source training step.

Figure 7: Linear eval: For each of the 16 datasets, we use a supervised ImageNet Base model [24]. We train the HPT framework for 50-50k iterations (HPT Base(sup)-Target). We compare it to a model trained from a random initialization (Target) trained from 5K-400K iterations. For each, we train a linear layer on top of the final representation. With a supervised base model, HPT obtains as good or better results on 13/16 datasets without hyperparameter tuning.
Figure 8: Finetuning performance on target datasets with supervised pretrainings. Analogous to the finetuning experiments displayed in Figure 4, we investigated the finetuning performance on the same set of target datasets, except using a supervised ImageNet base model [24] and supervised (S)ource pretraining for B+S+T. Overall, HPT again leads to improved performance over finetuning on the Target pretrained models for all three datasets. Different from the self-supervised base model used in Figure 4, all results for the DomainNet framework are considerably worse, and incorporating training on the source dataset (DomainNet Clipart) does not demonstrate an improvement in this case. Overall, however, HPT with the supervised base model leads to improved finetuning performance with and without the source training step.

Augmentation robustness: We studied the augmentation robustness of HPT when using a supervised base model (HPT-sup). We followed the same experimental procedure described in Section 4.4. Figure 9 shows the results using HPT-sup, while for comparison, Figure 5 shows the results with HPT. Both HPT-sup and HPT are robust to the set of augmentations used while pretraining on RESISC. However, HPT-sup exhibits more variation with respect to the augmentations used during pretraining on BDD and Chexpert. The supervised model was trained with only cropping and flipping augmentations, while the self-supervised pretraining took place with all augmentations in the shown policy. This contrast indicates that the augmentation robustness of further pretraining on the target dataset is inherited from the augmentations used to pretrain the base model: if these augmentations are not present in base pretraining, the HPT framework loses its augmentation policy robustness.

Figure 9: Supervised base model augmentation robustness. Here, we further studied the augmentation robustness of HPT when using a supervised base model (HPT-sup). We followed the same experimental procedure described in Section 4.4. As discussed in the text, these results show that HPT-sup exhibits more variation with respect to the augmentations used during pretraining on BDD and Chexpert. As the supervised model was trained with only cropping and flipping augmentations, this indicates that the robustness from the base augmentations used in the self-supervised pretraining remains when performing further pretraining on the target dataset.

9.3 Basetrain Robustness

Figure 10: These figures show the linear evaluation performance of models pretrained on ImageNet for various numbers of epochs (in blue) and with additional HPT training (in red). The best baseline model pretrained only on target data for at least the equivalent of 20 ImageNet epochs and 5K HPT steps is shown as the orange dotted line.

We explored how the linear analysis evaluation changed with varying the amount of self-supervised pretraining on the base model, e.g. the initial ImageNet model. We tested base models trained for 20, 200, and 800 epochs, where the 200 and 800 epoch models were downloaded from the research repository from [8], and the 20 epoch model was created using their provided code and exact training settings. For each base model, we performed further pretraining on the target dataset for 5000 steps.

Figure 10 shows the results for the RESISC, BDD, Chexpert, Chest-X-ray-kids, and DomainNet Quickdraw datasets. We note several characteristics of these plots: the 200 and 800 epoch base models performed comparably across all datasets except Chest-X-ray-kids, which displayed a drop in performance at 200 epochs, indicating that the extra self-supervised pretraining needed to obtain state-of-the-art linear ImageNet classification performance is typically not necessary for strong HPT performance. Surprisingly, BDD, Quickdraw, and Chexpert show similar or improved performance at 20 epochs of basetraining. This indicates that even a relatively small amount of self-supervised pretraining at the base level improves transfer performance. Furthermore, as mentioned in §4, the Quickdraw dataset has a large domain gap with ImageNet, and indeed, we observe that directly transferring ImageNet models with less base training leads to improved results on Quickdraw, but HPT maintains consistent performance regardless of the amount of basetraining.

The computation needed for 20 epochs of basetraining + 5000 iterations of pretraining on the target dataset is approximately equal to 100,000 iterations of pretraining on only the target dataset. For all datasets except Chest-X-ray-kids, HPT at 20 epochs of basetraining exceeded the best Target-only pretraining performance, which was k iterations for all datasets. Indeed, for RESISC, the HPT results at 20 epochs are worse than 200 and 800 epochs, but they still exceed the best Target-only pretraining results (the dashed, orange line).
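The "approximately 100,000 iterations" figure can be sanity-checked with a short calculation. This sketch assumes the ImageNet-1k training set size of 1,281,167 images (not stated in this section), the MoCo-V2 batch size of 256 from Table 4, that incomplete final batches are dropped, and that a step costs the same on ImageNet as on the target dataset.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # assumption: standard ImageNet-1k train split size
BATCH_SIZE = 256                   # MoCo-V2 pretraining batch size from Table 4
HPT_TARGET_STEPS = 5000            # target pretraining steps used for HPT

def steps_for_epochs(epochs, dataset_size=IMAGENET_TRAIN_IMAGES, batch_size=BATCH_SIZE):
    """Number of optimizer steps for `epochs` passes, dropping incomplete batches."""
    return epochs * (dataset_size // batch_size)

base_steps = steps_for_epochs(20)
total_steps = base_steps + HPT_TARGET_STEPS
print(f"20 base epochs = {base_steps} steps; with HPT = {total_steps} steps")
```

Under these assumptions, 20 epochs of basetraining plus 5000 HPT steps comes to roughly 105K steps, in line with the approximate equivalence stated above.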

9.4 Source Pretraining


Figure 11: These figures show the change in the linear evaluation results by adding the source pretraining for each of the target data amounts for the given datasets. We observed that for all three datasets, adding the source pretraining had a larger benefit as the amount of target data was reduced. In other words, these results show that the impact of pretraining on an intermediate source dataset decreases as the size of the target dataset increases.

In this ablation, we investigated the impact of the source pretraining stage in the HPT framework as the amount of target data changes. Intuitively, we expected that the source pretraining stage would have less impact as the amount of target data increased. For this ablation, we used the three HPT frameworks studied in §4: ImageNet (base) then Chexpert (source) then Chest-X-ray-kids (target), ImageNet (base) then DomainNet Clipart (source) then DomainNet Sketch (target), and ImageNet (base) then RESISC (source) then UC-Merced (target). For each framework, we pretrained with {1%,10%,25%,100%} of the target data on top of the base+source model and on top of only the source model, before performing a linear evaluation with the target data.

Figure 11 shows the change in the linear evaluation results by adding the source pretraining for each of the target data amounts. We observed that for all three datasets, adding the source pretraining had a larger benefit as the amount of target data was reduced. In other words, these results show that the impact of pretraining on an intermediate source dataset decreases as the size of the target dataset increases.

9.5 ResNet-18 Experiments

Figure 12: This figure shows the HPT linear evaluation performance on BDD, Chexpert, and RESISC using a ResNet-18. For these datasets, we observe similar behavior as with ResNet-50 though the evaluation performance is lower for all datasets and HPT shows improved performance at 50K iterations for all datasets. Generalizing from this experiment, we expect HPT to be broadly applicable across architectures, and we will report additional, ongoing results in our online code repository.

Similar to Figure 2, Figure 12 shows the same results using a ResNet-18 on BDD, Chexpert, and RESISC. For these datasets, we observe similar behavior as with ResNet-50, though the evaluation performance is lower. Generalizing from this experiment, we expect HPT to be broadly applicable across architectures, and we will report additional, community results in our online code repository.

9.6 BYOL Experiments

All of the pretraining results in the main paper are based on MoCo. Here, we use BYOL [20] for pretraining and perform linear evaluation on RESISC, BDD and Chexpert. As shown in Figure 13, we observe results similar to Figure 2 in the main paper. Generalizing from this experiment, we expect HPT to be broadly applicable across different self-supervised pretraining methods, and we will report additional, community results in our online code repository.

Figure 13: Linear eval with BYOL [20] pretraining on RESISC, BDD and Chexpert. For each dataset, we train a generalist model for 200 epochs on ImageNet (Base). We then train the whole model from 50-50K iterations (HPT: Base (BYOL)-Target). We compare the HPT model with a model trained from a random initialization on the target data (Target). We use a linear evaluation to evaluate the quality of the learned representations.

9.7 Representational Similarity

We examine how the similarity of representations changes during pretraining with the self-supervised base model HPT, supervised base model HPT, and self-supervised training on only the target data.

9.7.1 Defining metrics

We explore different metrics for measuring the similarity of two different pretrained models. The first metric we explore is the Intersection over Union (IoU) of misclassified images. The IoU is the ratio between the number of images misclassified by both models and the number of images misclassified by at least one model.

Input: data, labels, modelA, modelB
predictionA = modelA(data)
predictionB = modelB(data)
commonErrors = 0, totalErrors = 0
for all (label, predA, predB) in zip(labels, predictionA, predictionB) do
   if predA ≠ label and predB ≠ label then
      commonErrors += 1
   end if
   if predA ≠ label or predB ≠ label then
      totalErrors += 1
   end if
end for
return commonErrors / totalErrors
Algorithm 1 IoU
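The IoU of misclassified images can be sketched in Python as follows. This is an illustrative version rather than the authors' code: it takes precomputed predictions instead of models, the function name is hypothetical, and returning NaN when neither model makes an error is an assumption about an edge case the text does not address.

```python
def error_iou(labels, predictions_a, predictions_b):
    """Ratio of images misclassified by both models to images
    misclassified by at least one model."""
    common_errors = 0
    total_errors = 0
    for label, pred_a, pred_b in zip(labels, predictions_a, predictions_b):
        wrong_a = pred_a != label
        wrong_b = pred_b != label
        if wrong_a and wrong_b:
            common_errors += 1
        if wrong_a or wrong_b:
            total_errors += 1
    if total_errors == 0:
        return float("nan")  # assumption: IoU is undefined without any errors
    return common_errors / total_errors

# Toy example: model A errs on images 2 and 3, model B errs on images 1 and 2.
labels = [0, 1, 2, 3, 4]
preds_a = [0, 1, 0, 0, 4]
preds_b = [0, 2, 0, 3, 4]
print(error_iou(labels, preds_a, preds_b))
```

In the toy example, one image is misclassified by both models and three images are misclassified by at least one, so the IoU is one third.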

The activation similarity metric we used was RV2 [61], which, instead of comparing predictions, compares the similarity between two different layers’ outputs computed on the same data. The pseudocode for our algorithm is shown in Algorithm 2.

Input: activation A, activation B (both activations are of size n × p, where n is the number of data points and p is the size of the layer output)
Algorithm 2 RV2
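Since only the input line of the algorithm is available here, the following sketch uses the modified RV coefficient as it is commonly defined in the literature and may differ in detail from the authors' Algorithm 2. For activations of shape (number of data points) × (layer output size), it forms the Gram matrices over data points, zeroes their diagonals, and returns the normalized inner product of the two resulting matrices. Column-centering the activations first, the function names, and the list-of-lists input format are assumptions.

```python
import math

def _centered(rows):
    """Subtract each column's mean (assumed preprocessing step)."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]

def _gram_without_diagonal(rows):
    """Gram matrix over data points (rows @ rows.T) with the diagonal set to 0."""
    n = len(rows)
    return [
        [0.0 if i == j else sum(x * y for x, y in zip(rows[i], rows[j]))
         for j in range(n)]
        for i in range(n)
    ]

def _inner(g, h):
    """Sum of elementwise products, i.e. trace(g @ h) for symmetric matrices."""
    return sum(x * y for row_g, row_h in zip(g, h) for x, y in zip(row_g, row_h))

def rv2(activation_a, activation_b):
    """Modified RV coefficient between two activation matrices that share
    the same data points (rows) but may differ in layer width (columns)."""
    gram_a = _gram_without_diagonal(_centered(activation_a))
    gram_b = _gram_without_diagonal(_centered(activation_b))
    return _inner(gram_a, gram_b) / math.sqrt(_inner(gram_a, gram_a) * _inner(gram_b, gram_b))

# Hypothetical activations: 4 data points, layer widths 2 and 1.
acts_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
acts_b = [[1.0], [2.0], [0.5], [4.0]]
print(rv2(acts_a, acts_a), rv2(acts_a, acts_b))
```

Comparing a layer with itself, or with a rescaled copy of itself, gives a similarity of 1, and the value always lies between -1 and 1.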

Because many of our evaluations were performed by finetuning a linear layer over the outputs of the final convolutional layer of a pretrained model, we evaluated activations for the final convolutional layer and the linear layer finetuned on top of it.

For this analysis, we studied Resisc, UC Merced, Domain Net, Chexpert, Chest-X-ray-kids, and BDD datasets.

9.7.2 Effect of model initialization?

In this section, we present the results of two sets of experiments intended to analyze the representation of HPT models initialized from different base models. Overall, we found that HPT models with different base models learn different representations.

Random effects: In order to examine the effect of the random seed on model errors, we trained each combination of (basetrain, pretrain steps) five times for the RESISC dataset.

Models that share the same basetrain and number of pretrain steps make more similar errors and have much more similar representations than other combinations (see Figure 14). The IoU is typically between 0.5 and 0.6, meaning that roughly half of the misclassified images are unique to one of the two runs. The similarity is generally much higher, but it varies depending on the dataset.

Figure 14: Supervised and semi-supervised models trained with random seeds. Left IoU, right final linear layer similarity. Top supervised, bottom semi-supervised

Same basetrain: Out of three different basetrain configurations (random-initialized, supervised basetrain, and self-supervised basetrain), different runs starting from the same basetrain are typically more similar to each other by both metrics than those with different basetrains. This holds true across models trained for different number of iterations with different random seeds, and further adds to the notion that models learn different representations based on their starting point. However, after overfitting, models are less likely to follow this trend and become dissimilar to ones with the same basetrain. We believe this is most likely due to models learning representations that mainly distinguish input data and thus all become dissimilar to the best performing model.

We determined how similar models were using our two metrics of layer similarity and IoU errors of models with different basetrains and number of pretrain steps compared to the best performing model. All models used the same pretrain dataset and target dataset for evaluation. We focused on similarity to the highest performing model (in terms of both basetrain and training iterations) to see if different basetrains converged to the same representation.

Linear classification layers from the same basetrain are consistently more similar than those with different basetrains. This trend becomes less consistent after around 50,000 iterations of training, which is also when the self-supervised models we examined start overfitting. In Figure 15 we plot the similarity of linear layers for each model relative to the best performing models on four sample datasets.

Figure 15: Linear Classification Layer comparison. For all the following graphs, SPT refers to Supervised Pretrain with ImageNet, and SSPT refers to Self-Supervised Pretrain with MoCo. The best model is the model that attains 1.0 similarity, and every other model is compared to that point. Up until the target model’s pretrain steps (before overfitting), we can see that the similarity between the linear classification layers of every model with a different basetrain is much less than the models with the same basetrains.

This same observation held when comparing the similarity of the final convolutional layers instead of the linear classification layers as shown in Figure 16. Overall, the convolutional layers trained from the same basetrain were more similar to each other than other basetrains. There were just a few points of comparison that deviated from the trend in a little under half of the datasets we tested.

Figure 16: Final Convolutional Layer comparison. We can see that the models with basetrains similar to the best model consistently have higher similarity scores.

The IoU error comparisons in Figure 17 showed a similar trend to the linear layers, with models with the same basetrain being more similar on almost all random seeds and datasets until 50,000 iterations.

Figure 17: IoU comparison. IoU scores are consistently larger for models with the same basetrain as the best model compared to those of different basetrains, indicating that more similar errors are made by models with a similar initialization.

Finally, we performed a significance test to demonstrate the significant difference between representations learned from different basetrain models. We trained five pairs of models with identical hyperparameters, but with different basetrains (supervised vs self-supervised) and a different random seed. We also trained five pairs of models with identical hyperparameters, but with the same basetrain (self-supervised) and a different random seed. All models used ImageNet for the basetrain and RESISC for the pretrain. We calculated the linear layer similarity and IoU for each pair of models, and performed a Welch’s t-test on the results. We found that the similarities and IoUs were significantly different. The different basetrains had a mean similarity of 0.78 while the identical basetrains had a mean similarity of 0.98. The different basetrains had a mean IoU of 0.40 while the identical basetrains had a mean IoU of 0.61.
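For readers unfamiliar with Welch's t-test, the sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom for two independent samples with possibly unequal variances. The function name is hypothetical and the sample values are toy numbers chosen for easy hand-checking; they are not the measurements from this experiment.

```python
import math
from statistics import mean, variance

def welch_t_test(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof

# Toy samples (illustrative only, not data from the paper).
t_stat, dof = welch_t_test([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t_stat, dof)
```

A p-value can then be obtained from the Student t distribution with `dof` degrees of freedom; in SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch variant directly.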