Despite recent successes, many modern computer vision methods rely heavily on large-scale labeled datasets, which are often costly and time-consuming to collect [mahajan2018exploring, he2019momentum, chen2020improved]. One alternative that reduces the dependency on large-scale labelled data is pre-training a network on the publicly available ImageNet dataset with labels [deng2009imagenet]; ImageNet features have been shown to transfer well to many different target tasks [huh2016makes, xie2015transfer, shin2016deep, hendrycks2019using, kornblith2019better]. Another alternative, unsupervised learning, has received tremendous attention recently with the availability of extremely large-scale data with no labels, as labels are costly to obtain [mahajan2018exploring]. Recent unsupervised learning methods, e.g. contrastive learning, can perform on par with their supervised counterparts [he2019momentum, he2020momentum, grill2020bootstrap, caron2020unsupervised, chen2020simple, chen2020improved], and can even outperform supervised ImageNet pre-training on various downstream tasks [he2019momentum, he2020momentum, chen2020improved, sheehan2018learning, uzkent2019learning].
The explosion of data quantity and the improvement of unsupervised learning portend that the standard approach for future tasks will be to (1) learn weights on a very large scale dataset with unsupervised learning and (2) finetune the weights on a small-scale target dataset. A major problem with this approach is the large amount of computational resources required to train a network on a very large scale dataset [mahajan2018exploring]. For example, a recent contrastive learning method, MoCo-v2 [he2020momentum, he2019momentum], uses 8 Nvidia V100 GPUs to train on ImageNet-1k for 53 hours, which can cost thousands of dollars. Extrapolating, this forebodes pre-training costs on the order of millions of dollars for much larger-scale datasets. Those without access to such resources will need to select relevant subsets of those datasets. However, existing studies that perform conditional filtering, such as [Yan_2020_CVPR, cui2018large, ngiam2018domain, ge2017borrowing], do not take efficiency into account at all.
Cognizant of these pressing issues, we first investigate the use of low resolution images for pre-training, which improves efficiency with minimal performance loss. We also propose novel methods to efficiently filter a user defined number of pre-training images conditioned on a target dataset. Our approach consistently outperforms baselines in accuracy/average precision. Our methods are flexible, translating to both supervised and unsupervised settings, and adaptable, translating to a wide range of target tasks. Encouragingly, their focus on features, not labels, is accentuated in the more relevant unsupervised setting, where pre-training on a small fraction of the data achieves close to full pre-training performance on the target task.
2 Related Work
The goal in active learning is to fit a function by selectively querying labels for samples where the function is currently uncertain. In a basic setup, the samples assigned the highest entropies are chosen for annotation [wang2016cost, gal2017deep, beluch2018power, sener2017active]. The model is iteratively updated with these samples and accordingly selects new samples. Active learning typically assumes similar data distributions for candidate samples, whereas our data distributions can potentially have large shifts. Furthermore, active learning, due to its iterative nature, can be quite costly, hard to tune, and may require prior distributions [park2012bayesian].
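The basic entropy-based selection step described above can be sketched in a few lines; `entropy_select` is a hypothetical helper for illustration, not code from any cited work:

```python
import numpy as np

def entropy_select(probs, budget):
    """Pick the `budget` samples whose predictions are most uncertain.

    probs: (num_samples, num_classes) softmax outputs of the current model.
    Returns indices of the highest-entropy samples, to be sent for annotation.
    """
    eps = 1e-12  # guard against log(0)
    entropies = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-entropies)[:budget]

# Toy example: the model is confident on sample 0, uncertain on sample 1.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.34, 0.33, 0.33],
                  [0.70, 0.20, 0.10]])
chosen = entropy_select(probs, budget=2)  # the near-uniform sample ranks first
```

In a full active learning loop, the model would be retrained on the newly annotated samples and the scoring repeated, which is exactly the iterative cost noted above.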
Unconditional Transfer Learning
The success of deep learning on datasets with ever-increasing sample complexity has brought transfer learning to the attention of the research community. Pre-training networks on ImageNet-1k has been shown to be a very effective way of initializing weights for a target task with a small sample size [huh2016makes, xie2015transfer, shin2016deep, hendrycks2019using, kornblith2019better, uzkent2019ijcai, uzkent2018tracking]. However, all of these studies use unconditional pre-training, as they employ the weights pre-trained on the full ImageNet dataset for any target task, and, as mentioned, full pre-training on large scale data can be prohibitively costly.
Conditional Transfer Learning [Yan_2020_CVPR, cui2018large, ngiam2018domain], on the other hand, filters the pre-training dataset conditioned on target tasks. In particular, [cui2018large, ge2017borrowing] use greedy class-specific clustering based on feature representations of target dataset images. To learn image representations, they use an encoder trained on the massive JFT-300M dataset [hinton2015distilling]. [Yan_2020_CVPR] trains a number of expert models on subsets of the pre-training dataset; source images are selected if used by an expert with good target task performance, and this method is also quite computationally expensive.
These methods are often more costly than full pre-training, defeating the purpose of our premise, and many cannot flexibly adapt to unlabelled source data or non-classification target tasks. Our methods differ from past work as we prioritize efficiency and flexibility of filtering in addition to performance and adaptability on the target task and we show strong results on all of these goals.
3 Problem Definition and Setup
We assume a target task dataset represented as $D_t = \{(x_i^t, y_i^t)\}_{i=1}^{M}$, where $x^t$ represents a set of $M$ images with their ground truth labels $y^t$. Our goal is to train a function $f$, parameterized by $\theta$, on the dataset $D_t$ to learn $\theta_t$. One strategy is to randomly initialize the weights $\theta$, but a better recipe exists for a small dataset $D_t$. In this case, we first pre-train $f$ on a large scale source dataset $D_s$ and finetune on $D_t$. This strategy not only reduces the amount of labeled samples needed in $D_t$ but also boosts accuracy in comparison to randomly initialized weights [mahajan2018exploring, uzkent2019learning]. For the pre-training dataset, we can have both labelled and unlabelled setups: (1) $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N}$ and (2) $D_s = \{x_i^s\}_{i=1}^{N}$, where $N \gg M$. The most common example of the labelled setup is the ImageNet dataset [deng2009imagenet]. However, it is tough to label vast amounts of publicly available images, and with the increasing popularity of unsupervised learning methods [chen2020improved, chen2020mocov2, chen2020simple, he2019momentum, he2020momentum], it is easy to see that unsupervised pre-training on a very large $D_s$ with no ground-truth labels will be the standard and preferred practice in the future.
A major problem with learning on a very large scale dataset is the computational cost, and using the whole dataset may be impossible for most. One way to reduce costs is to filter out images deemed less relevant for $D_t$ to create a dataset $D_s^f$, where $D_s^f$ represents a filtered version of $D_s$ with $N_f \ll N$. Our approach conditions the filtering step on the target dataset $D_t$. In this study, we propose flexible and adaptable methods to perform efficient conditional pre-training, which reduces the computational costs of pre-training and maintains high performance on the target task.
We investigate a variety of methods to perform efficient pre-training while attempting to minimize accuracy loss on the target dataset. We visualize our overall technique in Figure 1 and explain our techniques below.
4.1 Conditional Data Filtering
We propose two novel methods to perform conditional filtering efficiently. Our methods score every image in the source domain and select the best scoring images according to a pre-specified data budget $N_f$. Our methods are fast, as they require at most one forward pass through $D_s$ to get the filtered dataset $D_s^f$, and they work in both the labelled and unlabelled setups. The fact that we consider data features, not labels, perfectly lends our methods to the more relevant unsupervised setting. This is in contrast to previous work such as [cui2018large, ge2017borrowing, ngiam2018domain], which does not consider efficiency and is designed primarily for the supervised setting, and thus will be difficult for most to apply to large scale datasets.
4.1.1 Conditional Filtering by Clustering
Selecting an appropriate subset of pre-training data can be viewed as selecting a set of data that minimizes the Earth Mover Distance (or 1-Wasserstein distance) between $D_s^f$ and the target dataset $D_t$, as explored in [cui2018large, ge2017borrowing]. This is accomplished by taking feature representations of the set of images and selecting pre-training image classes which are close (by some distance metric) to the representations of the target dataset classes.
Our method can be interpreted as an extension of this high level idea (i.e. minimizing distance between datasets), but we make several significant modifications to account for our goals of efficiency and application to unsupervised settings.
Training Only with Target Data. We do not train a network on a large scale dataset, e.g. JFT-300M [cui2018large], as this defeats the entire goal of pre-training efficiency. Instead, we first train a model with parameters $\theta_t$ using the target dataset $D_t$ and use the learned $\theta_t$ to perform filtering.
Consider Source Images Individually. Selecting entire classes of pre-training data is inefficient and wasteful when limited to a small subset of the data. For example, if limited to 6% of ImageNet (a reasonable budget for massive datasets), we can only select 75 of the 1000 classes, which may prohibit the model from having the breadth of data needed to learn transferable features. Instead, we treat each image from $D_s$ separately, which lets us flexibly over-represent relevant classes without being forced to select entire classes wholesale. Additionally, very large scale datasets may not have class labels, so we want methods that work with unsupervised learning, and treating source images independently accomplishes this.
Scoring and Filtering.
Finally, we perform K-Means clustering on the representations learned by $f_{\theta_t}$ to get cluster centers $\{c_j\}_{j=1}^{K}$. We then compute the distance between the representation $z_i^s = f_{\theta_t}(x_i^s)$ of each source image and each cluster center as
$$d_{ij} = \lVert z_i^s - c_j \rVert_p,$$
where $p$ is typically 1 or 2 (L1 or L2 distance). We can score $x_i^s$ by considering an Aggregation Operator of either average distance to the cluster centers,
$$s_i = \frac{1}{K} \sum_{j=1}^{K} d_{ij},$$
or minimum distance,
$$s_i = \min_{j} d_{ij}.$$
To filter, we sort by $s_i$ in ascending order and select the top $N_f$ images to create $D_s^f$ and pre-train on it.
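The scoring procedure above can be sketched as follows, assuming feature vectors have already been extracted for both datasets; the tiny `kmeans` helper stands in for any standard K-Means implementation, and all names are illustrative:

```python
import numpy as np

def kmeans(feats, k, iters=20, seed=0):
    """Minimal K-Means; returns a (k, d) array of cluster centers."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign every point to its nearest center, then recompute the centers.
        assign = np.linalg.norm(feats[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = feats[assign == j].mean(axis=0)
    return centers

def cluster_filter(target_feats, source_feats, budget, k=4, agg="min"):
    """Score each source image by L2 distance to the target cluster centers
    and keep the `budget` lowest-scoring (most target-like) images."""
    centers = kmeans(target_feats, k)
    dists = np.linalg.norm(source_feats[:, None] - centers[None], axis=2)  # (N, k)
    scores = dists.min(axis=1) if agg == "min" else dists.mean(axis=1)
    return np.argsort(scores)[:budget]  # ascending: closest to target first
```

The `agg` switch mirrors the two Aggregation Operators (average and minimum distance) defined above.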
Advantages of our Method. Performing unsupervised clustering ensures that our method is not fundamentally limited to classification tasks and does not assume that images in the same class should be grouped together. Furthermore, our method does not assume source dataset labels and requires only a relatively cheap single forward pass through the pre-training dataset. It attains our goals of efficiency and flexibility, in contrast to prior work such as [ge2017borrowing, cui2018large]. We outline the algorithm step-by-step in Algorithm 1 and lay out the method visually in Figure 2.
4.1.2 Conditional Filtering with Domain Classifier
In this section, we propose a novel domain classifier to filter $D_s$, with several desirable attributes.
Training. In this method, we propose to learn a binary classifier $f_{\theta_d}$ to ascertain whether an image belongs to $D_s$ or $D_t$. $f_{\theta_d}$ is learned on a third dataset $D_d = \{(x_i, y_i^d)\}_{i=1}^{M_d}$, where $M_d \ll N$, consisting of $D_t$ and a small random subset of $D_s$. Each source image receives a negative label and each target image receives a positive label, giving us the label set $y^d \in \{0, 1\}$. We then learn $\theta_d$ on $D_d$ using the cross entropy loss
$$L(\theta_d) = -\frac{1}{M_d} \sum_{i=1}^{M_d} \big[ y_i^d \log f_{\theta_d}(x_i) + (1 - y_i^d) \log (1 - f_{\theta_d}(x_i)) \big].$$
Scoring and Filtering. Once we learn $\theta_d$, we obtain the confidence score $s_i = f_{\theta_d}(x_i^s)$ for each source image $x_i^s$. We then sort the source images in descending order of $s_i$ and choose the top $N_f$ images to create the subset $D_s^f$.
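The full method trains a CNN on pixels; as an illustrative stand-in, the sketch below fits a simple logistic domain classifier on precomputed features and keeps the source images scored most target-like. All names are hypothetical and the optimizer is plain gradient descent, chosen for brevity:

```python
import numpy as np

def domain_classifier_filter(target_feats, source_feats, budget, epochs=300, lr=0.5):
    """Label target features 1 and source features 0, fit a logistic classifier
    by gradient descent on the cross entropy loss, then keep the `budget`
    source images with the highest predicted probability of being target-like."""
    X = np.vstack([target_feats, source_feats])
    y = np.concatenate([np.ones(len(target_feats)), np.zeros(len(source_feats))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad = p - y                            # d(cross entropy)/d(logit)
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    scores = 1.0 / (1.0 + np.exp(-(source_feats @ w + b)))  # P(target | x)
    return np.argsort(-scores)[:budget]         # descending confidence
```

Note that, unlike the clustering method, this requires no representation learning on the target task: only a binary decision boundary between the two image distributions.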
Interpretation. It can be shown [grover2019bias] that the Bayes optimal binary classifier assigns probability
$$P(y^d = 1 \mid x) = \frac{p_t(x)}{p_t(x) + p_s(x)}$$
for an image $x$ to belong to the target domain, where $p_t(x)$ and $p_s(x)$ are the true data probability distributions for the target and source domains, respectively. Therefore, this method can be interpreted as selecting the images from the pre-training domain with the highest probability of belonging to the target domain. Our method is very efficient and can be applied to a wider range of target tasks than clustering, since learning a binary domain classifier is a much simpler and faster task than learning accurate representations.
Classifier Calibration. This method can be augmented by improving the calibration of $f_{\theta_d}$ with a technique such as temperature scaling [guo2017calibration] or focal loss [lin2017focal], and using these improved estimates to perform importance sampling bias correction during pre-training, as explained in [grover2019bias]. However, we find that the method works very well on a wide range of tasks and datasets without any further tuning, an attractive property when considering flexibility and efficiency. We outline the algorithm step-by-step in Algorithm 2 and provide a depiction in Figure 3.
4.2 Adjusting Pre-training Spatial Resolution
To augment our methods, we propose changing the spatial resolution of images in the source dataset while pre-training. We assume that an image is represented as $x^s \in \mathbb{R}^{W \times H \times 3}$ or $x^t \in \mathbb{R}^{W \times H \times 3}$, where $W$ and $H$ represent image width and height. Traditionally, after augmentations, we use $W = H = 224$. Here, we consider decreasing $W$ and $H$ on the pre-training task while maintaining $W = H = 224$ on the target task. It is easy to see that reducing image resolution while pre-training can provide significant speedups by decreasing the FLOPs required by convolution operations. Indeed, our experiments show that halving the image resolution almost halves the pre-training time.
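The FLOP argument can be checked with simple arithmetic. The sketch below counts multiply-adds for a single convolution at the two resolutions, using ResNet's 7x7 stem convolution purely as an example; the measured wall-clock speedup is closer to 2x than 4x because data loading and other overheads do not shrink with resolution:

```python
def conv_flops(h, w, c_in, c_out, k, stride=1):
    """Approximate multiply-add count of one conv layer with 'same' padding."""
    out_h, out_w = h // stride, w // stride
    return out_h * out_w * c_out * (c_in * k * k)

# ResNet's stem conv (7x7, stride 2, 3 -> 64 channels) at the two resolutions.
full = conv_flops(224, 224, c_in=3, c_out=64, k=7, stride=2)
half = conv_flops(112, 112, c_in=3, c_out=64, k=7, stride=2)
ratio = full / half  # halving W and H quarters each conv layer's FLOPs
```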
Training on downsized images and testing on higher resolution images has previously been explored. In [touvron2019fixing], due to geometric camera effects in standard augmentations, the authors report performance gains by training on lower resolution images and testing at normal resolution. Our setting is not as amenable to the same analysis, as we have separate data distributions $D_s$ and $D_t$ captured under different settings, and we perform the same training augmentations during both pre-training and finetuning. Nevertheless, we show that low resolution training is still an effective method in the transfer learning setting.
|Dataset||Classes||Train Images||Test Images|
|Stanford Cars [krause20133d]||196||8143||8041|
|Caltech Birds [WelinderEtal2010]||200||6000||2788|
|Functional Map of the World [christie2018functional]||62||18180||10609|
In all experiments, we report finetuning performance for each combination of resolution, pre-training budget, and filtering method. For context, we also report performance with full pre-training and no pre-training.
5.1.1 Source Dataset
We test and validate our methods in a wide range of settings. For our source dataset, we utilize ImageNet-2012 [deng2009imagenet], with 1.28M images over 1000 classes. Downstream tasks include fine grained classification, more general classification, and object detection. We experiment under two data budgets, limiting filtered subsets to 75K (6%) and 150K (12%) ImageNet images. This is an appropriate proportion when dealing with pre-training datasets on the scale of tens of millions or more images.
5.1.2 Target Datasets
Classification. As target datasets, we utilize the Stanford Cars [krause20133d] dataset, the Caltech Birds [WelinderEtal2010] dataset, and a subset of the Functional Map of the World (fMoW) [christie2018functional] dataset. We provide basic details about these datasets in Table 1. These datasets lend important diversity to validate the flexibility of our methods. Cars has a fairly small distribution shift from ImageNet, and pre-training on ImageNet performs well on it [cui2018large]. Birds exhibits a larger shift, and a dataset emphasizing natural settings, such as iNat [cui2018large, van2018inaturalist], performs better on it than ImageNet. Finally, fMoW, consisting of overhead satellite images, contains images very dissimilar to ImageNet. Additionally, Birds and Cars are fine grained, discriminating between different species of birds or models of cars, respectively. In contrast, classification of satellite images is much more general, describing buildings or landmarks [sarukkai2020cloud, uzkent2020efficient, uzkent2020learning].
Object Detection. [he2019momentum, he2020momentum] show that unsupervised ImageNet pre-training is most effective when paired with challenging downstream tasks like object detection. Therefore, we also perform experiments in the object detection setting to validate the effectiveness and adaptability of our methods in more challenging tasks than classification. We utilize the Pascal VOC [everingham2010pascal] dataset for object detection with unsupervised ImageNet pre-training of the base feature extractor.
We experiment with clustering based filtering, using both average and minimum distance to the K-Means cluster centers, as well as our dataset classifier method, using a ResNet-18 [he2016deep] as the classifier. Furthermore, we combine our filtering methods with downsizing the pre-training image resolution from 224x224 to 112x112 using bilinear interpolation. We filter with 224x224 source images but use the resulting subsets for lower resolution pre-training as well to assess flexibility, as we want robust methods that do not need to be specifically adjusted to the pre-training setup.
Efficiency Comparison. Compared to clustering, domain classifier filtering is more efficient because it bypasses the most expensive step in either method: training a model to extract feature representations. It replaces this with a much simpler binary classification task, which can be trained to high accuracy in five or fewer epochs, typically requiring around an hour on a single GPU. Furthermore, the forward pass is quicker, and no clustering or distance computation is required.
Both of our methods are flexible, being readily applicable to the unsupervised pre-training setting. However, using the clustering method for a more challenging target task, such as object detection, while feasible, requires solving that costlier task (or resorting to unsupervised learning) simply to extract feature vectors. This further erodes efficiency relative to the classifier, which is extremely adaptable and can be applied without modification to any pair of datasets or any target task while maintaining speed and performance. We confirm this by using it for object detection, where clustering based filtering would require significantly more effort.
Qualitative Validation. In Figures 4 and 5, we visualize some of the highest scoring filtered images for all our methods on the classification tasks and verify that our filtering methods do select images with features relevant to the target task. Unsurprisingly, more interpretable images are selected for Birds and Cars, as there are no satellite images in ImageNet. Nevertheless, we see that the selected images for fMoW still contain relevant features such as color, texture, and shapes.
5.3 Transfer Learning for Image Recognition
We apply our methods to the task of image classification with both supervised and unsupervised pre-training. We detail our setup, results, and experiments below.
5.3.1 Experimental Setup
For classification tasks, we finetune by replacing only the linear layer of the pre-trained model and then training all the weights on the target dataset.
Supervised Pre-training. For supervised pre-training, in all experiments, we utilize the ResNet-34 model [he2016deep] on 1 Nvidia-TITAN X GPU. We perform standard cropping/flipping transforms for ImageNet and the target data. For pre-training, we pre-train on the given subset of ImageNet for 90 epochs, utilizing SGD with momentum .9, weight decay of 1e-4, and learning rate .01 with a decay of 0.1 every 30 epochs. We finetune for 90 epochs with a learning rate decay of 0.1 every 30 epochs for all datasets. For Cars and Birds, we utilize SGD with momentum .9 [qian1999momentum], learning rate 0.1, and weight decay of 1e-4. For fMoW, we utilize the Adam optimizer [kingma2014adam] with learning rate 1e-4.
Unsupervised Pre-training. For unsupervised pre-training, we utilize the state of the art MoCo-v2 [he2020momentum] technique using a ResNet-50 model [he2016deep] in all experiments. We train on 4 Nvidia GPUs. MoCo [he2019momentum, he2020momentum] is a self-supervised learning method that utilizes contrastive learning, where the goal is to maximize agreement between different views of the same image (positive pairs) and to minimize agreement between different images (negative pairs). Our choice to use MoCo is driven by (1) large image sizes, and (2) computational cost. Compared to other self-supervised frameworks, such as SimCLR [chen2020simple], which require a batch size of 4096, MoCo uses a momentum updated queue of previously seen samples and achieves comparable performance with a batch size of just 256 [he2019momentum].
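The contrastive objective described above can be sketched as follows. This is an illustrative numpy rendition of the InfoNCE loss with a MoCo-style negative queue, not the authors' training code; in practice the queue holds momentum-encoder features and the loss is backpropagated through the query encoder:

```python
import numpy as np

def info_nce(q, k_pos, queue, t=0.2):
    """InfoNCE loss: each query q[i] should match its positive key k_pos[i]
    and mismatch every feature in the momentum-encoder queue (the negatives)."""
    # L2-normalize so dot products are cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k_pos = k_pos / np.linalg.norm(k_pos, axis=1, keepdims=True)
    queue = queue / np.linalg.norm(queue, axis=1, keepdims=True)
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True)  # (B, 1) positive logits
    l_neg = q @ queue.T                               # (B, K) negative logits
    logits = np.concatenate([l_pos, l_neg], axis=1) / t
    # Cross entropy with the positive always at index 0 (numerically stable).
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

The queue is what lets MoCo use many negatives with a batch size of only 256, in contrast to SimCLR's in-batch negatives.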
We keep the same data augmentations and hyperparameters used in [he2020momentum]. We finetune for 100 epochs using a learning rate of 0.001 and batch size of 64, with the SGD optimizer for Cars and Birds and the Adam optimizer for fMoW.
5.3.2 Supervised Pre-training Results
We present target task accuracy for all our methods on Cars, Birds, and fMoW along with approximate pre-training and filtering time in Table 2.
Effect of Image Resolution. We see that downsizing resolution by a factor of 2 yields slight gains on Cars and slight losses on Birds and fMoW, while being substantially faster than full resolution pre-training. These trends suggest that training on lower resolution images can help the model learn more generalizable features when the source and target distributions are similar. This effect erodes slightly as we move out of distribution, but the efficiency-to-accuracy tradeoff offered by downsizing is well worth it in every setting.
Impact of Filtering. We find that our filtering techniques consistently provide a performance increase over random selection, with a relatively small increase in cost due to the efficiency of our filtering methods. Unsurprisingly, filtering provides the most gains on Cars and Birds, where the target dataset has a smaller shift. On fMoW, it is very hard to detect "similar" images to ImageNet, as the two distributions have very little overlap. Nevertheless, even in this setting, our filtering methods can still select enough relevant features to provide a boost in most cases.
Comparison of Filtering Methods.
While all our methods perform well, applying a finer lens, we see that the dataset classifier is less variable than clustering and never does worse than random. On the other hand, average clustering performs well on Cars and fMoW but worse than random on Birds, and vice versa for min clustering. These methods rely on computing high dimensional vector distances to assign a measure of similarity and are susceptible to the usual associated pitfalls of outliers and the curse of dimensionality. They also require accurate representation learning, which is more challenging than simply training a dataset classifier.
5.3.3 Unsupervised Pre-training Results
We observe promising results in the supervised setting, but as explained, a more realistic and useful setting is the unsupervised setting due to the difficulties inherent in collecting labels on large scale data. Thus, we use MoCo-v2 to pre-train on ImageNet and present results for Cars, Birds, and fMoW in Table 3.
Effect of Image Resolution. We find that in the unsupervised setting, with 150K pre-training images, lower resolution pre-training no longer incurs slight drops as the target distribution shifts. This can be attributed to image resolution mattering more in supervised learning, with its intricate label distributions, but less when images are related on a more general level, as in unsupervised learning. We see that filtering is somewhat more effective at normal resolution, but, as mentioned, we perform filtering at normal resolution to test robustness. Fixing the pre-training resolution and filtering at that resolution would likely improve its already strong performance.
Increased Consistency of Clustering. The classifier still performs very well, giving the best single result on Birds and Cars, but relative to the supervised setting, clustering is more consistent, likely because contrastive learning makes similar assumptions and also considers distances between high dimensional feature vectors.
Increased Effectiveness of Filtering. Our filtering techniques, particularly the dataset classifier, primarily aim to separate the image distributions based on the true image distributions $p_t(x)$ and $p_s(x)$, instead of considering the marginal or conditional source class distributions $p_s(y)$ or $p_s(y \mid x)$ (which may not even be observable), unlike [ngiam2018domain]. Due to this emphasis on feature similarity, unsupervised learning better takes advantage of our filtering methods, and this is clear when looking at the results. Performance on fMoW, with its large distribution shift, is similar, but on Birds and Cars we see substantial gains over random filtering in both the 75K and 150K settings, a larger boost than during supervised pre-training.
Performance Relative to Full Pre-training. On the back of the aforementioned gains, we can achieve performance close to that of full unsupervised pre-training while being nearly 10 times faster, due to filtering and downsizing. This is an excellent tradeoff relative to the supervised setting, where performance fell further short of full pre-training. Moreover, pre-training on just 75K (6%) low resolution images yields a 15 times speedup over full pre-training. These results are extremely encouraging because, as mentioned, we anticipate that unsupervised learning will be the default method for large scale pre-training.
5.4 Transfer Learning for Object Detection
As mentioned previously, our dataset classifier is more adaptable than clustering. To validate this, we pivot to object detection and confirm that the classifier produces large performance gains over random filtering on Pascal VOC with no extra effort. We do not experiment with clustering here, as it would take significantly more effort.
Experimental Setup. We use a standard setup for object detection with a Faster R-CNN detector and an R50-C4 backbone, as in [he2019momentum, he2017mask, wu2019detectron2]. We pre-train the backbone on ImageNet with MoCo-v2. We finetune for 24k iterations (~23 epochs) on trainval2007 (~5k images). We evaluate on the VOC test2007 set with the default AP50 metric and the more stringent COCO-style [lin2014microsoft] AP and AP75 metrics.
|Pre-training Sel. Method||AP||AP50||AP75||AP||AP50||AP75|
Results. We present results in Table 4 using no pre-training and full pre-training, for context, as well as with random filtering and classifier based filtering combined with image resolution downsizing.
Effect of Image Resolution. For all three metrics, pre-training on low resolution images produces only a marginal decrease in performance, with the usual corresponding reduction in training time; this is impressive on a challenging task like object detection and confirms the adaptability of lower resolution pre-training. To reiterate, we perform filtering at 224x224 but still see large filtering gains at 112x112, confirming the robustness of these methods.
Impact of Dataset Classifier Filtering. We observe that our proposed dataset classifier filtering technique performs impressively, with a consistent gain over random filtering across all metrics in all scenarios, and the largest gains on the more stringent metrics like AP75. These results validate the adaptability, flexibility, and effectiveness of our method, as it provides significant boosts for both classification and object detection while seamlessly transitioning between the two domains.
6 Conclusion
We present novel methods to perform filtered pre-training for transfer learning, conditioned on a target dataset. In contrast to prior work, we focus on efficiency, flexibility, and adaptability, in addition to performance. Our methods require at most one forward pass through the source dataset, can easily be applied to source datasets without labels, and, given a pre-training data budget, achieve superior performance to random filtering on both classification and object detection, with little extra cost. We also find that lowering pre-training image resolution significantly reduces cost with minimal, if any, performance loss. Unsupervised pre-training amplifies the benefits of our filtering methods in all settings, due to our focus on identifying relevant features, not classes. There, we achieve performance comparable to full pre-training with just a small fraction of the training cost, an encouraging result, as large scale unsupervised pre-training will likely be the default pre-training method in the future.