Revisiting Pre-training: An Efficient Training Method for Image Classification

11/23/2018 ∙ by Bowen Cheng, et al. ∙ 10

The training method of repeatedly feeding all samples into a pre-defined network for image classification has been widely adopted by current state-of-the-art methods. In this work, we provide a new method that trains classification networks more efficiently. Starting with a warm-up step, we propose to continually repeat a Drop-and-Pick (DaP) learning strategy. In particular, we drop easy samples to encourage the network to focus on hard ones. Meanwhile, by picking up all samples periodically during training, we refresh the network's memory and prevent catastrophic forgetting of previously learned knowledge. Our DaP learning method can recover up to 99.88% of the baseline accuracy for ResNet-50, DenseNet-121, and MobileNet-V1 while requiring only 75% of the training computation of the classic training schedule. Furthermore, our pre-trained models show strong knowledge transferability when used for downstream tasks, especially for hard cases. Extensive experiments on object detection, instance segmentation and pose estimation demonstrate the effectiveness of our DaP training method.




1 Introduction

Figure 1: An illustration of our proposed training method. Each row denotes a training cycle. The left column denotes the initial stage of each training cycle, which uses all available training examples (orange dots). As training goes on within a cycle, we gradually discard training examples (gray) to save training computation. The labels in the figure mark training epochs and the interval between drops. Samples are removed by dropping the examples with the lowest losses, and the removed samples may differ between training stages.

Pre-training on a large-scale image classification dataset has been shown to be effective for improving the performance of many downstream visual recognition tasks, e.g. object detection [23, 15, 7, 16], semantic segmentation [34, 2], and human pose estimation [32, 28]. Although efforts have been made to find more advanced architectures [26, 27] or normalization methods [31] that allow these downstream tasks to be trained from scratch without pre-training, the gap is still not negligible. ImageNet [5] has long been the standard pre-training dataset. Many recent works [2, 19] find that the larger the dataset used for pre-training, the larger the gain for downstream tasks. As larger datasets are considered, pre-training brings new challenges in terms of both budget (resources) and time. For example, the pre-trained model in [19] is trained on 336 GPUs across 42 machines for 22 days. Such training time and computational resources are likely not affordable for many research groups. It would therefore be helpful to find a more efficient training procedure that reduces the training time, so that pre-training can better serve downstream tasks.

The increase in pre-training time is due to the increase in the number of training examples. Current dataset construction techniques simply sample images randomly from the web (e.g. Flickr, Instagram) with some keywords or hashtags [5, 19]. However, the contribution of each example varies significantly during training, which current training strategies do not take into account. We therefore ask: can we devise a more efficient training method that takes the contribution of each example into account and so reduces the training time under a fixed computational budget?

Human beings are exposed to data at a much larger scale than even the largest datasets, yet they still learn efficiently. We observe that when people begin to learn and develop their intelligence, they tend to start by learning a broad area of knowledge without going into depth (similar to using all training examples). As the learning process proceeds, people focus on a narrower and much more difficult area (a “drop” in the number of training examples). When people start to forget what they have learned before, however, they can quickly catch up by reviewing all materials in a short time (a “pick” of easy examples). Motivated by this observation, we propose an efficient training method named Drop-and-Pick (DaP) that imitates this process, so that all the training data can be used more efficiently.

Concretely, we illustrate our DaP training method in Figure 1. The basic idea is to divide the training process into cycles. In each cycle, the network first learns from all training examples (similar to the stage when a person acquires broad knowledge). As training goes on, we gradually feed the network only a subset of the training examples (similar to a person studying a specific subject). To prevent the model from forgetting what it learned before, we periodically restart the cycle: all training data is used again, and the process of reducing the training data from epoch to epoch starts over.

We perform extensive experiments on a variety of image classification benchmarks (CIFAR10, CIFAR100, ImageNet) to first verify the efficiency and effectiveness of our proposed method. Specifically, our method requires only 63%, 79%, and 75% of the training computation to achieve results comparable to using all training examples on CIFAR10, CIFAR100, and ImageNet respectively. We also evaluate the generalization ability of models pre-trained with our method on various downstream tasks, including object detection, instance segmentation, and human pose estimation. We observe that although less computation is used in pre-training, the generalization ability is not affected in these downstream tasks. More interestingly, we find that the performance on hard examples (e.g. small objects in object detection and instance segmentation) is consistently improved.

Our contributions are threefold:

  • We propose an efficient training method named Drop-and-Pick (DaP) that can save training computation while retaining the performance.

  • We validate the proposed method on different benchmarks including a large-scale dataset (ImageNet).

  • We verify that although less training computation is used, the generalization ability of our pre-trained models is not affected in a variety of downstream tasks, and most experiments even show some improvement.

2 Related Works

The purpose of our method is to speed up training. In this section, we compare different methods that reduce the training time.

2.1 Fast Training Methods

Prior work has explored training a model as fast as possible assuming unlimited computational resources. A large batch size is used in [6] together with a carefully designed learning-rate warm-up schedule; [6] trains a ResNet-50 model in 1 hour on 256 GPUs without loss in accuracy. Mixed precision training is explored in [20], where the forward and backward passes are computed in half precision and the parameter update in single precision. With the tensor cores of the Nvidia Volta architecture, mixed precision can speed up training by nearly 50%. Combinations with faster inter-node communication have been studied in [25, 11]. In particular, [11] reports training a ResNet-50 model in 7 minutes on 1024 GPUs without loss in accuracy. Although fast training has been studied, the amount of computational resources required is out of reach for small research labs or individual researchers. In this paper, we try to speed up training with limited computational resources (e.g. with 4 GPUs available). Our method is orthogonal to [6, 20, 25, 11], and we believe it can further speed up training when combined with these methods.

2.2 Efficient Network Design

Designing networks with more efficient operators is another way to reduce the training time. The MobileNet [9, 24] family replaces a regular convolution with a depth-wise separable convolution, which consists of a depth-wise convolution followed by a regular 1×1 (point-wise) convolution. The ShuffleNet [33, 18] family further replaces the regular convolution with a group convolution and introduces a channel shuffle operation to break the channel dependency. However, one problem with efficient networks is that they usually reduce computation at the cost of accuracy. Another problem is that experiments show efficient networks converge more slowly: a ResNet [8] normally converges in 90-120 epochs, whereas it takes more than 480 epochs to train smaller networks [24, 33]. Thus, an efficient network design does not always result in a shorter training time.
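To make the saving from depth-wise separable convolutions concrete, the following sketch counts multiply-adds for one layer. The layer sizes (3×3 kernel, 64 input channels, 128 output channels, 56×56 output map) are illustrative assumptions, not values taken from this paper, and the function names are hypothetical.

```python
def conv_mult_adds(k, c_in, c_out, h, w):
    """Multiply-adds of a regular k x k convolution with an h x w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_mult_adds(k, c_in, c_out, h, w):
    """Multiply-adds of a depth-wise k x k convolution plus a 1 x 1 convolution."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed for this example only).
standard = conv_mult_adds(3, 64, 128, 56, 56)
separable = depthwise_separable_mult_adds(3, 64, 128, 56, 56)
ratio = separable / standard           # equals 1/c_out + 1/k**2
print(f"separable / standard = {ratio:.3f}")
```

For this assumed layer the separable version needs roughly 12% of the multiply-adds, which reduces the cost per iteration but, as noted above, says nothing about how many epochs are needed to converge.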

2.3 Active Learning

Active learning is a method to expand the size of a dataset by selecting the examples whose labels would be most helpful to a specific network. The key idea is to select the most uncertain unlabeled examples for labeling. In some sense, this idea is related to our proposed strategy, in which we keep the most uncertain labeled examples.

Joshi et al. [12] propose to use an SVM classifier where uncertainties are calculated from probabilistic outputs over the class labels. Osband et al. [21] show how to obtain uncertainty measures with neural networks. Lakshminarayanan et al. [14] use an extra head in the neural network that is trained to estimate the variance, which allows the variance to be used along with the predictions to estimate uncertainty. In this work, we simply use the training loss as an indicator of the uncertainty of an example: the larger the loss, the more uncertain the example is to the network.
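As a small illustration of using the loss as an uncertainty signal, the sketch below computes the softmax cross-entropy of a single example from its logits. The function name and the logit values are hypothetical and chosen only for this example.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy loss of one example, computed stably from raw logits."""
    m = max(logits)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Two made-up examples whose true class is 0.
confident = softmax_cross_entropy([4.0, 0.5, 0.1], label=0)  # clear prediction
uncertain = softmax_cross_entropy([1.0, 0.9, 0.8], label=0)  # near-uniform prediction
print(confident, uncertain)
```

The near-uniform prediction yields the larger loss, so ranking examples by loss puts the ones the network is least sure about first.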

3 Approach

set of all training images and labels
set of sampled training images and labels
set of training losses
set of network outputs
sampled mini-batch
list of epochs to reuse all data
number of training epochs
number of warm-up epochs
number of interval epochs
percentage of samples to keep
number of active epochs
Table 1: Notations in Algorithm 1.

3.1 Motivation

When human beings begin to learn and develop their intelligence, they tend to start by learning a broad area of knowledge without going into depth. For example, in the early stages of school, people take a variety of introductory courses in different subjects (e.g. math, physics and chemistry). As the learning process proceeds, people focus on a narrower and much more difficult area (e.g. calculus in math). During this process, people start to forget what they learned before because they do not use that knowledge often (e.g. physics, chemistry). However, if people want to catch up on material they have forgotten, they can do so in a much shorter time, and they usually learn it even better.

This learning process is extremely efficient, as people gradually focus on a smaller and smaller body of “hard” knowledge. This strategy, often adopted by humans, motivates our work. Specifically, we ask whether deep neural networks can be trained efficiently by imitating the way human beings learn. The goal of this work is to train a model with as little computation as possible while maintaining performance. Without loss of generality, we demonstrate that this goal is achievable for the task of image classification.

3.2 Abstraction of Human Learning Process

We begin by abstracting the human learning process. (Note that this does not necessarily mean humans truly learn in this way from a neuroscience perspective.) The broad area of knowledge that humans learn during their early stage is similar to feeding all training data to the model at the start of training. As humans begin to focus on a more difficult subject, we can imitate this process by gradually feeding only the most difficult examples (examples with large losses) to the model. Since humans periodically review what they have learned before, we also periodically feed all the training data to the model again and restart the above process. The detailed training method is discussed in the following.

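The epoch loop, batch loop and weight update form the standard training loop; the two "if" blocks are our method.

input: all training images and labels
input: list of reuse epochs, number of training epochs, warm-up epochs, interval epochs, keep percentage, active epochs
for each training epoch do
     for every batch sampled from the current training set do
         Update model weights and record the loss of each example
     end for
     if the warm-up epochs have passed then
         if an interval has elapsed and the active epochs of the current cycle have not run out then
             Keep only the given percentage of the current examples with the largest losses
         end if
     end if
     if the epoch is in the list of reuse epochs then
         Reset the current training set to all training examples
     end if
end for
Algorithm 1 DaP Training Policy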

3.3 Drop-and-Pick Training

In this section, we first summarize the set of training parameters (listed in Table 1) and their notations that will be used throughout this paper. Then, the details of how to conduct the training of “Drop-and-Pick” (DaP) are provided.

Notations We set a number of “warm-up” epochs at the beginning of training. In this warm-up period, the model is guaranteed to see all training examples, imitating a person taking a broad range of courses. Then, we use the number of interval epochs and a keep percentage to control, respectively, the length of a learning stage (examples are subsampled once every interval) and the number of remaining examples (the keep percentage is applied to the current examples). Furthermore, a number of active epochs controls how many learning stages there are (sampling stops once the active epochs have elapsed). Finally, we set a list of epochs at which all training examples are reused to prevent forgetting.

Training Details Our DaP method can be summarized as periodically dropping easy examples and picking up all examples. As shown in Algorithm 1, we divide the overall training process into cycles (periods) delimited by the list of reuse epochs. Within each period, once every interval we keep only the hardest examples, as a fixed percentage of the current ones (the “drop” stage). To find them, we sort the examples by their losses in descending order and keep the top of the list up to the keep percentage. When training enters a new cycle, we reuse all training examples (the “pick” stage), followed by another “drop” stage. The only exception is the first period, which starts with a warm-up: during the warm-up epochs, we keep all examples for training. We also provide an optional number of “active” epochs. If it is set, the “drop” stage lasts only for that many epochs within each training cycle.
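The “drop” step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name, the dictionary representation of per-example losses, and the rounding rule are assumptions made for the example.

```python
def drop_easy_examples(losses, keep):
    """Return the indices of the hardest `keep` fraction of the given examples.

    `losses` maps an example index to its most recent training loss
    (a hypothetical representation chosen for this sketch).
    """
    # Sort indices by loss in descending order, so the hardest examples come first.
    ranked = sorted(losses, key=losses.get, reverse=True)
    # Rounding to the nearest integer is an assumption of this sketch.
    num_kept = max(1, round(keep * len(ranked)))
    return ranked[:num_kept]


# Five made-up examples; keeping 60% retains the three with the largest losses.
losses = {0: 0.02, 1: 1.30, 2: 0.45, 3: 0.01, 4: 2.10}
kept = drop_easy_examples(losses, keep=0.6)
print(kept)  # [4, 1, 2]
```

The “pick” step is then simply a reset of the current training set to the full list of indices.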

We experiment with different choices of hyper-parameters in Section 4.2.1, except for the list of reuse epochs. Since we periodically reduce the learning rate during training, we simply use the epochs at which the learning rate is reduced as this list. During training, the standard softmax cross-entropy loss is employed for optimization.

4 Experiments

(a)
| Warm-up | Accuracy | Computation |
| --- | --- | --- |
| all data | 94.54 | 100% |
| baseline | 94.63 | 63% |
| no warm-up | 93.94 | 53% |
| 20 | 94.26 | 58% |
| 40 | 94.63 | 63% |
| 60 | 94.73 | 67% |
| 70 | 94.80 | 69% |
| 90 | 95.04 | 72% |

(b)
| Interval | Accuracy | Computation |
| --- | --- | --- |
| all data | 94.54 | 100% |
| baseline | 94.63 | 63% |
| 2 | 93.21 | 27% |
| 5 | 93.71 | 44% |
| 10 | 94.63 | 63% |
| 15 | 94.63 | 73% |
| 20 | 94.68 | 79% |
| 25 | 95.08 | 83% |

(c)
| Percentage | Accuracy | Computation |
| --- | --- | --- |
| all data | 94.54 | 100% |
| baseline | 94.63 | 63% |
| 0.7 | 93.44 | 35% |
| 0.75 | 93.67 | 39% |
| 0.8 | 94.03 | 45% |
| 0.85 | 94.45 | 52% |
| 0.9 | 94.63 | 63% |
| 0.95 | 94.80 | 78% |

(d)
| Active | Accuracy | Computation |
| --- | --- | --- |
| all data | 94.54 | 100% |
| baseline | 94.63 | 63% |
| 20 | 94.74 | 92% |
| 60 | 94.64 | 72% |
| no active | 94.63 | 63% |

(e)
| Reusing Data | Accuracy | Computation |
| --- | --- | --- |
| all data | 94.54 | 100% |
| baseline | 94.63 | 63% |
| No reusing | 93.80 | 45% |
| Reusing at 150 | 94.63 | 63% |
| Reusing at 150/225 | 94.56 | 73% |

(f)
| Strategy | Accuracy |
| --- | --- |
| Ours (63% computation) | 94.63 |
| Random (63% computation) | 93.96 |
| All data 63% epochs | 94.43 |
| 63% data | 92.21 |

Table 2: Ablation study results, evaluated on the CIFAR10 test set. The baseline is ResNet-110 with warm-up epochs=40, interval=10, percentage=0.9, no active epochs, and all data reused at epoch 150. (a) Ablation study on the number of warm-up epochs. (b) Ablation study on the interval. (c) Ablation study on the percentage. (d) Ablation study on the number of active epochs. (e) Ablation study on strategies for reusing all training data. (f) Ablation study on strategies for training with the same amount of computation.
(a)
| Interval | Accuracy | Computation |
| --- | --- | --- |
| all data | 74.26 | 100% |
| baseline | 74.05 | 79% |
| 10 | 72.45 | 63% |
| 20 | 74.05 | 79% |
| 30 | 74.25 | 86% |

(b)
| Active | Accuracy | Computation |
| --- | --- | --- |
| all data | 74.26 | 100% |
| baseline | 74.05 | 79% |
| 40 | 74.43 | 92% |
| 60 | 74.07 | 87% |
| no active | 74.05 | 79% |

Table 3: Ablation study results, evaluated on the CIFAR100 test set. The baseline is ResNet-110 with warm-up epochs=40, interval=20, percentage=0.9, no active epochs, and all data reused at the same epoch as for CIFAR10. (a) Ablation study on the interval. (b) Ablation study on the number of active epochs.
dataset all data 63% computation 79% computation
CIFAR10 (5000 images/class)
CIFAR10-5000 (500 images/class)
CIFAR100 (500 images/class)
Table 4: Data redundancy in the CIFAR datasets.
| Model | original | re-implement | 63% computation | 83% computation |
| --- | --- | --- | --- | --- |
| ResNet-110 [8] | 93.57 | 94.54 | | |
| ResNet-164 [8] | - | 95.47 | | |
| DenseNet-BC-100 [10] | 95.49 | 95.16 | | |
Table 5: CIFAR10 accuracy (%).
| Model | original | re-implement | 79% computation | 92% computation |
| --- | --- | --- | --- | --- |
| ResNet-110 [8] | - | 74.26 | | |
| ResNet-164 [8] | 74.84 | 77.29 | | |
| DenseNet-BC-100 [10] | 77.73 | 76.88 | | |
Table 6: CIFAR100 accuracy (%).

We present image classification results on multiple common benchmarks with state-of-the-art architectures in this section. We begin with a set of ablation studies to validate the importance and efficiency of our training strategies on the CIFAR10 and CIFAR100 datasets [13]. We then validate our proposed method on a much larger dataset, ImageNet2012 [5]. Finally, we use the models pre-trained on ImageNet2012 for a variety of downstream tasks, including object detection, instance segmentation and human pose estimation. We find that our proposed method, although efficient, generalizes well to other tasks. Furthermore, it can transfer knowledge distilled from hard examples in image classification to downstream tasks: for example, we observe a consistent gain on hard cases in these tasks (e.g. small-object detection/segmentation).

4.1 Implementation Details

All models in this paper are implemented using PyTorch [22]. Unless otherwise stated, models on the CIFAR datasets are trained for 300 epochs, with the learning rate divided by 10 at epochs 150 and 225. On the ImageNet dataset, models are trained for 120 epochs, with the learning rate divided by 10 every 30 epochs. We follow the batch size, learning rate, and weight decay settings of the original implementations.

During training, we accumulate the loss of each image on the fly, with data augmentation applied, at each training iteration. When we sample a subset of the training examples, we simply sort the examples by their losses in descending order and keep the top examples to form the new training set. When we report training computation, the strategy of using all data serves as the comparison baseline (100% computation), and we report the ratio between the total number of examples used by our method and the number used by the baseline as the computation of our method.
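The computation ratio defined above can be estimated without training anything, by tracking the fraction of the dataset used in each epoch. The sketch below is a rough model under stated assumptions: the function name is hypothetical, and the exact trigger rule (the first drop happens one interval after the warm-up or after a reuse epoch, and no drop happens once the active epochs of a cycle have elapsed) is our reading of the schedule, chosen because it reproduces several computation figures in Table 2.

```python
def dap_computation(epochs, warmup, interval, keep, reuse_epochs=(), active=None):
    """Fraction of the all-data training computation used by a DaP-style schedule."""
    fraction = 1.0        # share of the full training set currently in use
    cycle_start = warmup  # epoch from which drop intervals are counted
    total = 0.0
    for epoch in range(epochs):
        if epoch in reuse_epochs:          # "pick": go back to all data
            fraction, cycle_start = 1.0, epoch
        total += fraction                  # cost of this epoch, in full-epoch units
        offset = epoch + 1 - cycle_start   # epochs elapsed in the current cycle
        still_active = active is None or offset < active
        if offset > 0 and offset % interval == 0 and still_active:
            fraction *= keep               # "drop": keep only the hardest examples
    return total / epochs


# CIFAR10 baseline of Table 2: 300 epochs, warm-up 40, interval 10, keep 0.9, reuse at 150.
baseline = dap_computation(300, warmup=40, interval=10, keep=0.9, reuse_epochs=(150,))
no_reuse = dap_computation(300, warmup=40, interval=10, keep=0.9)
print(f"baseline: {baseline:.3f}, no reuse: {no_reuse:.3f}")
```

Under these assumptions the baseline evaluates to about 0.627 and the no-reuse variant to about 0.445, in line with the 63% and 45% reported in Table 2. The sketch ignores integer rounding of the subset size, so small deviations from the reported figures are to be expected for other settings.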

4.2 CIFAR

The CIFAR10 and CIFAR100 datasets [13] both consist of colored natural images of 32×32 pixels. CIFAR10 has 10 classes and CIFAR100 has 100 classes. For both datasets, the training and test sets contain 50,000 and 10,000 images respectively. We adopt standard data augmentation (random crop and random flip) and normalize the data with the channel means and standard deviations. We train our models on the 50,000 training images and report accuracy on the test set.

4.2.1 CIFAR10: Ablation Studies

We conduct a set of thorough ablation studies on the CIFAR10 dataset to verify the importance of our training strategy as well as to find a good combination of hyper-parameters. Since the search space for hyper-parameters is extremely large, the resulting combination is not guaranteed to be optimal. More specifically, we study the influence of following hyper-parameters:

  • warm-up: warm-up epochs at the beginning of the training, during this period, no sampling is performed

  • interval: sampling is performed every interval epochs (a.k.a. frequency)

  • percentage: percentage of examples to keep (w.r.t. the current training examples) when sampling

  • active: active epochs, the number of epochs to perform sampling starting from the epoch that all data is used for training

  • reusing data: whether to reuse all data during training; no reusing: do not reuse all training data; reuse at xx: reusing all data and start over the sampling process at xx epoch

  • strategy: comparison with other strategies for reducing the training computation; random: same as our method but sampling is performed by random selection; all data xx% epochs: use all data but reduce the number of training epochs; xx% data: reduce the amount of training data by random sampling

Results are shown in Table 2. We use the ResNet-110 [8] model. Our baseline uses warm-up epochs=40, interval=10, percentage=0.9, no active epochs, and reuses all data at epoch 150. This baseline setting already achieves a marginal gain of 0.09% over its all-data counterpart while requiring only 63% of the training computation. In Table 2 (a), we find that the more warm-up epochs we use, the better the performance and the more training computation is needed. When no warm-up is used, we observe a significant drop in accuracy, which verifies the importance of using all training data at the early stage of training. In Table 2 (b) and (c), we study the effect of the sampling interval and percentage. Basically, the longer the interval and the larger the percentage, the better the performance and the more computation is required. The same trade-off is observed for the number of active epochs: as shown in Table 2 (d), the fewer the active epochs, the better the performance and the more computation we need. The results of reusing data are in Table 2 (e). We observe a severe drop in accuracy (0.83%) if we do not reuse all data during training, which validates the importance of the “Pick” operation in our DaP approach. Finally, we compare our method with other strategies that use the same training computation in Table 2 (f). First, instead of keeping the most difficult training examples, we use uniform random sampling. There is a drop of 0.67% in accuracy, meaning that mining hard examples is important. Furthermore, we compare our method with a scaled schedule (training with all data but for fewer epochs), and our method performs better by 0.20%. We also compare our method with training on less data. The gap is much larger (2.42%) than for the other strategies. This means that big data helps: although our method uses fewer training examples in total, the full dataset used at the beginning of training is important.

4.2.2 CIFAR100: Ablation Studies

When using the same baseline setting as for CIFAR10, we observe a drop in accuracy from 74.26% to 72.45% (-1.81%), as shown in the third row of Table 3 (a). We hypothesize that this is because CIFAR100 has less redundant data (500 images/class) than CIFAR10 (5000 images/class); we verify this hypothesis in Section 4.2.3. We therefore change the interval to 20 while keeping the other settings the same. Since we have already performed a thorough ablation study on CIFAR10, we only “fine-tune” some of the settings to find a suitable combination of hyper-parameters for CIFAR100, changing only the interval and the number of active epochs. The results are shown in Table 3; we observe the same trends as on CIFAR10. On CIFAR100 in particular, we find it important to set a number of active epochs.

4.2.3 Data Redundancy

We observe an interesting behavior in Sections 4.2.1 and 4.2.2: training with the same amount of computation (e.g. 63%) results in different performance gaps (+0.09 vs. -1.81). We hypothesize that this is caused by different levels of data redundancy in the two datasets. CIFAR10 has 5000 images/class while CIFAR100 has only 500 images/class. Intuitively, it is safer to drop more examples (per class) from CIFAR10 than from CIFAR100. To test this hypothesis, we subsample the CIFAR10 dataset and keep 500 examples per class. We name the constructed dataset CIFAR10-5000 since it ends up with only 5000 training examples (500 images/class).
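A class-balanced subset such as CIFAR10-5000 can be built as sketched below. The text does not say how the examples were chosen within each class, so uniform random selection with a fixed seed is an assumption here, as are the function name and the synthetic label list standing in for the real CIFAR10 labels.

```python
import random
from collections import defaultdict


def subsample_per_class(labels, per_class, seed=0):
    """Pick `per_class` example indices from every class, uniformly at random."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], per_class))
    return sorted(kept)


# Synthetic stand-in for the 50,000 CIFAR10 training labels (10 balanced classes).
labels = [i % 10 for i in range(50000)]
subset = subsample_per_class(labels, per_class=500)
print(len(subset))  # 5000
```

The returned indices can then be used to restrict the training set before running the same DaP schedule as on the full dataset.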

We compare the same training settings on the three datasets; results are shown in Table 4. We observe that when training with 63% computation, the gap on CIFAR10-5000 (-2.14) is comparable to the gap on CIFAR100 (-1.81). These results verify our hypothesis that the large difference in gap between CIFAR10 and CIFAR100 at this computation level is caused by the different levels of data redundancy in the two datasets.

When using more computation, all results are improved. We notice that the improvement of CIFAR100 is much larger than the improvement of CIFAR10-5000. We believe this is because CIFAR100 has a higher level of data redundancy compared with CIFAR10-5000, since CIFAR100 has more training examples.

Comparing the performance on the three datasets, we find that data redundancy is one of the key reasons why dropping training examples does not degrade performance.

Columns: original, re-implementation, 75% computation, 86% computation (top-1 err. and top-5 err. for each)
ResNet-18 [8] - -
ResNet-50 [8]
ResNet-101 [8]
ResNet-152 [8]
DenseNet-121 [10]
DenseNet-169 [10]
MobileNet-V1-1.0 [9] -
Table 7: Single-crop error rates (%) on the ImageNet validation set and training computation comparisons. The original column refers to the results reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation column. The last two columns refer to our training method at different computation levels relative to the baselines. The numbers in brackets denote the performance improvement over the re-implemented baselines.
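| Method | Backbone | Pre-train computation | box AP | box AP50 | box AP75 | box APS | box APM | box APL | mask AP | mask AP50 | mask AP75 | mask APS | mask APM | mask APL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FPN [15] | ResNet-50 | 100% | 35.8 | 57.7 | 38.2 | 20.2 | 39.5 | 45.9 | - | - | - | - | - | - |
| FPN [15] | ResNet-50 | 75% | 35.9 | 57.8 | 38.4 | 21.1 | 39.7 | 46.6 | - | - | - | - | - | - |
| FPN [15] | ResNet-50 | 86% | 36.1 | 57.9 | 38.9 | 20.6 | 39.9 | 46.9 | - | - | - | - | - | - |
| Mask R-CNN [7] | ResNet-50 | 100% | 36.6 | 58.0 | 39.5 | 20.8 | 40.1 | 47.6 | 33.5 | 54.8 | 35.4 | 17.0 | 36.8 | 46.0 |
| Mask R-CNN [7] | ResNet-50 | 75% | 36.8 | 58.4 | 39.7 | 21.4 | 40.5 | 47.7 | 33.8 | 55.3 | 35.8 | 17.7 | 37.2 | 46.3 |
| Mask R-CNN [7] | ResNet-50 | 86% | 36.9 | 58.5 | 39.9 | 21.0 | 40.5 | 47.8 | 33.8 | 55.1 | 35.9 | 17.5 | 37.1 | 46.2 |
Table 8: COCO2017 val detection and segmentation results, reported as box AP and mask AP. Red: best, Blue: second best. In particular, our results using 75% training computation consistently outperform the ones using 100% and 86% on small objects, which suggests our models have successfully transferred the knowledge learned on hard cases during pre-training.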

4.2.4 Comparison with other Architectures

To validate that our proposed method generalizes to architectures other than ResNet-110, we perform experiments on two more state-of-the-art architectures, ResNet-164 [8] and DenseNet-BC-100 [10], on both CIFAR10 and CIFAR100 with two computation levels.

Results are shown in Table 5 and Table 6. On all these architectures, our method achieves on-par or even better results on both datasets.

4.3 ImageNet

The ImageNet2012 classification dataset [5] consists of 1,281,167 training images and 50,000 validation images in 1,000 classes. Following [29], we adopt random-crop augmentation with scale and aspect-ratio jitter, together with random horizontal flips. All patches are resized to a fixed resolution and normalized with the channel means and standard deviations. We report single-crop classification errors on the validation set for fair comparison.

4.3.1 Hyper-parameters Selection

It is time-consuming to run ablation studies on such a large-scale dataset to find the optimal set of hyper-parameters, so we directly scale the hyper-parameters found on CIFAR10, taking the data redundancy level into account. The ImageNet2012 dataset has around 1,200 images per class, and models are trained for fewer epochs than on CIFAR (Section 4.1). More specifically, the interval used on CIFAR10 is 10 and we scale it down by a factor of 5 = 150/30, because models are trained for 150 epochs at a given learning rate on CIFAR10 but for only 30 epochs at a given learning rate on ImageNet. Thus we use warm-up epochs=10, interval=2, percentage=0.9, active=10, and reuse all data at the epochs where the learning rate is reduced (Section 3.3).

4.3.2 Comparison with SOTA Architectures

In this section, we apply our method to state-of-the-art architectures: ResNet [8], DenseNet [10], and MobileNet [9]. Results are shown in Table 7. We report both our re-implementation results and the original results from the respective papers; our reproduced results are always better than the original ones. For a fair comparison, we compare our method only with the reproduced results.

With the hyper-parameters of Section 4.3.1, the training computation is only 75% of that needed when all data is used, while the drop in accuracy is less than 0.5%, which is negligible. When we adjust the keep percentage so that the training computation is 86%, we observe a consistent improvement in accuracy on most architectures.

| Method | Backbone | Pre-train computation | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SimpleBaseline [32] | ResNet-50 | 100% | 70.4 | 91.4 | 78.2 | 67.7 | 74.4 | 73.5 | 92.1 | 80.5 | 70.4 | 78.3 |
| SimpleBaseline [32] | ResNet-50 | 75% | 70.7 | 91.5 | 78.2 | 67.9 | 75.1 | 74.0 | 92.6 | 80.7 | 70.7 | 78.9 |
| SimpleBaseline [32] | ResNet-50 | 86% | 70.5 | 91.5 | 78.1 | 67.5 | 74.9 | 73.7 | 92.3 | 80.3 | 70.3 | 78.7 |
| SimpleBaseline [32] | ResNet-101 | 100% | 72.0 | 91.5 | 79.4 | 69.2 | 76.4 | 75.3 | 93.0 | 82.0 | 72.0 | 80.2 |
| SimpleBaseline [32] | ResNet-101 | 75% | 72.4 | 91.5 | 80.3 | 69.4 | 76.8 | 75.6 | 92.8 | 82.4 | 72.4 | 80.5 |
| SimpleBaseline [32] | ResNet-101 | 86% | 72.6 | 92.5 | 80.3 | 69.6 | 77.0 | 75.7 | 93.1 | 82.4 | 72.4 | 80.7 |
Table 9: COCO2017 val keypoint results, reported as keypoint AP and keypoint AR. Red: best, Blue: second best. The keypoint results are generally in accordance with those in Table 8 in terms of the knowledge transferability of our pre-trained models.

4.4 Transfer Knowledge to Downstream Tasks

Over the years, ImageNet2012 [5] has become the standard dataset for pre-training models for multiple downstream tasks (e.g. object detection [4, 30, 3], segmentation [34, 2], human pose estimation [32, 28]). To demonstrate that our approach generalizes well to these downstream tasks and can even transfer knowledge distilled during pre-training, we evaluate our pre-trained models on a wide range of downstream tasks, including object detection, instance segmentation, and human pose estimation.

4.4.1 Object Detection and Instance Segmentation

We evaluate object detection and instance segmentation performance on the COCO2017 dataset [17]. COCO2017 has 115k training images, 5k validation images and 20k test images. We train FPN [15] (for object detection) and Mask R-CNN [7] (for instance segmentation) on the 115k training set and evaluate on the 5k validation set. We use the open-source framework mmdetection [1] with a single scale of [800, 1333] during training and testing.

Results are shown in Table 8. For a fair comparison, all models are initialized with the corresponding PyTorch pre-trained models. We notice that the baseline result (pre-train computation: 100%) is lower than the one reported in [15, 7] (35.8 compared with 36.7). This is because [15, 7] use the original Caffe models released by [8], whose BatchNorm means and variances are computed as exact averages (not moving averages) over a sufficiently large training batch after the training procedure. Since we fix the BatchNorm layers during training, re-calibrated means and variances are important for better results. We do not re-calibrate the BatchNorm statistics as we do not know exactly how [8] computes the mean and variance.

For both the object detection and instance segmentation tasks, models pre-trained with our method (75% computation and 86% computation) perform on par with models pre-trained with all data (100% computation) in terms of the AP metric. This means our method can be used as an efficient way to pre-train models.

However, if we focus on the AP metric for small objects, which is the most difficult case in object detection and instance segmentation, we observe a consistent gain. This suggests that the knowledge distilled from hard examples in image classification is transferable to hard cases in downstream tasks. More interestingly, the gain of the 75% pre-trained model (+0.9 , +0.7 ) on small objects is larger than the gain of the 86% pre-trained model (+0.4 , +0.5 ). This observation suggests that dropping more easy examples during pre-training might improve performance on hard cases (small objects) in object detection and instance segmentation.
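For reference, the small-object metric follows the standard COCO size buckets, which group objects by pixel area using thresholds of 32² and 96². The sketch below shows that bucketing; the function name `coco_size_bucket` is hypothetical, and the handling of areas that fall exactly on a threshold is a simplifying assumption.

```python
def coco_size_bucket(area):
    """Map an object's area in pixels to a COCO-style size bucket."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


if __name__ == "__main__":
    for side in (30, 50, 100):
        print(side, "x", side, "->", coco_size_bucket(side * side))
```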

4.4.2 Human Pose Estimation

We further evaluate performance on the human pose estimation task using the COCO2017 keypoint dataset [17]. It contains more than 200k images and 250k person instances labeled with keypoints, of which 150k instances are publicly available for training and validation. We train only on COCO2017 train (57K images and 150K person instances) and evaluate on the COCO2017 val set. We use the same implementation as [32] and a single input size of during training and testing.

Results are shown in Table 9. All reported baseline results are similar to or even better than the ones reported in [32]. We observe a consistent improvement in the human pose estimation task. Since the human pose estimation task does not consider small (less than pixels) human instances, the hard case in this task is easier than that in object detection and semantic segmentation.

We further perform experiments with a ResNet-101 backbone in human pose estimation. Interestingly, with 86% pre-train computation, our DaP pre-trained model exceeds the baseline by a large margin of 0.6 . This shows that DaP pre-trained models have the potential to further improve human pose estimation.

4.5 Discussions

Experiments show that the DaP training method can achieve comparable performance with less computation (less training time), even on a large-scale image classification dataset. Furthermore, results show that different pre-training strategies lead to different performance in downstream tasks. In particular, we find that DaP pre-trained models preserve generalization ability, in that they do not degrade the performance of downstream tasks, e.g. object detection, instance segmentation and human pose estimation. More interestingly, DaP pre-trained models perform consistently better than models pre-trained on all data, especially on hard cases (small object detection). A plausible hypothesis (which we cannot verify currently) is that a DaP pre-trained model distills knowledge from hard examples in the pre-training task (i.e. image classification) and that this knowledge is transferable to downstream tasks.

5 Conclusion

In this paper, we propose an efficient training method named “Drop-and-Pick” (DaP), aimed at reducing training time given limited computational resources. The method is validated on various image classification benchmarks. Specifically, we achieve comparable performance using only 75% of the computation on the ImageNet dataset. We further show that DaP pre-trained models suffer no loss in generalization ability in many downstream tasks. More interestingly, we observe consistent improvements on hard cases in these downstream tasks when DaP pre-trained models are used. We leave it as future work to study how downstream tasks can benefit from different pre-training policies.

Acknowledgements. This work is in part supported by IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM AI Horizons Network, and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/ Interior Business Center (DOI/IBC) contract number D17PC00341. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. The authors thank Microsoft AI & Research for providing ImageNet baseline results and Jiashi Feng, Zilong Huang, Bin Xiao, Lei Zhang for helpful discussions.