Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the literature, however, most refinements are either briefly mentioned as implementation details or only visible in source code. In this paper, we will examine a collection of such refinements and empirically evaluate their impact on the final model accuracy through ablation study. We will show that, by combining these refinements together, we are able to improve various CNN models significantly. For example, we raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. We will also demonstrate that improvement on image classification accuracy leads to better transfer learning performance in other application domains such as object detection and semantic segmentation.
Since the introduction of AlexNet  in 2012, deep convolutional neural networks have become the dominating approach for image classification. Various new architectures have been proposed since then, including VGG , NiN , Inception , ResNet , DenseNet , and NASNet . At the same time, we have seen a steady trend of model accuracy improvement. For example, the top-1 validation accuracy on ImageNet  has been raised from 62.5% (AlexNet) to 82.7% (NASNet-A).
However, these advancements did not solely come from improved model architecture. Training procedure refinements, including changes in loss functions, data preprocessing, and optimization methods, also played a major role. A large number of such refinements have been proposed in the past years, but they have received relatively little attention. In the literature, most were only briefly mentioned as implementation details while others can only be found in source code.
In this paper, we will examine a collection of training procedure and model architecture refinements that improve model accuracy but barely change computational complexity. Many of them are minor "tricks" like modifying the stride size of a particular convolution layer or adjusting the learning rate schedule. Collectively, however, they make a big difference. We will evaluate them on multiple network architectures and datasets and report their impact on the final model accuracy.
|Model||FLOPs||top-1||top-5|
|ResNet-50||3.9 G||75.3||92.2|
|ResNeXt-50||4.2 G||77.8||-|
|SE-ResNet-50||3.9 G||76.71||93.38|
|SE-ResNeXt-50||4.3 G||78.90||94.51|
|DenseNet-201||4.3 G||77.42||93.66|
|ResNet-50 + tricks (ours)||4.3 G||79.29||94.63|
Our empirical evaluation shows that several tricks lead to significant accuracy improvement and combining them together can further boost the model accuracy. We compare ResNet-50, after applying all tricks, to other related networks in Table 1. Note that these tricks raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on ImageNet. It also outperforms other newer and improved network architectures, such as SE-ResNeXt-50. In addition, we show that our approach can generalize to other networks (Inception V3 and MobileNet) and datasets (Places365). We further show that models trained with our tricks bring better transfer learning performance in other application domains such as object detection and semantic segmentation.
We first set up a baseline training procedure in Section 2, and then discuss several tricks that are useful for efficient training on new hardware in Section 3. In Section 4 we review three minor model architecture tweaks for ResNet and propose a new one. Four additional training procedure refinements are then discussed in Section 5. Finally, we study whether these more accurate models can help transfer learning in Section 6.
Our model implementations and training scripts are publicly available in GluonCV (https://github.com/dmlc/gluon-cv).
The template of training a neural network with mini-batch stochastic gradient descent is shown in Algorithm 1. In each iteration, we randomly sample $b$ images to compute the gradients and then update the network parameters. It stops after $K$ passes through the dataset. All functions and hyper-parameters in Algorithm 1 can be implemented in many different ways. In this section, we first specify a baseline implementation of Algorithm 1.
We follow a widely used implementation  of ResNet as our baseline. The preprocessing pipelines between training and validation are different. During training, we perform the following steps one-by-one:
1. Randomly sample an image and decode it into 32-bit floating point raw pixel values in $[0, 255]$.
2. Randomly crop a rectangular region whose aspect ratio is randomly sampled in $[3/4, 4/3]$ and area randomly sampled in $[8\%, 100\%]$, then resize the cropped region into a 224-by-224 square image.
3. Flip horizontally with 0.5 probability.
4. Scale hue, saturation, and brightness with coefficients uniformly drawn from $[0.6, 1.4]$.
5. Add PCA noise with a coefficient sampled from a normal distribution $\mathcal{N}(0, 0.1)$.
6. Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.
During validation, we resize each image's shorter edge to 256 pixels while keeping its aspect ratio. Next, we crop out the 224-by-224 region in the center and normalize RGB channels in the same way as in training. We do not perform any random augmentations during validation.
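The random-crop sizing and the channel normalization described above can be sketched as follows. This is a minimal illustration, not the baseline's actual code: the function names are hypothetical, uniform sampling of the area fraction and aspect ratio is an assumption (the text only says "randomly sampled"), and the whole-image fallback is an assumption as well. The ranges used are the 8% to 100% area and 3/4 to 4/3 aspect ratio of the standard random-resized-crop recipe.

```python
import math
import random

# Per-channel mean and standard deviation quoted in the text (RGB order).
MEAN = (123.68, 116.779, 103.939)
STD = (58.393, 57.12, 57.375)


def sample_crop_size(width, height, rng, max_tries=10):
    """Sample a crop (w, h) with area in [8%, 100%] of the image and
    aspect ratio w/h in [3/4, 4/3]. Uniform sampling is an assumption."""
    for _ in range(max_tries):
        area = rng.uniform(0.08, 1.0) * width * height
        ratio = rng.uniform(3 / 4, 4 / 3)
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= width and 0 < h <= height:
            return w, h
    # Fallback (an assumption): use the whole image if no valid crop is found.
    return width, height


def normalize_pixel(rgb):
    """Subtract the channel mean and divide by the channel std."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


rng = random.Random(0)
crop_w, crop_h = sample_crop_size(500, 375, rng)
```

The sampled region would then be resized to 224-by-224; resizing and the color augmentations are left out to keep the sketch short.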
The weights of both convolutional and fully-connected layers are initialized with the Xavier algorithm. In particular, we set the parameter to random values uniformly drawn from $[-a, a]$, where $a = \sqrt{6 / (d_{in} + d_{out})}$. Here $d_{in}$ and $d_{out}$ are the input and output channel sizes, respectively. All biases are initialized to 0. For batch normalization layers, $\gamma$ vectors are initialized to 1 and $\beta$ vectors to 0.
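As a small illustration of the Xavier uniform rule, the sketch below draws a weight matrix from $[-a, a]$ with $a = \sqrt{6 / (d_{in} + d_{out})}$. The function name and the nested-list layout are hypothetical choices made for this example; a real framework would use its own tensor type and initializer.

```python
import math
import random


def xavier_uniform(d_in, d_out, rng):
    """Return (bound, weights): a d_out x d_in weight matrix with entries
    drawn uniformly from [-a, a], where a = sqrt(6 / (d_in + d_out))."""
    a = math.sqrt(6.0 / (d_in + d_out))
    weights = [[rng.uniform(-a, a) for _ in range(d_in)] for _ in range(d_out)]
    return a, weights


bound, w = xavier_uniform(64, 128, random.Random(0))
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant across layers.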
Nesterov Accelerated Gradient (NAG) descent is used for training. Each model is trained for 120 epochs on 8 Nvidia V100 GPUs with a total batch size of 256. The learning rate is initialized to 0.1 and divided by 10 at the 30th, 60th, and 90th epochs.
We evaluate three CNNs: ResNet-50, Inception-V3, and MobileNet. For Inception-V3 we resize the input images into 299x299. We use the ILSVRC2012 dataset, which has 1.3 million images for training and 1000 classes. The validation accuracies are shown in Table 2. As can be seen, our ResNet-50 results are slightly better than the reference results, while our baseline Inception-V3 and MobileNet are slightly lower in accuracy due to a different training procedure.
Hardware, especially GPUs, has been rapidly evolving in recent years. As a result, the optimal choices for many performance related trade-offs have changed. For example, it is now more efficient to use lower numerical precision and larger batch sizes during training. In this section, we review various techniques that enable low precision and large batch training without sacrificing model accuracy. Some techniques can even improve both accuracy and training speed.
Mini-batch SGD groups multiple samples into a mini-batch to increase parallelism and decrease communication costs. Using a large batch size, however, may slow down the training progress. For convex problems, the convergence rate decreases as the batch size increases. Similar empirical results have been reported for neural networks. In other words, for the same number of epochs, training with a large batch size results in a model with degraded validation accuracy compared to the ones trained with smaller batch sizes.
In mini-batch SGD, gradient descent is a random process because the examples are randomly selected in each batch. Increasing the batch size does not change the expectation of the stochastic gradient but reduces its variance. In other words, a large batch size reduces the noise in the gradient, so we may increase the learning rate to make larger progress along the opposite of the gradient direction. Goyal et al. report that linearly increasing the learning rate with the batch size works empirically for ResNet-50 training. In particular, if we follow He et al. to choose 0.1 as the initial learning rate for batch size 256, then when changing to a larger batch size $b$, we will increase the initial learning rate to $0.1 \times b / 256$.
At the beginning of the training, all parameters are typically random values and therefore far away from the final solution. Using too large a learning rate may result in numerical instability. In the warmup heuristic, we use a small learning rate at the beginning and then switch back to the initial learning rate when the training process is stable. Goyal et al. propose a gradual warmup strategy that increases the learning rate from 0 to the initial learning rate linearly. In other words, assume we will use the first $m$ batches (e.g. 5 data epochs) to warm up, and the initial learning rate is $\eta$, then at batch $i$, $1 \le i \le m$, we will set the learning rate to be $i\eta / m$.
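The linear scaling rule and the gradual warmup rule can be written as two small functions. This is a sketch under the assumptions stated in the text (a reference learning rate of 0.1 at batch size 256, and a warmup of $m$ batches ending at the initial learning rate $\eta$); the function and argument names are hypothetical.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: learning rate grows proportionally with batch size."""
    return base_lr * batch_size / base_batch_size


def warmup_lr(i, m, eta):
    """Gradual warmup: learning rate at batch i (1 <= i <= m) is i * eta / m."""
    if not 1 <= i <= m:
        raise ValueError("warmup_lr is only defined for 1 <= i <= m")
    return i * eta / m


eta = scaled_lr(1024)                 # initial learning rate for batch size 1024
first = warmup_lr(1, 1000, eta)       # tiny learning rate at the first batch
last = warmup_lr(1000, 1000, eta)     # reaches eta at the end of warmup
```

After batch $m$ the regular schedule (for example step or cosine decay) takes over, starting from $\eta$.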
A ResNet network consists of multiple residual blocks, and each block consists of several convolutional layers. Given input $x$, assume $\mathrm{block}(x)$ is the output of the last layer in the block; this residual block then outputs $x + \mathrm{block}(x)$. Note that the last layer of a block could be a batch normalization (BN) layer. The BN layer first standardizes its input, denoted by $\hat{x}$, and then performs a scale transformation $\gamma \hat{x} + \beta$. Both $\gamma$ and $\beta$ are learnable parameters whose elements are initialized to 1s and 0s, respectively. In the zero $\gamma$ initialization heuristic, we initialize $\gamma = 0$ for all BN layers that sit at the end of a residual block. Therefore, all residual blocks just return their inputs, which mimics a network that has fewer layers and is easier to train at the initial stage.
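The effect of zero $\gamma$ initialization can be checked with a toy residual block. The sketch below is only an illustration: vectors are plain Python lists, the standardized branch output is passed in directly, and the function names are hypothetical.

```python
def bn_scale(x_hat, gamma, beta):
    """Scale transformation of a BN layer applied to standardized input x_hat."""
    return [gamma * v + beta for v in x_hat]


def residual_block(x, branch_standardized, gamma, beta=0.0):
    """Output x + BN(branch), where the last layer of the branch is a BN layer."""
    scaled = bn_scale(branch_standardized, gamma, beta)
    return [xi + si for xi, si in zip(x, scaled)]


x = [1.0, -2.0]
branch = [0.3, 0.7]                     # whatever the convolutions produced
at_init = residual_block(x, branch, gamma=0.0)   # block acts as the identity
default = residual_block(x, branch, gamma=1.0)   # standard initialization
```

With $\gamma = 0$ and $\beta = 0$ the branch contributes nothing, so the block returns its input until training moves $\gamma$ away from zero.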
Weight decay is often applied to all learnable parameters, including both weights and biases. It is equivalent to applying an L2 regularization to all parameters to drive their values towards 0. As pointed out by Jia et al., however, it is recommended to only apply the regularization to weights to avoid overfitting. The no bias decay heuristic follows this recommendation: it only applies the weight decay to the weights in convolution and fully-connected layers. Other parameters, including the biases and $\gamma$ and $\beta$ in BN layers, are left unregularized.
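One way to implement the no bias decay heuristic is to split parameters into two groups before handing them to the optimizer. The sketch below assumes a hypothetical mapping from parameter names to shapes and uses dimensionality as the criterion (convolution and fully-connected weights have at least two dimensions, while biases and BN $\gamma$ and $\beta$ are one-dimensional); real frameworks offer their own ways to express per-group weight decay.

```python
def split_weight_decay(named_shapes):
    """Split parameter names into (decay, no_decay) lists.

    named_shapes maps a parameter name to its shape tuple. The shape-based
    rule is an assumption of this sketch, not a framework API.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes.items():
        if len(shape) >= 2:
            decay.append(name)       # conv / fully-connected weights
        else:
            no_decay.append(name)    # biases, BN gamma and beta
    return decay, no_decay


params = {
    "conv1.weight": (64, 3, 7, 7),
    "conv1.bias": (64,),
    "bn1.gamma": (64,),
    "bn1.beta": (64,),
    "fc.weight": (1000, 2048),
    "fc.bias": (1000,),
}
decay, no_decay = split_weight_decay(params)
```

The `decay` group would receive the usual weight decay coefficient and the `no_decay` group a coefficient of 0.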
Note that LARS offers layer-wise adaptive learning rates and is reported to be effective for extremely large batch sizes (beyond 16K). In this paper, however, we limit ourselves to methods that are sufficient for single machine training, in which case a batch size of no more than 2K often leads to good system efficiency.
Neural networks are commonly trained with 32-bit floating point (FP32) precision. That is, all numbers are stored in FP32 format and both inputs and outputs of arithmetic operations are FP32 numbers as well. New hardware, however, may have enhanced arithmetic logic units for lower precision data types. For example, the previously mentioned Nvidia V100 offers 14 TFLOPS in FP32 but over 100 TFLOPS in FP16. As in Table 3, the overall training speed is accelerated by 2 to 3 times after switching from FP32 to FP16 on V100.
Despite the performance benefit, a reduced precision has a narrower range that makes results more likely to be out-of-range and then disturb the training progress. Micikevicius et al. propose to store all parameters and activations in FP16 and use FP16 to compute gradients. At the same time, all parameters have a copy in FP32 for parameter updating. In addition, multiplying the loss by a scalar to better align the range of the gradient with FP16 is also a practical solution.
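The motivation for loss scaling can be seen with a single number. In the sketch below, a small gradient value underflows to zero when cast to FP16, but survives if it is multiplied by a scale factor before the cast and divided by the same factor afterwards in higher precision. Since gradients are linear in the loss, scaling the gradient here stands in for scaling the loss. The helper names and the scale factor of 1024 are assumptions made for illustration; the cast uses the half-precision format code `"e"` of Python's `struct` module.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def scaled_fp16_grad(grad, scale):
    """Scale before the FP16 cast, then unscale in higher precision."""
    return to_fp16(grad * scale) / scale


tiny_grad = 1e-8
naive = to_fp16(tiny_grad)                        # lost: below the FP16 range
recovered = scaled_fp16_grad(tiny_grad, 1024.0)   # close to the true value
```

In a full mixed-precision setup the unscaled gradient would then update the FP32 copy of the parameters.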
The evaluation results for ResNet-50 are shown in Table 3. Compared to the baseline with batch size 256 and FP32, using a larger batch size of 1024 and FP16 reduces the training time for ResNet-50 from 13.3 min per epoch to 4.4 min per epoch. In addition, by stacking all heuristics for large-batch training, the model trained with batch size 1024 and FP16 even achieves a slightly higher top-1 accuracy (by 0.5%) than the baseline model.
The ablation study of all heuristics is shown in Table 4. Increasing the batch size from 256 to 1024 by linearly scaling the learning rate alone leads to a 0.9% decrease of the top-1 accuracy, while stacking the remaining three heuristics bridges the gap. Switching from FP32 to FP16 at the end of training does not affect the accuracy.
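|Model||Efficient Time/epoch||Efficient Top-1||Efficient Top-5||Baseline Time/epoch||Baseline Top-1||Baseline Top-5|
|ResNet-50||4.4 min||76.21||92.97||13.3 min||75.87||92.70|
|Inception-V3||8 min||77.50||93.60||19.8 min||77.32||93.43|
|MobileNet||3.7 min||71.90||90.47||6.2 min||69.03||88.71|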
|Heuristic||BS=256 Top-1||BS=256 Top-5||BS=1024 Top-1||BS=1024 Top-5|
|+ LR warmup||76.03||92.81||75.93||92.84|
|+ No bias decay||76.16||92.97||76.03||92.86|
A model tweak is a minor adjustment to the network architecture, such as changing the stride of a particular convolution layer. Such a tweak often barely changes the computational complexity but might have a non-negligible effect on the model accuracy. In this section, we will use ResNet as an example to investigate the effects of model tweaks.
We will briefly present the ResNet architecture, especially its modules related to the model tweaks. For detailed information please refer to He et al. A ResNet network consists of an input stem, four subsequent stages and a final output layer, which is illustrated in Figure 1. The input stem has a $7 \times 7$ convolution with an output channel of 64 and a stride of 2, followed by a $3 \times 3$ max pooling layer also with a stride of 2. The input stem reduces the input width and height by 4 times and increases its channel size to 64.
Starting from stage 2, each stage begins with a downsampling block, which is then followed by several residual blocks. In the downsampling block, there are path A and path B. Path A has three convolutions, whose kernel sizes are $1 \times 1$, $3 \times 3$ and $1 \times 1$, respectively. The first convolution has a stride of 2 to halve the input width and height, and the last convolution's output channel is 4 times larger than the previous two, which is called the bottleneck structure. Path B uses a $1 \times 1$ convolution with a stride of 2 to transform the input shape to be the output shape of path A, so we can sum outputs of both paths to obtain the output of the downsampling block. A residual block is similar to a downsampling block except for only using convolutions with a stride of 1.
One can vary the number of residual blocks in each stage to obtain different ResNet models, such as ResNet-50 and ResNet-152, where the number represents the number of convolutional layers in the network.
Next, we revisit two popular ResNet tweaks, which we call ResNet-B and ResNet-C, respectively. We then propose a new model tweak, ResNet-D.
This tweak first appeared in a Torch implementation of ResNet and was then adopted by multiple works [7, 12, 27]. It changes the downsampling block of ResNet. The observation is that the first convolution in path A ignores three-quarters of the input feature map because it uses a kernel size of $1 \times 1$ with a stride of 2. ResNet-B switches the stride sizes of the first two convolutions in path A, as shown in Figure (a), so no information is ignored. Because the second convolution has a kernel size of $3 \times 3$, the output shape of path A remains unchanged.
This tweak was proposed in Inception-v2 originally, and it can be found in the implementations of other models, such as SENet, PSPNet, DeepLabV3, and ShuffleNetV2. The observation is that the computational cost of a convolution is quadratic in the kernel width or height. A $7 \times 7$ convolution is 5.4 times more expensive than a $3 \times 3$ convolution ($49 / 9 \approx 5.4$). So this tweak replaces the $7 \times 7$ convolution in the input stem with three conservative $3 \times 3$ convolutions, as shown in Figure (b), in which the first and second convolutions have an output channel of 32 and a stride of 2, while the last convolution uses 64 output channels.
Inspired by ResNet-B, we note that the $1 \times 1$ convolution in path B of the downsampling block also ignores 3/4 of the input feature maps, and we would like to modify it so that no information is ignored. Empirically, we found that adding a $2 \times 2$ average pooling layer with a stride of 2 before the convolution, whose stride is changed to 1, works well in practice and has little impact on the computational cost. This tweak is illustrated in Figure (c).
|Model||#params||FLOPs||Top-1||Top-5|
|ResNet-50||25 M||3.8 G||76.21||92.97|
|ResNet-50-B||25 M||4.1 G||76.66||93.28|
|ResNet-50-C||25 M||4.3 G||76.87||93.48|
|ResNet-50-D||25 M||4.3 G||77.16||93.52|
The results suggest that ResNet-B, which receives more information in path A of the downsampling blocks, improves validation accuracy by around 0.5% compared to ResNet-50. Replacing the $7 \times 7$ convolution with three $3 \times 3$ ones gives another 0.2% improvement. Taking more information in path B of the downsampling blocks improves the validation accuracy by another 0.3%. In total, ResNet-50-D improves ResNet-50 by 1%.
On the other hand, these four models have the same model size. ResNet-D has the largest computational cost, but its difference compared to ResNet-50 is within 15% in terms of floating point operations. In practice, we observed ResNet-50-D is only 3% slower in training throughput compared to ResNet-50.
In this section, we will describe four training refinements that aim to further improve the model accuracy.
Learning rate adjustment is crucial to the training. After the learning rate warmup described in Section 3.1, we typically steadily decrease the value from the initial learning rate. The widely used strategy is exponentially decaying the learning rate. He et al. multiply the rate by 0.1 every 30 epochs; we call it "step decay". Szegedy et al. multiply the rate by 0.94 every two epochs.
In contrast, Loshchilov et al. propose a cosine annealing strategy. A simplified version decreases the learning rate from the initial value to 0 by following the cosine function. Assume the total number of batches is $T$ (the warmup stage is ignored), then at batch $t$, the learning rate $\eta_t$ is computed as:
$$\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta,$$
where $\eta$ is the initial learning rate. We call this scheduling "cosine" decay.
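The two schedules can be compared with a short sketch. It assumes the simplified cosine rule, in which the learning rate at step $t$ out of $T$ is $\frac{1}{2}(1 + \cos(t\pi/T))$ times the initial rate, and a step decay that multiplies the rate by 0.1 every 30 epochs. The function names are hypothetical, and the step unit (batch or epoch) is left to the caller.

```python
import math


def cosine_lr(t, T, eta):
    """Cosine decay from eta at t = 0 to 0 at t = T (warmup ignored)."""
    return 0.5 * (1.0 + math.cos(t * math.pi / T)) * eta


def step_lr(epoch, eta, step=30, factor=0.1):
    """Step decay: multiply the rate by `factor` every `step` epochs."""
    return eta * factor ** (epoch // step)


# Compare both schedules over a 120-epoch run with initial rate 0.1.
schedule = [(e, cosine_lr(e, 120, 0.1), step_lr(e, 0.1)) for e in (0, 30, 60, 90, 120)]
```

At epoch 30, for instance, the cosine schedule is still at roughly 85% of the initial rate while step decay has already dropped to 10% of it.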
The comparison between step decay and cosine decay is illustrated in Figure (a). As can be seen, the cosine decay decreases the learning rate slowly at the beginning, then becomes almost linearly decreasing in the middle, and slows down again at the end. Compared to the step decay, the cosine decay starts to decay the learning rate from the beginning but remains large until step decay reduces the learning rate by 10x, which potentially improves the training progress.
The last layer of an image classification network is often a fully-connected layer with a hidden size equal to the number of labels, denoted by $K$, to output the predicted confidence scores. Given an image, denote by $z_i$ the predicted score for class $i$. These scores can be normalized by the softmax operator to obtain predicted probabilities. Denote by $q$ the output of the softmax operator, $q = \mathrm{softmax}(z)$; the probability for class $i$, $q_i$, can be computed by:
$$q_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}.$$
It's easy to see $q_i > 0$ and $\sum_{i=1}^{K} q_i = 1$, so $q$ is a valid probability distribution.
On the other hand, assume the true label of this image is $y$; we can construct a truth probability distribution $p$ with $p_i = 1$ if $i = y$ and 0 otherwise. During training, we minimize the negative cross entropy loss
$$\ell(p, q) = -\sum_{i=1}^{K} p_i \log q_i$$
to update model parameters to make these two probability distributions similar to each other. In particular, by the way $p$ is constructed, we know $\ell(p, q) = -\log q_y = -z_y + \log\left(\sum_{i=1}^{K} \exp(z_i)\right)$. The optimal solution is $z_y^* = \infty$ while keeping others small enough. In other words, it encourages the output scores to be dramatically distinctive, which potentially leads to overfitting.
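The softmax operator and the cross entropy loss with a one-hot target can be sketched in a few lines. The names below are hypothetical; `q` is the softmax output and `p` the one-hot truth distribution, and the maximum score is subtracted before exponentiation for numerical stability (a standard trick that does not change the result).

```python
import math


def softmax(z):
    """Normalize scores z into probabilities; subtracting max(z) avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """Negative cross entropy -sum_i p_i log q_i (terms with p_i = 0 are skipped)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def one_hot(y, K):
    """Truth distribution: 1 for the true class y, 0 elsewhere."""
    return [1.0 if i == y else 0.0 for i in range(K)]


z = [2.0, 0.5, -1.0]
q = softmax(z)
loss = cross_entropy(one_hot(0, 3), q)
closed_form = -z[0] + math.log(sum(math.exp(v) for v in z))
```

With a one-hot target the loss reduces to the negative score of the true class plus the log-sum-exp of all scores, which is what `closed_form` computes.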
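The idea of label smoothing was first proposed to train Inception-v2. It changes the construction of the true probability to
$$p_i = \begin{cases} 1 - \varepsilon & \text{if } i = y, \\ \varepsilon / (K - 1) & \text{otherwise,} \end{cases}$$
where $\varepsilon$ is a small constant. Now the optimal solution becomes
$$z_i^* = \begin{cases} \log\left((K - 1)(1 - \varepsilon)/\varepsilon\right) + \alpha & \text{if } i = y, \\ \alpha & \text{otherwise,} \end{cases}$$
where $\alpha$ can be an arbitrary real number. This encourages a finite output from the fully-connected layer and can generalize better.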
When $\varepsilon = 0$, the gap $\log\left((K - 1)(1 - \varepsilon)/\varepsilon\right)$ will be $\infty$, and as $\varepsilon$ increases, the gap decreases. Specifically, when $\varepsilon = (K - 1)/K$, all optimal $z_i^*$ will be identical. Figure (a) shows how the gap changes as we move $\varepsilon$, given $K = 1000$ for the ImageNet dataset.
We empirically compare the output values from two ResNet-50-D models that are trained with and without label smoothing, respectively, and calculate the gap between the maximum prediction value and the average of the rest. Under $\varepsilon = 0.1$ and $K = 1000$, the theoretical gap is around 9.1. Figure (b) shows the gap distributions from the two models predicting over the validation set of ImageNet. It is clear that with label smoothing the distribution centers at the theoretical value and has fewer extreme values.
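The smoothed target distribution and the resulting optimal gap can be computed directly. The sketch assumes the label-smoothing rule that gives the true class probability $1 - \varepsilon$ and spreads $\varepsilon$ evenly over the other $K - 1$ classes, so that the optimal gap between the true-class score and the others is $\log((K-1)(1-\varepsilon)/\varepsilon)$. The function names are hypothetical.

```python
import math


def smoothed_targets(y, K, eps):
    """Target distribution: 1 - eps for class y, eps / (K - 1) for the others."""
    return [1.0 - eps if i == y else eps / (K - 1) for i in range(K)]


def optimal_gap(K, eps):
    """Gap between the optimal true-class score and the other scores."""
    return math.log((K - 1) * (1.0 - eps) / eps)


targets = smoothed_targets(3, 1000, 0.1)
gap_imagenet = optimal_gap(1000, 0.1)        # about 9.1 for K = 1000, eps = 0.1
gap_uniform = optimal_gap(1000, 999 / 1000)  # vanishes when eps = (K - 1) / K
```

The gap shrinks as $\varepsilon$ grows and reaches zero when the smoothed targets become uniform.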
In knowledge distillation, we use a teacher model to help train the current model, which is called the student model. The teacher model is often a pre-trained model with higher accuracy, so by imitation, the student model is able to improve its own accuracy while keeping the model complexity the same. One example is using a ResNet-152 as the teacher model to help train ResNet-50.
During training, we add a distillation loss to penalize the difference between the softmax outputs from the teacher model and the student model. Given an input, assume $p$ is the true probability distribution, and $z$ and $r$ are outputs of the last fully-connected layer of the student model and the teacher model, respectively. Recall that previously we used a negative cross entropy loss $\ell(p, \mathrm{softmax}(z))$ to measure the difference between $p$ and $z$; here we use the same loss again for the distillation. Therefore, the loss is changed to
$$\ell(p, \mathrm{softmax}(z)) + T^2 \ell(\mathrm{softmax}(r/T), \mathrm{softmax}(z/T)),$$
where $T$ is the temperature hyper-parameter that makes the softmax outputs smoother and thus distills the knowledge of the label distribution from the teacher's prediction.
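The combined loss can be sketched as the usual cross entropy on the true labels plus $T^2$ times the cross entropy between the temperature-softened teacher and student distributions. The function names and the toy scores below are hypothetical; `p` is the truth distribution, `z` the student scores, and `r` the teacher scores.

```python
import math


def softmax(z, T=1.0):
    """Softmax of scores z divided by temperature T (max-subtracted for stability)."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """Negative cross entropy -sum_i p_i log q_i (terms with p_i = 0 are skipped)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(p, z, r, T):
    """Hard-label loss plus T^2 times the softened teacher/student loss."""
    hard = cross_entropy(p, softmax(z))
    soft = cross_entropy(softmax(r, T), softmax(z, T))
    return hard + T * T * soft


p = [1.0, 0.0, 0.0]
z = [2.0, 0.5, -1.0]    # student scores (toy values)
r = [3.0, 1.0, -2.0]    # teacher scores (toy values)
hard_only = cross_entropy(p, softmax(z))
total = distillation_loss(p, z, r, T=20.0)
```

A large temperature flattens both distributions, so the second term emphasizes how the teacher ranks the non-target classes rather than only its top prediction.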
In Section 2.1 we described how images are augmented before training. Here we consider another augmentation method called mixup. In mixup, each time we randomly sample two examples $(x_i, y_i)$ and $(x_j, y_j)$. Then we form a new example by a weighted linear interpolation of these two examples:
$$\hat{x} = \lambda x_i + (1 - \lambda) x_j,$$
$$\hat{y} = \lambda y_i + (1 - \lambda) y_j,$$
where $\lambda \in [0, 1]$ is a random number drawn from the $\mathrm{Beta}(\alpha, \alpha)$ distribution. In mixup training, we only use the new example $(\hat{x}, \hat{y})$.
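Mixup can be sketched as follows: a mixing weight is drawn from a symmetric Beta distribution, and both the inputs and their label vectors are interpolated with that weight. The function names are hypothetical, inputs and labels are plain lists for simplicity, and the Beta parameter passed below is only an illustrative value.

```python
import random


def sample_lambda(alpha, rng):
    """Draw the mixing weight from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mixup(x_i, y_i, x_j, y_j, lam):
    """Interpolate two examples: lam * first + (1 - lam) * second."""
    x_hat = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_hat = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_hat, y_hat


lam = sample_lambda(0.2, random.Random(0))
# Fixed weight of 0.25 so the result is easy to check by hand.
x_hat, y_hat = mixup([0.0, 4.0], [1.0, 0.0], [8.0, 0.0], [0.0, 1.0], 0.25)
```

The mixed label is a soft distribution over the two original classes, so the loss must accept non-one-hot targets.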
|Refinements||ResNet-50-D Top-1||ResNet-50-D Top-5||Inception-V3 Top-1||Inception-V3 Top-5||MobileNet Top-1||MobileNet Top-5|
|+ cosine decay||77.91||93.81||78.19||94.06||72.83||91.00|
|+ label smoothing||78.31||94.09||78.40||94.13||72.93||91.14|
|+ distill w/o mixup||78.67||94.36||78.26||94.01||71.97||90.89|
|+ mixup w/o distill||79.15||94.58||78.77||94.39||73.28||91.30|
|+ distill w/ mixup||79.29||94.63||78.34||94.16||72.51||91.02|
Now we evaluate the four training refinements. We set $\varepsilon = 0.1$ for label smoothing by following Szegedy et al. For the model distillation we use $T = 20$; specifically, a pretrained ResNet-152-D model with both cosine decay and label smoothing applied is used as the teacher. In the mixup training, we choose $\alpha = 0.2$ in the Beta distribution and increase the number of epochs from 120 to 200 because the mixed examples require a longer training process to converge better. When combining the mixup training with distillation, we train the teacher model with mixup as well.
We demonstrate that the refinements are not limited to the ResNet architecture or the ImageNet dataset. First, we train ResNet-50-D, Inception-V3 and MobileNet on the ImageNet dataset with refinements. The validation accuracies for applying these training refinements one by one are shown in Table 6. By stacking cosine decay, label smoothing and mixup, we steadily improve the ResNet, Inception-V3 and MobileNet models. Distillation works well on ResNet; however, it does not work well on Inception-V3 and MobileNet. Our interpretation is that the teacher model is not from the same family as the student, and therefore has a different distribution in its predictions, which brings a negative impact to the model.
To show that our tricks transfer to other datasets, we train a ResNet-50-D model on the MIT Places365 dataset with and without the refinements. Results are reported in Table 7. We see that the refinements improve the top-5 accuracy consistently on both the validation and test sets.
|Model||Val Top-1 Acc||Val Top-5 Acc||Test Top-1 Acc||Test Top-5 Acc|
Transfer learning is one major downstream use case of trained image classification models. In this section, we will investigate whether the improvements discussed so far can benefit transfer learning. In particular, we pick two important computer vision tasks, object detection and semantic segmentation, and evaluate their performance by varying base models.
|Refinement||Top-1||mAP|
|+ distill w/o mixup||78.67||80.96|
|+ mixup w/o distill||79.16||81.10|
|+ distill w/ mixup||79.29||81.33|
The goal of object detection is to locate bounding boxes of objects in an image. We evaluate performance using PASCAL VOC. Similar to Ren et al., we use the union of VOC 2007 trainval and VOC 2012 trainval for training, and VOC 2007 test for evaluation. We train Faster-RCNN on this dataset, with refinements from Detectron such as linear warmup and a long training schedule. The VGG-19 base model in Faster-RCNN is replaced with the various pretrained models from the previous discussion. We keep other settings the same so the gain is solely from the base models.
Mean average precision (mAP) results are reported in Table 8. We can observe that a base model with a higher validation accuracy consistently leads to a higher mAP for Faster-RCNN. In particular, the best base model with accuracy 79.29% on ImageNet leads to the best mAP at 81.33% on VOC, which outperforms the standard model by 4%.
|Refinement||Top-1||PixAcc||mIoU|
|+ distill w/o mixup||78.67||78.97||38.90|
|+ mixup w/o distill||79.16||78.47||37.99|
|+ mixup w/ distill||79.29||78.72||38.40|
Semantic segmentation predicts the category for every pixel from the input images. We use Fully Convolutional Network (FCN)  for this task and train models on the ADE20K  dataset. Following PSPNet  and Zhang et al. , we replace the base network with various pre-trained models discussed in previous sections and apply dilation network strategy [2, 28] on stage-3 and stage-4. A fully convolutional decoder is built on top of the base network to make the final prediction.
Both pixel accuracy (pixAcc) and mean intersection over union (mIoU) are reported in Table 9. In contrast to our results on object detection, the cosine learning rate schedule effectively improves the accuracy of the FCN, while other refinements provide suboptimal results. A potential explanation for this phenomenon is that semantic segmentation predicts at the pixel level. Because models trained with label smoothing, distillation and mixup favor softened labels, pixel-level information may be blurred, which degrades overall pixel-level accuracy.
In this paper, we survey a dozen tricks to train deep convolutional neural networks to improve model accuracy. These tricks introduce minor modifications to the model architecture, data preprocessing, loss function, and learning rate schedule. Our empirical results on ResNet-50, Inception-V3 and MobileNet indicate that these tricks improve model accuracy consistently. More excitingly, stacking all of them together leads to a significantly higher accuracy. In addition, these improved pre-trained models show strong advantages in transfer learning, which improve both object detection and semantic segmentation. We believe the benefits can extend to broader domains where classification base models are favored.
Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256, 2010.
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016.
Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 2017.