To obtain the best possible performance from convolutional neural networks (CNNs), the training and testing data distributions should match. However, in image recognition, data pre-processing procedures often differ between training and testing: the most popular practice at training time is to extract a rectangle with random coordinates from the image, which artificially increases the amount of training data. This Region of Classification (RoC) is then resized to obtain an image, or crop, of a fixed size (in pixels) that is fed to the CNN. At test time, the RoC is instead set to a square covering the central part of the image, which results in the extraction of a center crop. Thus, while the crops extracted at training and test time have the same size, they arise from different RoCs, which skews the distribution of data seen by the CNN.
Over the years, training and testing pre-processing procedures have evolved, but so far they have been optimized separately [Ekin2018AutoAugment]. Touvron et al. [Touvron2019FixRes] show that this separate optimization has a detrimental effect on the test-time performance of models. They address this problem with the FixRes method, which jointly optimizes the choice of resolutions and scales at training and test time, while keeping the same RoC sampling.
We apply this method to the recent EfficientNet [tan2019efficientnet] architecture, which offers an excellent compromise between the number of parameters and accuracy. This short note shows that properly combining FixRes and EfficientNet significantly improves the current state of the art [tan2019efficientnet]. In particular:
We report the best performance without external data on ImageNet (top1: 85.7%);
We report the best accuracy (top1: 88.5%) with external data on ImageNet;
We report several state-of-the-art compromises between accuracy and number of parameters, see Figure 1.
2 Training with FixRes: updates
Recent research in image classification tends towards larger networks and higher-resolution images [Yanping2018GPipe, mahajan2018exploring, Xie2019SelftrainingWN]. For instance, the state of the art on the ImageNet ILSVRC 2012 benchmark is currently held by the EfficientNet-L2 [Xie2019SelftrainingWN] architecture, with 480M parameters and 800×800 images for training. Similarly, the state-of-the-art model learned from scratch is currently EfficientNet-B8 [Xie2019AdversarialEI], with 88M parameters and 672×672 images for training. In this note, we focus on the EfficientNet architecture [tan2019efficientnet] due to its good accuracy/cost trade-off and its popularity.
Data augmentation is routinely employed at training time to improve model generalization and reduce overfitting. In this note, we use the same augmentation setup as in the original FixRes paper [Touvron2019FixRes]. The only addition is label smoothing, which we integrate to underline that the two are complementary.
FixRes is a very simple method that amounts to re-training the classifier or a few top layers at the target resolution. It therefore has several advantages: (1) it is computationally cheap, because back-propagation does not need to be performed on the whole network; (2) it can be applied to any CNN architecture and is complementary to the other tricks mentioned above; (3) it can be applied to a network that comes from an unknown, possibly closed, source and that was selected for its performance on low-resolution images.
Therefore, it is easy and natural to experiment with FixRes on the current state-of-the-art EfficientNet CNN. This is what we do in the next section.
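The idea of re-training only the classifier can be illustrated with a toy, framework-free sketch. It assumes that feature vectors have already been computed by a frozen network on images at the target resolution; the function name `finetune_head`, the learning rate, the number of epochs and the toy data are all hypothetical and serve only to show that gradients are applied to the classifier alone.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def finetune_head(features, labels, weights, biases, lr=0.1, epochs=20):
    """SGD on a linear classifier; the features themselves are never modified."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(features, labels):
            logits = [sum(wj * xj for wj, xj in zip(w, x)) + b
                      for w, b in zip(weights, biases)]
            probs = softmax(logits)
            total += -math.log(probs[y])
            for k in range(len(weights)):
                # Gradient of the cross-entropy w.r.t. logit k.
                grad = probs[k] - (1.0 if k == y else 0.0)
                for j in range(len(x)):
                    weights[k][j] -= lr * grad * x[j]
                biases[k] -= lr * grad
        losses.append(total / len(features))
    return losses


# Toy stand-in for features produced by a frozen backbone (assumed data).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights = [[0.0, 0.0], [0.0, 0.0]]
biases = [0.0, 0.0]
losses = finetune_head(features, labels, weights, biases)
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

Because only `weights` and `biases` are updated, the cost of each step is independent of the size of the network that produced the features, which is the source of advantage (1) above.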
We experiment on the ImageNet-2012 benchmark [Russakovsky2015ImageNet12], reporting validation performance as top-1 accuracy.
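For reference, top-k accuracy can be computed as in the following minimal sketch. The function name `topk_accuracy` and the toy scores are illustrative assumptions, not part of the evaluation code used for this note.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by decreasing score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: three samples, three classes (assumed data).
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(topk_accuracy(scores, labels, k=1))
print(topk_accuracy(scores, labels, k=2))
```

Top-1 accuracy corresponds to `k=1` and top-5 accuracy to `k=5`.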
In this note we use the EfficientNet [tan2019efficientnet] architecture, mainly the two versions that give the best performance: EfficientNet trained with adversarial examples [Xie2019AdversarialEI], and EfficientNet trained with Noisy Student [mahajan2018exploring], which is pre-trained in a weakly-supervised fashion on 300 million unlabeled images.
We mostly follow the FixRes [Touvron2019FixRes] training protocol. The only difference is that we combine the FixRes data-augmentation with label smoothing during the fine-tuning stage.
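Label smoothing replaces the one-hot target with a mixture of the one-hot vector and the uniform distribution over classes. The sketch below illustrates this; the smoothing strength `epsilon=0.1` is an assumed, commonly used value that is not specified in this note, and the function names are hypothetical.

```python
import math


def smooth_labels(label, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot(label) + epsilon / num_classes."""
    off = epsilon / num_classes
    return [(1.0 - epsilon) + off if c == label else off
            for c in range(num_classes)]


def cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target, logits))


target = smooth_labels(label=1, num_classes=4)
print(target)
print(cross_entropy([0.5, 2.0, -1.0, 0.0], target))
```

With a smoothed target, the loss keeps penalizing over-confident predictions even for correctly classified samples, since the target never places all of its mass on one class.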
3.3 Comparison with the state of the art
Table 1 and Table 2 compare our results with those of EfficientNet reported in the literature. All our FixEfficientNets outperform the corresponding EfficientNet (see Figure 1). As a result and to the best of our knowledge, our FixEfficientNet-L2 surpasses all other models available in the literature. It achieves 88.5% Top-1 accuracy and 98.7% Top-5 accuracy on the ImageNet-2012 validation benchmark [Russakovsky2015ImageNet12].
FixRes is a method that can improve the performance of any model. Because it is applied after conventional training, it is very flexible and can easily be integrated into any existing training pipeline. For example, the authors of [Xie2019SelftrainingWN] use FixRes to obtain their best performance, although that work is no longer state of the art on ImageNet.