Official implementation of "Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network"
Recent studies in image classification have demonstrated a variety of techniques for improving the performance of Convolutional Neural Networks (CNNs). However, attempts to combine existing techniques to create a practical model are still uncommon. In this study, we carry out extensive experiments to validate that carefully assembling these techniques and applying them to a basic CNN model in combination can improve the accuracy and robustness of the model while minimizing the loss of throughput. For example, our proposed ResNet-50 shows an improvement in top-1 accuracy from 76.3% to 82.78% and an mCE improvement from 76.0% to 48.89%. With these improvements, inference throughput only decreases from 536 to 312 images/sec. The resulting model significantly outperforms state-of-the-art models with similar accuracy in terms of mCE and inference throughput. To verify the performance improvement in transfer learning, fine-grained classification and image retrieval tasks were tested on several open datasets and showed that the improvement to backbone network performance boosted transfer learning performance significantly. Our approach achieved 1st place in the iFood Competition Fine-Grained Visual Recognition at CVPR 2019, and the source code and trained models are available at https://github.com/clovaai/assembled-cnn
Since the introduction of AlexNet [krizhevsky2012imagenet], many studies have mainly focused on designing new network architectures for image classification to increase accuracy. For example, new architectures such as Inception [szegedy2015going], ResNet [he2016deep], DenseNet [huang2017densely], NASNet [zoph2018learning], MNASNet [tan2019mnasnet] and EfficientNet [tan2019efficientnet] have been proposed. Inception introduced new modules into the network with convolution layers of different kernel sizes. ResNet utilized the concept of skip connections, and DenseNet added dense feature connections to boost the performance of the model. In addition, in the area of AutoML, network designs were decided automatically to create models such as NASNet and MNASNet. EfficientNet proposed an efficient network by balancing the resolution, depth, and width of the network. As a result, the ImageNet top-1 accuracy of EfficientNet greatly improved relative to AlexNet.
|Model||Top-1 (%)||mCE (%)||Throughput (images/sec)|
|EfficientNet B4 [tan2019efficientnet]+AutoAugment [cubuk2018autoaugment]||83.0||60.7||95|
|EfficientNet B6 [tan2019efficientnet]+AutoAugment [cubuk2018autoaugment]||84.2||60.6||28|
|EfficientNet B7 [tan2019efficientnet]+AutoAugment [cubuk2018autoaugment]||84.5||59.4||16|
|ResNet-50 [he2016deep] (baseline)||76.3||76.0||536|
Unlike these studies, which focus on designing new network architectures, He et al. [he2019bag] proposed a different approach to improving model performance. They noted that performance can be improved not only through changes in the model structure, but also through other aspects of network training such as data preprocessing, learning rate decay, and parameter initialization. They also demonstrated that these minor "tricks" play a major part in boosting model performance when applied in combination. Using these tricks, the ImageNet validation top-1 accuracy of ResNet-50 improved from 75.3% to 79.29%. This improvement is highly significant, comparable to the gains obtained from improvements in network design. Thus, combining these existing techniques is of critical importance.
Inspired by their work, we conducted a more extensive and systematic study of assembling several CNN-related techniques into a single network. Considering the many techniques that have been introduced, we first divided them into two categories: network tweaks and regularization. Network tweaks are methods that modify CNN architectures to be more efficient; a representative example is SENet [hu2018squeeze]. Regularization refers to methods that prevent overfitting, either by increasing the training data through data augmentation processes such as AutoAugment [cubuk2018autoaugment] and Mixup [zhang2017mixup], or by limiting the complexity of the CNN with processes such as Dropout [srivastava2014dropout] and DropBlock [ghiasi2018dropblock]. We systematically analyze the process of assembling these two types of techniques through extensive experiments and demonstrate that our approach leads to significant performance improvements.
In addition to top-1 accuracy, mCE and throughput were used as performance indicators for combining these various techniques. Hendrycks et al. [hendrycks2019benchmarking] proposed mCE (mean corruption error), which is a measure of network robustness against input image corruption. Moreover, we used throughput (images/sec) instead of the commonly used measurement of FLOPS (floating point operations per second) because we observed that FLOPS is not proportional to the inference speed of the actual GPU device. Detailed experiments on the discrepancy between the FLOPS and the throughput of GPU devices are included in the Appendix A.
Our contributions can be summarized as follows:
By organizing the existing CNN-related techniques for image classification, we find techniques that can be assembled into a single CNN. We then demonstrate that our resulting model surpasses the state-of-the-art models with similar accuracy in terms of mCE and inference throughput (Table 1).
We provide detailed experimental results for the process of assembling CNN techniques and release the code for accessibility and reproducibility.
Before introducing our approach, we describe the default experimental settings and evaluation metrics used in Sections 3 and 4.
We use the official TensorFlow [abadi2016tensorflow] ResNet implementation (https://github.com/tensorflow/models) as base code. We use the ImageNet ILSVRC-2012 [russakovsky2015imagenet] dataset, which has 1.3M training images and 1,000 classes. All models were trained on a single machine with 8 Nvidia Tesla P40 GPUs compatible with the CUDA 10 platform and cuDNN 7.6. TensorFlow version 1.14.0 was used.
The techniques proposed by He et al. [he2019bag] are applied to all our models described in Section 3. We briefly describe the default hyperparameters and training techniques as follows.
In the training phase, a rectangular region is randomly cropped using an aspect ratio randomly sampled from 3/4 to 4/3, with the fraction of the cropped area over the whole image randomly chosen from 5% to 100%. The cropped region is then resized to a 224x224 square image and flipped horizontally with a probability of 0.5. During validation, we first resize the shorter side of each image to 256 pixels while maintaining the aspect ratio. Next, we center crop the image to 224x224 and normalize the RGB channels, identically to training.
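The crop-sampling policy above can be sketched as follows (a minimal sketch, not the released TensorFlow implementation; `sample_crop` is a hypothetical helper):

```python
import random

def sample_crop(height, width, area_frac=(0.05, 1.0), ratio=(3 / 4, 4 / 3)):
    """Sample a crop (h, w): aspect ratio drawn from [3/4, 4/3],
    crop area drawn from 5%-100% of the image area."""
    area = height * width
    for _ in range(10):  # retry if the sampled box does not fit
        target_area = random.uniform(*area_frac) * area
        aspect = random.uniform(*ratio)
        w = int(round((target_area * aspect) ** 0.5))
        h = int(round((target_area / aspect) ** 0.5))
        if 0 < w <= width and 0 < h <= height:
            return h, w
    return height, width  # fall back to the full image
```

The sampled region would then be resized to 224x224 and randomly flipped.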
We use a batch size of 1,024 for training. In our study, this is close to the maximum size that fits on a single machine with 8 P40 GPUs. The initial learning rate is 0.4 and weight decay is set to 0.0001. The default number of training epochs is 120, but some techniques require a different number of epochs, which is given explicitly when necessary. Stochastic gradient descent with momentum 0.9 is used as the optimizer.
Learning rate warmup With a large batch size, using a high learning rate from the start may cause numerical instability. Goyal et al. [goyal2017accurate] propose a warmup strategy that linearly increases the learning rate from 0 to the initial learning rate over a warm-up period set to the first 5 epochs.
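A minimal sketch of this schedule (the function name is ours):

```python
def warmup_lr(step, steps_per_epoch, base_lr=0.4, warmup_epochs=5):
    """Linearly ramp the learning rate from 0 to base_lr over the
    first `warmup_epochs` epochs; return base_lr afterwards."""
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
```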
Zero γ We initialize γ = 0 for all batch-norm layers that sit at the end of a residual block. As a result, every residual block returns only its shortcut branch in the early stage of training. This effectively shrinks the network at initialization, making it easier to train.
Mixed-precision floating point We only use mixed-precision floating point in the training phase because mixed-precision accelerates the overall training speed if the GPU supports it [micikevicius2017mixed]. In our study, training speed for mixed-precision is 1.2 times faster than FP32 on an Nvidia P40 GPU, and is twice as fast as FP32 on an Nvidia V100 GPU. However, this does not result in improvement in top-1 accuracy.
Cosine learning rate decay (cosine) The cosine decay [loshchilov2016sgdr] decreases the learning rate following a cosine curve from the initial value down to zero: slowly at the beginning, faster in the middle of training, and slowly again at the end.
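A minimal sketch of the decay (without warm restarts, and ignoring the warmup phase):

```python
import math

def cosine_lr(step, total_steps, base_lr=0.4):
    """Cosine learning-rate decay: follows half a cosine period from
    base_lr at step 0 down to 0 at the final step."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```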
The selection of metrics used to measure the performance of the model is important because it indicates the direction in which the model is developed. We use the following three metrics as key indicators of model performance.
Top-1 accuracy is a measure of classification accuracy on the ImageNet ILSVRC-2012 [russakovsky2015imagenet] validation dataset, which consists of 50,000 images from 1,000 classes.
Throughput is defined as how many images are processed per second on the GPU device. We measured inference throughput on a single Nvidia P40 GPU. For comparison with other models, we used FP32 instead of FP16 in our experiments, with a batch size of 64.
The mean corruption error (mCE) was proposed by Hendrycks et al. [hendrycks2019benchmarking] to measure the performance of the classification model on corrupted images.
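As a sketch of the metric (assuming per-severity error rates are already measured; the function name is ours): each corruption's error is summed over severity levels and normalized by AlexNet's error on the same corruption, and mCE averages these normalized errors over all corruption types.

```python
def mce(model_err, alexnet_err):
    """mean Corruption Error (sketch). Both arguments map a corruption
    name to a list of error rates, one per severity level. Normalizing
    by AlexNet's errors makes the score comparable across corruptions."""
    ces = []
    for c in model_err:
        ces.append(sum(model_err[c]) / sum(alexnet_err[c]))
    return 100.0 * sum(ces) / len(ces)
```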
In this section, we introduce various network tweaks and regularization techniques to be assembled, and describe the details of the implementation. We also perform preliminary experiments to study the effect of different parameter choices.
Figure 1 shows the overall flow of our final ResNet-50 model. Various network tweaks were applied to vanilla ResNet. The network tweaks we used are as follows.
ResNet-D is a minor adjustment to the vanilla ResNet architecture proposed by He et al. [he2019bag]. It is known to work well in practice and has little impact on computational cost [he2019bag]. Three changes were made to the ResNet model, as illustrated in Figure 2. First, the stride sizes of the first two convolutions have been switched (blue in Figure 2(b)). Second, a 2x2 average pooling layer with a stride of 2 has been added before the convolution (green). Last, the large 7x7 convolution in the stem layer has been replaced with three smaller 3x3 convolutions (red).
We examine two tweaks in relation to channel attention. First, the Squeeze-and-Excitation (SE) network [hu2018squeeze] focuses on enhancing the representational capacity of the network by modeling channel-wise relationships. SE collapses spatial information via global pooling to obtain channel statistics, and two fully connected layers in this module then learn the correlation between channels. Second, the Selective Kernel (SK) unit [li2019selective] is inspired by the fact that the receptive field sizes of neurons in the human visual cortex differ from each other. An SK unit has multiple branches with different kernel sizes, and all branches are fused using softmax attention.
The original SK generates multiple paths with 3x3 and 5x5 convolutions, but we instead use two 3x3 convolutions to split the given feature map. This is because two convolutions of the same kernel size can be replaced by a convolution with twice as many channels, thereby lowering the inference cost. Figure 3 shows an SK unit that replaces two branches with one convolution operation.
|Config||Model||Kernels||Reduction ratio||Top-1||Throughput|
|C2||R50+SK||3x3 + 5x5||2||78.00||326|
Table 2 shows the results for different configurations of channel attention. Compared with SK, SE has higher throughput but lower accuracy (C1 in Table 2). Comparing R50+SK (C3) to R50+SK with 3x3 and 5x5 kernels (C2), the top-1 accuracy only differs by 0.08% (78.00 and 77.92), but the throughput is significantly different (326 and 382). Considering the trade-off between accuracy and throughput, we used one 3x3 kernel with doubled channel size instead of 3x3 and 5x5 kernels. Comparing C3 and C4, we see that changing the setting of reduction ratio for SK units from 2 to 16 yields a large degradation of top-1 accuracy relative to throughput improvement. Applying both SE and SK (C5) not only decreases accuracy by 0.42 (from 77.92 to 77.50), but also decreases inference throughput by 37 (from 382 to 345). For a better trade-off between top-1 accuracy and throughput, R50+SK is preferred.
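The softmax-attention fusion of two SK branches can be sketched as follows (a minimal NumPy sketch; the weight matrices `w_reduce` and `w_expand` are hypothetical stand-ins for the two fully connected layers). In our modified unit, both branch outputs would come from a single 3x3 convolution with doubled channels, split in two, but the fusion step is unchanged:

```python
import numpy as np

def sk_fuse(u1, u2, w_reduce, w_expand):
    """Fuse two SK branch outputs u1, u2 of shape (N, H, W, C) with
    softmax attention over the branch dimension."""
    s = (u1 + u2).mean(axis=(1, 2))             # squeeze: (N, C)
    z = np.maximum(s @ w_reduce, 0.0)           # reduce with ReLU: (N, C // r)
    logits = z @ w_expand                       # expand: (N, 2 * C)
    a = logits.reshape(-1, 2, u1.shape[-1])     # per-branch channel logits
    a = a - a.max(axis=1, keepdims=True)        # stabilize softmax
    a = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)
    return a[:, 0, None, None, :] * u1 + a[:, 1, None, None, :] * u2
```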
CNN models for image classification are known to be very vulnerable to small amounts of distortion [xie2019adversarial]. Zhang et al. [zhang2019shiftinvar] propose anti-aliasing (AA) to improve the shift-equivariance of deep networks. Max-pooling is commonly viewed as a competing downsampling strategy and is inherently composed of two operations: the first densely evaluates the max operator, and the second performs naive subsampling [zhang2019shiftinvar]. AA inserts a low-pass filter between them to achieve practical anti-aliasing, and the same idea applies to any existing strided layer such as a strided convolution.
In the original paper, AA applies to max-pooling, projection-conv, and strided-conv of ResNet. In addition, the smoothing factor can be adjusted by changing the blur kernel filter size, where a larger filter size results in increased blur. Table 3 shows the experimental results for AA. We observed that reducing the filter size from 5 to 3 maintains the top-1 accuracy while increasing inference throughput (A1,2 in Table 3). However, removing the AA applied to the projection-conv does not affect the accuracy (A3). We also observe that applying AA to max-pooling degrades throughput significantly (A1-3). Finally, we apply AA only to strided-conv in our model (Green in Figure 1).
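The low-pass filtering idea can be sketched in one dimension (a minimal NumPy sketch using a binomial blur kernel of filter size 3; the actual implementation blurs 2-D feature maps per channel before subsampling):

```python
import numpy as np

def blur_pool_1d(x, stride=2):
    """Anti-aliased downsampling: low-pass filter with the binomial
    kernel [1, 2, 1] / 4, then subsample, instead of naive strided
    subsampling."""
    kernel = np.array([1.0, 2.0, 1.0]) / 4.0
    padded = np.pad(x, 1, mode="edge")
    blurred = np.convolve(padded, kernel, mode="valid")
    return blurred[::stride]
```

A step edge in the input is smoothed before subsampling, which is what makes the downsampled signal less sensitive to small input shifts.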
BigLittleNet [chen2018big] applies multiple branches with different resolutions, aiming to reduce computational cost and increase accuracy. The Big-Branch has the same structure as the baseline model and operates at a low image resolution, whereas the Little-Branch has fewer convolutional layers and operates at the same image resolution as the baseline model.
BigLittleNet has two hyperparameters, α and β, which adjust the width and depth of the Little-Branch, respectively. We use different settings of α and β for ResNet-50 and ResNet-152. The small left branch in Figure 1 represents the Little-Branch, which has one residual block and is narrower than the main Big-Branch.
AutoAugment (Autoaug) [cubuk2018autoaugment] is a data augmentation procedure that learns augmentation strategies from data. It uses reinforcement learning to choose a sequence of image augmentation operations with the best accuracy by searching a discrete space of operation probabilities and magnitudes. We borrow the augmentation policy found by Autoaug on ImageNet ILSVRC-2012 (https://github.com/tensorflow/models/tree/master/research/autoaugment).
Mixup [zhang2017mixup] creates one example by interpolating two examples of the training set for data augmentation. Neural networks are known to memorize training data rather than generalizing from it [zhang2016understanding]. As a result, a neural network produces unexpected outputs when it encounters data that differs from the training distribution. Mixup mitigates this problem by showing the network interpolated examples that fill the empty space of the training distribution's feature space.
Mixup has two types of implementation. The first type uses two mini-batches to create a mixed mini-batch; this is the implementation suggested in the original paper [zhang2017mixup]. The second type uses a single mini-batch to create the mixed mini-batch by mixing the mini-batch with a shuffled clone of itself. The second type uses fewer CPU resources because only one mini-batch needs to be preprocessed per mixed mini-batch. However, experiments show that the second type reduces top-1 accuracy (Table 4). Therefore, in later experiments, we use the first type of implementation. We set the Mixup hyperparameter α to 0.2.
|Model||Technique||Top-1|
|R50D||+ Mixup (type=2)||78.85|
|R50D (E3)||+ Mixup (type=1)||79.10|
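The two implementation types can be sketched as follows (a minimal NumPy sketch; labels are assumed to be one-hot):

```python
import numpy as np

def mixup_two_batches(x1, y1, x2, y2, alpha=0.2):
    """Type 1: mix two independently drawn mini-batches, as in the
    original Mixup paper."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def mixup_single_batch(x, y, alpha=0.2):
    """Type 2: mix a mini-batch with a shuffled clone of itself;
    cheaper to preprocess, but slightly worse top-1 in our experiments."""
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]
```

The CPU-cost difference follows from type 2 needing only one preprocessed mini-batch per training step.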
Dropout [srivastava2014dropout] is a popular technique for regularizing deep neural networks. It prevents the network from over-fitting the training set by dropping neurons at random. However, Dropout does not work well for extremely deep networks such as ResNet [ghiasi2018dropblock]. DropBlock [ghiasi2018dropblock] can remove specific semantic information by dropping a contiguous region of activations, which makes it effective for regularizing very deep networks. We borrow the DropBlock settings used in the original paper. We apply DropBlock to Stages 3 and 4 for ResNet-50 and linearly decay the keep-probability hyperparameter from 1.0 to 0.9 during training.
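The linear keep-probability decay can be sketched as (the function name is ours):

```python
def dropblock_keep_prob(step, total_steps, start=1.0, end=0.9):
    """Linearly decay the DropBlock keep probability during training,
    from 1.0 (no drops) at the start to 0.9 at the final step."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```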
In the classification problem, class labels are expressed as one-hot encodings. If a CNN is trained to minimize cross entropy against this hard one-hot target, the logits of its last fully connected layer grow toward infinity, which leads to over-fitting [he2019bag]. Label smoothing [pereyra2017regularizing] suppresses unbounded outputs and prevents over-fitting. We set the label smoothing factor to 0.1.
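The smoothing operation itself is simple (a minimal NumPy sketch using the factor 0.1 from above):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: put (1 - eps) mass on the true class and spread
    eps uniformly over all K classes, so the optimal logits stay finite."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k
```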
Knowledge Distillation (KD) [hinton2015distilling] is a technique for transferring knowledge from one neural network (teacher) to another (student). Teacher models are often complex but cumbersome models with high accuracy, and a weaker but lighter student model can improve its own accuracy by mimicking the teacher. The temperature hyperparameter T of KD was reported to work best at higher values in the original paper [hinton2015distilling], but we use T = 1 for our model. Because our model uses Mixup and KD together, Mixup should also be applied to the teacher network. This leads to better performance at lower temperatures because the teacher's signal is already smoothed by Mixup (Table 5). We used AmoebaNet-A, with 83.9% ImageNet validation top-1 accuracy, as the teacher.
|Model||Technique||Top-1|
|R50D+SK||+ KD (T=2)||81.47|
|R50D+SK||+ KD (T=1.5)||81.50|
|R50D+SK (E7)||+ KD (T=1)||81.69|
|Model||Augmentation||Resolution||Top-1||mCE||Throughput|
|EfficientNet B0 [tan2019efficientnet]||Autoaug||224||77.3||70.7||510|
|EfficientNet B1 [tan2019efficientnet]||Autoaug||240||79.2||65.1||352|
|EfficientNet B2 [tan2019efficientnet]||Autoaug||260||80.3||64.1||279|
|EfficientNet B3 [tan2019efficientnet]||Autoaug||300||81.7||62.9||182|
|EfficientNet B4 [tan2019efficientnet]||Autoaug||380||83.0||60.7||95|
|EfficientNet B5 [tan2019efficientnet]||Autoaug||456||83.7||62.3||49|
|EfficientNet B6 [tan2019efficientnet]||Autoaug||528||84.2||60.6||28|
|EfficientNet B7 [tan2019efficientnet]||Autoaug||600||84.5||59.4||16|
Adding ResNet-D to the baseline model improves top-1 accuracy by 0.5% (from 76.87 to 77.37) (M1 in Table 6), and adding the SK tweak improves accuracy by a further 1.46% (from 77.37 to 78.83) (M2). In Table 2, we show that accuracy increases by 1.62% when SK is independently applied to ResNet (from 76.30 to 77.92). Stacking ResNet-D and SK increases top-1 accuracy by almost the sum of the gains from applying ResNet-D and SK separately, showing that the two tweaks improve performance independently with little effect on each other. Applying BL to R50D+SK improves top-1 accuracy by 0.44% (from 78.83 to 79.27) (M3). To achieve higher accuracy while maintaining throughput similar to that of R50D+SK, we use a 256x256 image resolution for inference, whereas we use 224x224 for training. Applying AA to R50D+SK+BL improves top-1 accuracy by 0.12% (from 79.27 to 79.39) and decreases throughput by 47 (from 359 to 312) (M4). Because AA is a network structure designed for robustness to image distortion, top-1 accuracy alone does not reliably capture its effect. The effect of AA is further shown in the next section, where mCE is introduced to evaluate models.
The ablation study described in Table 7 shows the impact of assembling the techniques described in Section 3.2. We stack the network tweaks and regularizations alternately to balance the performance effects.
The regularization techniques improve both accuracy and mCE, with the improvement in mCE being greater than the improvement in accuracy (E2,3,5,7,11). For example, applying Mixup, DropBlock, KD, and Autoaug individually improves top-1/mCE by 0.75%/6.08%, 0.69%/1.84%, 0.29%/1.26%, and 0.09%/4.14%, respectively. This shows that regularization helps make CNNs more robust to image distortions.
Adding SE improves top-1 accuracy by 0.61% and mCE by 3.71% (E4). That SE has a greater effect on mCE than on top-1 accuracy mirrors the result for the regularization techniques. We confirm that channel attention is also helpful for robustness to image distortion.
Replacing SE with SK improves performance by 1.0% and 4.3% for top-1 and mCE, respectively (E6). In Table 2, when SE is changed to SK without regularization, accuracy increases by 0.5%. Thus, replacing SE with SK under regularization yields nearly double the accuracy improvement seen without regularization, which suggests that SK is more complementary to regularization techniques than SE.
Increasing the number of epochs from 270 to 600 improves performance (E8). Because data augmentation and regularization are stacked, the overall regularization effect is stronger, so longer training seems to yield better generalization. BL shows a large performance improvement not only on top-1 but also on mCE, without any loss of inference throughput (E9). AA also shows a higher performance gain in mCE relative to top-1 (E10), consistent with AA being a network tweak that makes the CNN robust to image translations, as claimed in [zhang2019shiftinvar].
The model assembled from all the techniques described so far has a top-1 accuracy of 82.78% and an mCE of 48.89%. This final model is listed in Table 7 as E11, and we call it Assemble-ResNet-50. For comparison, we also experiment with ResNet-152 (E12), which we call Assemble-ResNet-152.
|Food-101||EfficientNet B7 [tan2019efficientnet]||93.0||87.8||87.0||92.5|
|Stanford Cars||EfficientNet B7 [tan2019efficientnet]||94.7||91.7||89.1||94.4|
|Oxford-Flowers||EfficientNet B7 [tan2019efficientnet]||98.8||97.5||96.1||98.9|
|FGVC Aircraft||EfficientNet B7 [tan2019efficientnet]||92.9||86.6||78.8||92.4|
|Oxford-IIIT Pets||AmoebaNet-B [huang2019gpipe]||95.9||91.5||92.5||94.3|
In this section, we investigate whether the improvements discussed so far help with transfer learning. First, we analyze the contribution of each technique to transfer learning. To do this, we performed an ablation study on the Food-101 [bossard2014food] dataset, which is the largest public fine-grained visual classification (FGVC) dataset. The basic experiment setup and the hyperparameters that differ from backbone training are:
The initial learning rate is reduced relative to backbone training.
Weight decay is set to 0.001.
Momentum for BN is set differently from backbone training.
The keep probability of DropBlock starts high and decreases linearly to its final value by the end of training.
The training epoch is set differently for each dataset and is indicated in Appendix B.
As shown in Table 8, stacking network tweaks and regularization techniques steadily improves both top-1 accuracy and mCE for the transfer learning task on the Food-101 dataset. In particular, comparing experiments F4-F8 with experiments F9-F13 (in Table 8) shows the effect of regularization on the backbone. We use the same network structure in F4-F13, but for F9-F13 the backbone was trained with regularization such as Mixup, DropBlock, KD, and Autoaug. This regularization of the backbone improves top-1 accuracy, as expected. The mCE results, however, differ from the top-1 accuracy results. Without regularization during fine-tuning, as in F4 and F9, the backbone with regularization yields better mCE than the backbone without regularization. However, adding regularization during fine-tuning narrows the mCE gap (F5-8 and F10-13). For convenience, we call the final F13 model in Table 8 Assemble-ResNet-FGVC-50.
We also evaluated Assemble-ResNet-FGVC-50 from Table 8 on the following datasets: Stanford Cars [krause20133d], CUB-200-2011 [wah2011caltech], Oxford 102 Flowers [nilsback2008automated], Oxford-IIIT Pets [parkhi2012cats], FGVC-Aircraft [maji2013fine], and Food-101 [bossard2014food]. The statistics for each dataset are shown in Appendix C. We borrow the training settings from Kornblith et al. [kornblith2019better] and fine-tuned on the new datasets from the Assemble-ResNet-50 ImageNet checkpoint.
Table 9 shows the transfer learning performance compared to EfficientNet [tan2019efficientnet] and AmoebaNet-B [huang2019gpipe], which are state-of-the-art models for image classification. Our Assemble-ResNet-FGVC-50 model achieves comparable accuracy with 20x faster inference throughput than these state-of-the-art models.
|Config||Model||Regularization||Recall@1|
|S7||R50D+SK + REG||-||85.2|
|S8||R50D+SK + REG||DropBlock||85.9|
|S9||R50D+SK + REG||DropBlock+Autoaug||84.0|
We also conducted an ablation study on three standard fine-grained image retrieval (IR) datasets: Stanford Online Products (SOP) [song2016deep], CUB200 [wah2011caltech] and CARS196 [krause20133d]. We borrow the zero-shot data split protocol from [song2016deep].
The basic experiment setup and hyperparameters are as follows.
Image preprocessing: with probability 0.5, resize to 224x224 without maintaining the aspect ratio; otherwise, resize to 256x256 and randomly crop to 224x224.
Data augmentation includes random flip with 0.5 probability.
Momentum for BN is set differently from backbone training.
Weight decay is set to 0.0005.
The training epochs, batch size, learning rate decay, and assembling configuration are set differently for each dataset; we describe the settings in Appendix D.
On top of that, cosine-softmax-based losses were used for image retrieval. In this work, we use the ArcFace [deng2019arcface] loss with a margin of 0.3 and generalized mean pooling (GeM) [radenovic2018fine] as the pooling method, without performing downsampling at Stage 4 of the backbone network, because this yields better performance for the image retrieval task.
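GeM pooling can be sketched as follows (a minimal NumPy sketch; p = 3 is a common default for GeM, not necessarily the setting used here):

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized mean pooling over the spatial dimensions of a
    feature map x of shape (N, H, W, C). p = 1 recovers average
    pooling, and p -> infinity approaches max pooling."""
    clipped = np.clip(x, eps, None)  # keep activations positive
    return (clipped ** p).mean(axis=(1, 2)) ** (1.0 / p)
```

Interpolating between average and max pooling via p is what makes GeM attractive for retrieval descriptors.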
In the case of SOP, the effect of each technique was examined in an ablation study, with the results listed in Table 10. In contrast to our results on FGVC, the combination of network tweaks and regularization that worked well on the SOP dataset differed from that for the FGVC datasets. Comparing S2-4, we see that BL and AA did not work well on the SOP dataset. Of the regularizers, DropBlock works well, but Autoaug does not improve recall@1 (S2 and S5,6). Nevertheless, with the best configuration, there was a significant performance improvement of 3.0% over the baseline ResNet-50.
The recall at 1 (recall@1) for the image retrieval datasets is reported in Table 11. There are also significant performance improvements on the CUB200 and CARS196 datasets.
In this paper, we show that assembling various CNN techniques into a single convolutional network improves top-1 accuracy and mCE on the ImageNet ILSVRC-2012 validation dataset. Synergistic effects are achieved by using a variety of network tweaks and regularization techniques together in a single network. Our approach also improves performance consistently in transfer learning tasks such as FGVC and image retrieval. More excitingly, our network is not frozen but still evolving, and can be further developed with future research. For example, we are already planning to reassemble recently published techniques such as AugMix [hendrycks2019augmix] and ECA-Net [wang2019ecanet]. We expect further improvements from replacing the vanilla ResNet backbone with a more powerful one such as EfficientNet [tan2019efficientnet], which we leave as future work.
We observe in several experiments that FLOPS is not proportional to the inference speed of the actual GPU.
We use the same hyperparameters for all datasets as far as possible for transfer learning. The parameter settings that differ for each dataset are described in Table 13.
|FGVC Dataset||Training Epochs|
|Dataset||Train Size||Test Size||# Classes|
IR uses a different regularization for each dataset.
We use the same hyperparameters for all datasets as far as possible for transfer learning. The parameters set differently for each dataset are described in Table 16.