Temporally Resolution Decrement: Utilizing the Shape Consistency for Higher Computational Efficiency

12/02/2021
by   Tianshu Xie, et al.
NetEase, Inc

Image resolution, which is closely related to both accuracy and computational cost, plays a pivotal role in network training. In this paper, we observe that a reduced image retains relatively complete shape semantics but loses extensive texture information. Inspired by the consistency of shape semantics and the fragility of texture information, we propose a novel training strategy named Temporally Resolution Decrement, wherein we randomly reduce the training images to a smaller resolution in the time domain. During the alternate training with the reduced images and the original images, the unstable texture information in the images results in a weaker correlation between texture-related patterns and the correct label, naturally enforcing the model to rely more on shape properties, which are robust and conform to the human decision rule. Surprisingly, our approach greatly improves the computational efficiency of convolutional neural networks. On ImageNet classification, using only 33% of the training computation (randomly reducing the training images to 112×112 within 90% of epochs) still improves ResNet-50 from 76.32% to 77.71%, and reducing the training images to 112×112 within 50% of epochs improves the accuracy to 78.18%.


1 Introduction

Looking back on the last decade, convolutional neural networks (CNNs) have enabled tremendous leaps of progress in computer vision, as reflected by super-human-level performance on various benchmarks such as the ILSVRC'2012 challenge [krizhevsky2012imagenet] and the COCO detection competition [2014Microsoft]. However, these transformational advances usually come with complex network designs of increasing depth or width, which translate into high computational cost and memory footprint. In this regard, how to maintain or even improve CNNs' performance while reducing the computational cost is undoubtedly research worth pursuing, but it places higher demands on our understanding of CNNs.

Image resolution is an important factor in training CNNs, as it directly relates to both the accuracy and the computational cost of neural networks. Many efforts have been made on adjusting image resolution [yang2020resolution, touvron2019fixing, wang2020resolution, elad2019mix, huang2018multiscale]. For example, to address the discrepancy between the size of the objects during training and testing, Touvron et al. [touvron2019fixing] employ different train and test resolutions via a simple parameter adaptation. Another line of research [wang2020glance, ming2021dynamic, huang2018multiscale, Wang_2019_CVPR, Guo_2020_CVPR] stems from the redundancy existing in images, based on the observation that a large portion of 'easy' samples can already be classified correctly at low resolution or with a small network. For example, DRNet [ming2021dynamic] proposes a resolution predictor that resizes each input image to its optimal resolution. GFNet [wang2020glance] presents a two-stage processing framework, where images are progressively processed and the computation terminates once they can be correctly classified with high confidence.

In this paper, we focus on a novel perspective: the consistency of shape semantics. As shown in the left part of Figure 1, a large letter 'H' (shape) is rendered in small copies of other lowercase letters 'a' and 'b' (texture). When we downsample this image by keeping one pixel out of every two, the texture information is left with only 'a', but, more importantly, the shape semantics remains relatively complete (we can still recognize 'H' in the reduced image). This provides a clear indication that, compared with the changeability and fragility of texture information, shape semantics enjoys a consistency property: it remains stable under resolution change.

Figure 1: Left: The change of texture information and shape semantics when the image is reduced. The reduced image loses large quantities of texture information (a and b), but the shape semantics (H) remains relatively complete. Right: Illustration of the proposed framework. The correlation between texture-related patterns and the correct label becomes spurious, since the texture-related patterns fail to hold in the same manner as they held in the past, enabling the model to focus on the more stable and invariant shape-related patterns.

Based on the above intuition, we propose a novel training strategy: Temporally Resolution Decrement (TRD). The whole algorithm is surprisingly simple: we randomly select some epochs in which to train the network with low-resolution inputs. As discussed earlier, the reduced images not only improve computational efficiency but also help the model integrate spatial shape semantics. Different from other adaptive inference methods that either require modifications to model architectures or demand extra fine-tuning, our method incurs merely one negligible change to the model: replacing the global average pooling layer with adaptive average pooling. The whole method can be implemented in a few lines of code and directly integrated into almost any CNN with ease.

It is worth noting that the proposed framework ensures the unity of input resolution in the time domain, i.e., in each training epoch the model is fed with images of the same resolution. Each epoch of training can be viewed as one cognitive process of the model. As shown in the right part of Figure 1, during the iterative cognitive process, the changeability of the texture information undermines the correlation between texture-related patterns and the correct label. Consequently, the model has to focus on the invariant shape-related patterns, allowing itself to generalize across the two input resolutions. Put another way, this epoch-wise operation drives the model to rely on the robust shape-based representation. This also differentiates our method from batch-wise methods such as MixSize [elad2019mix], where the input resolution can vary at each optimization step.

We evaluate the performance of our method on ImageNet with the vanilla ResNet-50, which typically serves as the default architecture in numerous studies. Surprisingly, when saving 77% of the computational overhead (randomly reducing the training images to 84×84 within 90% of epochs), our method still improves ResNet-50 from 76.32% to 77.26%. When saving 43% of the computational overhead (randomly reducing the training images to 84×84 within 50% of epochs), the accuracy can be further improved to 78.18%.

Besides the lower computational cost and the improved performance, our method is also appealing for its generalization across varying image resolutions. A model trained with our method not only performs well at the original scale, but also shows remarkable improvements on images with a wide range of resolutions at test time. It is important to note that although CNNs are not explicitly designed to handle varying image resolutions, fields such as detection or self-driving cars, where objects do not appear at a fixed size, may benefit from models with this ability. On the other hand, part of the superiority of the human vision system lies in its robustness to resolution change: children learn about many things in the world through picture books, yet they grow up without being confused by the varying sizes of objects.

In summary, we make the following principle contributions in this paper:

  • Inspired by the consistency of shape semantics during image zooming, we propose TRD, an effective yet simple plug-in method suitable for almost any CNN-based model.

  • Albeit simple, our method greatly improves computational efficiency. It breaks through the traditional computation-vs-accuracy trade-off view of deep learning, namely that high computational cost is required to achieve competitive performance, creating a win-win situation in which lower computational cost and improved accuracy are obtained simultaneously without cautious balancing.

  • Our method demonstrates strong generalization across varying image resolutions, mirroring the human vision system, which is seldom fooled by changes in image size.

2 Related Works

Model acceleration techniques: There is a large volume of studies describing ways to speed up the inference of deep networks. The most common direction is to design lightweight and efficient network architectures, such as MobileNets [howard2017mobilenets, sandler2018mobilenetv2, howard2019searching], CondenseNet [Huang_2018_CVPR], and ShuffleNet [Zhang_2018_CVPR]. A number of recent works focus on pruning redundant network connections [li2016pruning, liu2017learning, wang2021accelerate, luo2017thinet, tang2021manifold, Guo_2020_CVPR]. Since deep networks typically possess many redundant weights, other works quantize the weights in the hope of faster inference [Jacob_2018_CVPR, hubara2016binarized].

Image Resolution: Recently, an increasing amount of literature has focused on adjusting image resolution, an important factor that directly determines the computational cost and performance of CNNs. Much of the available literature [wang2020glance, ming2021dynamic, huang2018multiscale, Wang_2019_CVPR] deals with the question of spatial redundancy, i.e., that not all regions in an image are task-relevant. Wang et al. [wang2020glance] propose to process a sequence of relatively small inputs, selected from the original image with reinforcement learning, instead of the original image, to perform efficient image classification. Zhu et al. [ming2021dynamic] introduce a resolution predictor, explored and optimized jointly with the desired network during training, to calculate the expected resolution of each input image. Yang et al. [yang2020resolution] propose RANet, which is composed of sub-networks with different input resolutions and reduces computational cost by avoiding unnecessary high-resolution computation when samples can be predicted at low resolution. Huang et al. [huang2018multiscale] train a cascade of intermediate classifiers throughout the network, which are adaptively applied at test time to maintain coarse- and fine-level features.

Touvron et al. [touvron2019fixing] propose a training strategy that employs different train and test resolutions, aiming to resolve the discrepancy between the size of the objects seen by the classifier at train and test time. However, our method differs from theirs in two important aspects: (i) Motivation. FixRes focuses on this discrepancy and hopes to match the training and testing data distributions, while our method is inspired by the consistency of shape semantics and focuses on helping the model learn them. (ii) Design. FixRes operates models at much higher resolution at test time or much lower resolution at train time, and thus relies on manual fine-tuning for adaptation and on test-time augmentations. Our method only randomly reduces the training image resolution, with no other operations required. Besides, FixRes is only applicable to the mismatch caused by specific data augmentations (RandomResizedCrop and CenterCrop in the standard practice of training on ImageNet), while our method is applicable to all types of datasets and data augmentation.

Elad et al. [elad2019mix] propose MixSize, a stochastic training regime in which the image size as well as the batch size is modified at each optimization step. Our method differs from it in two aspects: (i) Granularity. MixSize involves multiple image sizes within each batch of training, while our method is an epoch-wise operation in which the image size remains the same within each epoch. (ii) Simplicity. MixSize may require several changes to a training pipeline, involving multiple input resolutions and changes to the batch size, the gradients, and the batch-norm layers. Our method improves computational efficiency with negligible modifications and can be implemented in a few lines of code.

Texture bias and shape bias of CNNs: Recent works [brendel2019approximating, geirhos2018imagenet, ballester2016performance, shi2020informative] argue that CNNs have a texture bias: CNNs classify images mainly according to their texture, while shape information seems hard for CNNs to exploit. Many works have been proposed to improve the shape bias of CNNs. Geirhos et al. [geirhos2018imagenet] use Stylized-ImageNet to help the model learn a shape-based representation. Li et al. [li2020shape] augment the training data with images of conflicting shape and texture information to train a debiased model. These methods seek a balance between texture and shape preference, but all at the cost of considerable extra computational overhead to generate style-transferred images or eliminate texture information, while our method improves the shape bias of CNNs while reducing the computational cost.

Figure 2: Class activation mapping (CAM) [zhou2016learning] visualizations on samples at two image resolutions (224×224 and 112×112). 'Baseline' denotes the vanilla ResNet-50 model, shown to clearly isolate the effect of our method. Note that TRD captures targets more accurately and shows a relatively uniform perception pattern across different image resolutions.

3 Our Approach

3.1 Algorithm

Temporally Resolution Decrement (TRD) is an effective albeit simple training strategy for CNNs. Instead of training the network at a fixed image resolution, we randomly reduce the training images to a low resolution in the time domain. Specifically, each epoch has a probability p of being selected into a set S; we train the CNN with low-resolution images when the present training epoch t ∈ S, and train normally at the original resolution on the other epochs t ∉ S. The resolution of the training images is thus set as:

r(t) = \begin{cases} r_s, & t \in S \\ r_o, & t \notin S \end{cases}

where t represents the present training epoch, and r_o and r_s represent the original and reduced resolution, respectively. Note that we take the epoch as the time slice for resolution change, rather than the batch or even a single image, because we consider each epoch of training as one cognitive process of the model. Under our method, in each cognitive process only the shape semantics is consistent, while the texture information is difficult to keep uniform due to its instability. Thus the correlation between the learned texture-related patterns and the correct label becomes spurious, since the texture-related patterns fail to hold in the same manner as they held in the past. This enforces the model to focus on the invariant and robust shape-related patterns, which has been shown to be beneficial in various ways.

Besides, fully connected layers in CNNs require fixed-size inputs by definition, so we replace the original average pooling layer with an adaptive average pooling layer to handle the dimension change. This operation does not alter any network parameters. Our method enjoys a plug-and-play property: it can be conveniently incorporated into almost any existing deep learning project by adding merely a few lines of code. We present a code-level description of the TRD algorithm in Appendix A.
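As an illustration, the pooling change amounts to a one-line edit in PyTorch. The following is a minimal sketch under the assumption of a torchvision-style model; note that torchvision's ResNets already ship with an adaptive pooling layer, so for them this edit is a no-op.

```python
import torch.nn as nn
import torchvision.models as models

model = models.resnet50()
# Swap a fixed-size global pooling layer (e.g., nn.AvgPool2d(7)) for an
# adaptive one: the output is always 1x1, so the flattened feature entering
# the fully connected layer keeps the same dimension at any input resolution,
# and no network parameters change.
model.avgpool = nn.AdaptiveAvgPool2d(output_size=(1, 1))
```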

3.2 Discussion

Analysis of training and validation error: We analyze the effect of our method on the training of ResNet-50 [he2016deep] on ImageNet [russakovsky2015imagenet] by comparing the top-1 training and validation accuracy curves of TRD against the baseline in Figure 3.

First of all, we observe that after training with low-resolution images, the network performance declines, but it rises immediately after training with the original images. The training and validation curves of TRD therefore both show a jagged up-and-down fluctuation. These fluctuations represent the stages at which the model encounters the inconsistency of the learned texture-related patterns. Then, after repeated fluctuations in which the model repeatedly 'forgets' the texture-related patterns, the model gradually comes to rely on the shape-related patterns when making decisions, as evidenced by the smaller performance gap in the late stage of training. Finally, TRD's performance on the validation set exceeds the baseline after epoch 150 and improves further after epoch 225, when the baseline suffers from overfitting with increasing validation error. These findings show that our method does no harm to the model and helps alleviate overfitting. More training and validation curves of TRD under different parameter settings can be found in Appendix B.

Figure 3: Comparison of training and validation accuracy between the baseline and TRD on ImageNet.

Network visualization: To analyze what the model learns with TRD, we visually compare the activation maps of our method against the baseline, using class activation maps (CAM) [zhou2016learning] at two image resolutions, shown in Figure 2.

We notice that TRD successfully forces the model to understand the content of the image from the perspective of shape. Consequently, the model trained with TRD captures targets more accurately. For example, when the cheeseburger is placed right next to the French fries, the baseline model takes the fries into consideration, which is in fact a symptom of shortcut learning [geirhos2020shortcut]. By contrast, the model trained with TRD, which absorbs more shape semantics, handles this co-occurrence situation well. Another interesting observation from the figure is that the model trained with TRD maintains a relatively uniform perception of images across resolutions, whereas the perception of the baseline model shifts markedly when the test resolution mismatches the original training resolution. These observations not only demonstrate the benefit of perceiving shape semantics, but also explain why TRD helps the model generalize across image resolutions.
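For readers who wish to reproduce such visualizations, the following is a minimal sketch of the standard CAM computation of [zhou2016learning] for a ResNet-50. It is not the authors' exact script: the hook placement and normalization details are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet50(pretrained=True).eval()

features = {}
def hook(module, inputs, output):
    features['maps'] = output  # final conv feature maps: (1, 2048, h, w)
model.layer4.register_forward_hook(hook)

def class_activation_map(image, class_idx):
    """image: (1, 3, H, W) tensor; returns an (H, W) heatmap for class_idx."""
    with torch.no_grad():
        model(image)                                # populates features['maps']
    weights = model.fc.weight[class_idx]            # (2048,) FC weights for the class
    maps = features['maps'][0]                      # (2048, h, w)
    cam = torch.einsum('c,chw->hw', weights, maps)  # weight and sum the feature maps
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return F.interpolate(cam[None, None], size=image.shape[-2:],
                         mode='bilinear', align_corners=False)[0, 0]
```

Because the adaptive pooling layer accepts any input size, the same function can be applied at both 224×224 and 112×112 inputs, as in Figure 2.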

4 Experiments

In this section, we investigate the effectiveness of our method on several major computer vision tasks. We first evaluate TRD combined with ResNet-50 on ImageNet. Next, we study the effect of TRD on object detection. Besides, we test the generalization of our method to different datasets and network structures, and its compatibility with other regularization methods, on image classification. All experiments are performed with PyTorch [paszke2017automatic] on Tesla M40 GPUs.

4.1 TRD significantly improves computational efficiency on ImageNet classification.

ImageNet-1K [russakovsky2015imagenet] contains 1.2M training images and 50K validation images labeled with 1K categories. We use the standard augmentation setting for ImageNet, including resizing, cropping, and flipping. For fair comparison, all models are trained from scratch for 300 epochs with batch size 256, and the learning rate is decayed by a factor of 0.1 at epochs 75, 150, and 225. We evaluate classification accuracy on the validation set, and the highest validation accuracy over the full training course is reported, following common practice. Besides, FLOPs for network training are also evaluated. Since our method uses different training resolutions in different epochs, we report the weighted average of training FLOPs over the time domain, denoted mFLOPs, for comparison. We explore the performance of different data augmentation methods on ResNet-50 [he2016deep]. For Cutout [devries2017improved], the mask size is set to 112×112 and the location for dropping out is uniformly sampled. The hyper-parameter of Mixup [zhang2017mixup] is set to 1. We test the performance of our method at three reduced resolutions r ∈ {112, 84, 56} under three participation rates p ∈ {0.5, 0.7, 0.9}. The results are shown in Table 1.
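As a sanity check on the mFLOPs column, note that mFLOPs is just the expected per-epoch training cost. Assuming convolutional FLOPs scale quadratically with the input side length (so a 112×112 input costs about one quarter of a 224×224 one), the entries can be reproduced as

\text{mFLOPs} = p \cdot F(r_s) + (1 - p) \cdot F(r_o),

where F(·) denotes the per-epoch FLOPs at a given resolution. For example, with r_s = 112 and p = 0.5, F(112) ≈ 4.12/4 ≈ 1.03 G, giving mFLOPs ≈ 0.5 × 1.03 + 0.5 × 4.12 ≈ 2.58 G, which matches the corresponding row of Table 1.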

Method mFLOPs (G) Top-1 (%)
ResNet-50 (224×224) 4.12 76.32
+Cutout 4.12 77.07
+Mixup 4.12 77.42
+AutoAugment* - 77.63
+DropBlock* - 78.13
+TRD (r=112, p=0.5) 2.58 (-37%) 78.16
+TRD (r=112, p=0.7) 1.96 (-52%) 77.96
+TRD (r=112, p=0.9) 1.34 (-67%) 77.71
+TRD (r=84, p=0.5) 2.35 (-43%) 78.18
+TRD (r=84, p=0.7) 1.64 (-60%) 77.80
+TRD (r=84, p=0.9) 0.93 (-77%) 77.26
+TRD (r=56, p=0.5) 2.19 (-47%) 77.62
+TRD (r=56, p=0.7) 1.42 (-66%) 76.75
+TRD (r=56, p=0.9) 0.64 (-84%) 75.20
Table 1: Summary of mFLOPs and validation accuracy for ImageNet classification with ResNet-50. 'mFLOPs' is the weighted average of training FLOPs over the time domain. '*' marks results reported in the original papers.

Higher network performance: We measure the top-1 accuracy of the proposed method against data augmentation and regularization methods. The improvement of TRD varies from 0.43% to 1.86% under different parameter settings. In particular, randomly reducing the training image resolution from 224×224 to 84×84 in 50% of epochs achieves 78.18% top-1 accuracy on ResNet-50. Cutout and Mixup are representative data augmentation methods that generate more useful data from existing samples, and AutoAugment [cubuk2018autoaugment] uses reinforcement learning to find a combination of existing augmentation policies. Nevertheless, TRD (r=84, p=0.5) outperforms Cutout, Mixup, and AutoAugment by +1.11%, +0.76%, and +0.55%, respectively. Besides, we notice that even when the training images of 90% of epochs are reduced to the low resolution of 84×84, TRD still brings a significant gain to the network (+0.94%). This shows that training with only 10% normal-resolution images can still yield better performance, as long as the network can learn shape information from the comparison between different resolutions. In general, training a network only with low-resolution images causes a drop in accuracy, but these results imply that low-resolution images can aid network training by providing a comparison with the original images in the time domain.

Lower training computational cost: We also measure the average training FLOPs of TRD. The results in Table 1 show that our method greatly reduces the training overhead, by 37% up to a striking 84%, an attendant benefit of shrinking the image resolution: the matrix and floating-point operations are greatly reduced when the network is trained with low-resolution images. Note that TRD (r=84, p=0.9) still achieves 0.94% higher accuracy than the baseline while saving 77% of the computational overhead, which suggests that fixed-resolution training wastes large quantities of computing resources, especially on ImageNet. These results also illustrate that TRD can accelerate network training and dynamically save memory in large cluster scenarios such as Google Cloud TPU, thereby reducing carbon emissions during training. Besides, we notice that the network performance declines only when 90% of the training images are reduced to 56×56 (one sixteenth of the area of the original image), which shows that the parameter range of our method is very broad and that TRD creates a win-win situation between computational cost and network performance without cautious balancing.

4.2 TRD achieves astonishing generalization on varying image resolutions.

Besides the standard testing protocol in which images are resized to 224×224 on ImageNet, we also perform a stress test on the ImageNet validation set, comparing the performance of the baseline and of models trained with TRD at different test resolutions. Three models trained with TRD, which randomly reduce 224×224 images to 112×112 within 50%, 70%, and 90% of epochs, are chosen for comparison. Figure 4 shows the relative improvement of our method over the baseline. TRD performs better than the baseline at all resolutions, and this improvement grows with the widening gap between training and test resolution. A network trained at a fixed resolution only memorizes the texture information of its own scale, so its accuracy drops quickly at other scales. TRD, in contrast, improves the model's generalization across image resolutions by capturing more shape representations, so the larger the resolution difference, the larger the relative improvement of TRD. Besides, although we train the network with only two resolutions, our method improves recognition accuracy at all test resolutions, which again indicates that the model learns the invariance among different resolutions, i.e., the shape-based representation, instead of simply memorizing the texture information of two resolutions.
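The adaptive pooling layer is what makes this stress test possible with a single set of weights. The sketch below illustrates the kind of evaluation loop we have in mind; the resolution grid is illustrative, and `model` and `val_loader` are assumed to be a trained network and a standard 224×224 ImageNet validation loader.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def accuracy_at_resolution(model, val_loader, resolution):
    # val_loader is assumed to yield (images, labels) batches at 224x224.
    model.eval()
    correct = total = 0
    for images, labels in val_loader:
        if resolution != images.shape[-1]:
            # Rescale the batch to the stress-test resolution.
            images = F.interpolate(images, size=resolution,
                                   mode='bilinear', align_corners=False)
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total

# Illustrative resolution grid for the stress test:
for res in (64, 96, 128, 160, 192, 224, 288, 352):
    print(res, accuracy_at_resolution(model, val_loader, res))
```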

Figure 4: Relative improvement of TRD over the baseline model at different test resolutions.

4.3 Comparison with other methods.

We also compare TRD with other methods on ImageNet to verify the superiority of the proposed approach. The compared methods comprise representative model compression methods, such as Sparse Structure Selection (SSS) [huang2018data], Versatile Filters [wang2018learning], PFP [liebenwein2019provable], and C-SGD [ding2019centripetal], and recent adaptive inference methods, including RANet [yang2020resolution] and DRNet [ming2021dynamic]. As shown in Table 2, TRD achieves better performance than the other methods with lower training mFLOPs. Note that model compression methods improve computational efficiency by optimizing the network structure, while TRD achieves this goal by simply changing the training strategy.

Models mFLOPs (G) Top-1 (%)
ResNet-50 (Baseline) 4.12 76.32
SSS-ResNet-50 2.8 (-32%) 74.2
Versatile-ResNet-50 3.0 (-27%) 74.5
PFP-A-ResNet-50 3.7 (-10%) 75.9
C-SGD70-ResNet-50 2.6 (-37%) 75.3
RANet 2.3 (-44%) 74.0
DR-ResNet-50 3.7 (-32%) 77.5
DR-ResNet-50() 2.3 (-44%) 75.3
TRD (r=112, p=0.5) 2.58 (-37%) 78.16
TRD (r=84, p=0.5) 2.35 (-43%) 78.18
TRD (r=56, p=0.5) 2.19 (-47%) 77.62
Table 2: Comparison with other methods on the ImageNet dataset.
Models (CIFAR-100, 32×32) | Vanilla Top-1 / mFLOPs | TRD (r=16, p=0.3) Top-1 / mFLOPs | TRD (r=16, p=0.5) Top-1 / mFLOPs
ResNet-56 73.71% / 90.34M 75.10% / 70.01M 74.57% / 56.45M
ResNet-110 74.71% / 173.3M 76.54% / 134.31M 75.70% / 108.32M
VGGNet-19-BN 73.14% / 399.47M 73.52% / 309.59M 72.67% / 249.67M
DenseNet-100-12 77.25% / 297.86M 78.22% / 230.84M 78.11% / 186.17M
DenseNet-190-40 82.43% / 9.4G 84.06% / 7.29G 83.77% / 5.88G
Wide ResNet-28-10 81.27% / 5.25G 81.45% / 4.07G 80.06% / 3.28G

Table 3: Top-1 accuracy (%) and mFLOPs with and without TRD using different architectures on CIFAR-100. The results show that TRD is suitable for CNN models with different structures.
Datasets Models | Vanilla Top-1 / mFLOPs | TRD (p=0.3) Top-1 / mFLOPs | TRD (p=0.5) Top-1 / mFLOPs
CIFAR-10 (32×32) ResNet-110 94.32% / 173.3M 95.32% / 134.31M 95.07% / 108.32M
CIFAR-100 (32×32) ResNet-110 74.71% / 173.3M 76.54% / 134.31M 75.70% / 108.32M
Tiny ImageNet (64×64) ResNet-110 62.42% / 693.26M 64.39% / 537.29M 64.57% / 433.31M
ImageNet (224×224) ResNet-50 76.32% / 4.12G 77.94% / 3.21G 78.16% / 2.58G
CUB-200-2011 (448×448) ResNet-50 66.73% / 16.47G 66.92% / 12.52G 66.54% / 10.3G
Stanford Dogs (224×224) ResNet-50 61.38% / 4.12G 66.18% / 3.21G 63.12% / 2.58G
Table 4: Top-1 accuracy (%) and mFLOPs with and without TRD on different datasets with ResNet-110 and ResNet-50; in each case the reduced resolution halves the original side length. The results show that TRD is suitable for different kinds of datasets.

4.4 TRD improves object detection performance.

In this subsection, we show that TRD can also be applied to training an object detector on the Pascal VOC [everingham2010pascal] dataset. We use the RetinaNet [lin2017focal] framework, composed of a backbone network and two task-specific subnetworks, for the experiments. The ResNet-50 backbone, which is responsible for computing a convolutional feature map over the entire input image, is initialized with an ImageNet-pretrained model and then fine-tuned on the Pascal VOC 2007 and 2012 trainval data. Models are evaluated on the VOC 2007 test data using the mAP metric. We follow the fine-tuning strategy of the original method.

Models mAP (%)
RetinaNet (baseline) 70.14
+TRD Pre-trained (r=112, p=0.5) 71.61
+TRD Pre-trained (r=112, p=0.7) 71.59
Table 5: Object detection results on Pascal VOC with RetinaNet.

As shown in Table 5, the models pre-trained with TRD achieve higher accuracy than the baseline. The results suggest that models trained with TRD can better capture the target objects. Besides, although the models trained with TRD at different participation rates differ in image classification accuracy, their performance on object detection is consistent. This finding offers a compellingly simple explanation of how TRD helps network training: TRD helps the network learn the shape-based representation.

4.5 TRD can be universally applied to various architectures.

We extensively evaluate TRD on various architectures: VGGNet-19-BN [simonyan2014very], ResNet-56, ResNet-110, DenseNet-100-12 [huang2017densely], DenseNet-190-40, and Wide ResNet-28-10 [zagoruyko2016wide]. All experiments are performed on CIFAR-100 [krizhevsky2009learning]. For fair comparison, all experiments strictly use the same preprocessing and data augmentation strategies, such as random flipping and cropping, following common practice. For TRD, we test the performance when randomly reducing the training image resolution from 32×32 to 16×16 in 30% and 50% of epochs. Notably, the results in Table 3 show that our method consistently offers significant performance gains over the baselines across different structures, such as an impressive 1.67% top-1 error reduction for DenseNet-190-40 on CIFAR-100. This also means our method can reduce the running time of large-scale networks and speed up the training of small networks while improving accuracy. Besides, although the resolution of images in CIFAR-100 is 32×32, the network can still gain from training on 16×16 images, which demonstrates TRD's generalization across image resolutions. On CIFAR-10 [krizhevsky2009learning], our method also achieves stable improvements with these architectures; the analysis can be found in Appendix C.

4.6 TRD is generally effective on multiple benchmark datasets.

We then extend the experiments to other benchmark datasets. In particular, we use the following six: CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet, CUB-200-2011 [wah2011caltech], and Stanford Dogs [khosla2011novel]. We consider that these datasets cover a wide range of scenarios in computer vision. For TRD, we test the performance when randomly reducing the training image resolution to one quarter (in area) in 30% and 50% of epochs. As shown in Table 4, TRD suits datasets with different image resolutions, numbers of images, and numbers of classes. Besides, CUB-200-2011 and Stanford Dogs are representative fine-grained image classification datasets; the performance of TRD on these two shows that shape information has a strong positive effect even on fine-grained classification, which hinges on more subtle differences. We also verify that TRD is compatible with regularization and data augmentation techniques such as Label Smoothing [szegedy2016rethinking], Cutout [devries2017improved], Mixup [zhang2017mixup], and CutMix [yun2019cutmix]; the results are shown in Appendix D due to space limitations.

4.7 Ablation Studies

We conduct ablation studies on the Tiny ImageNet dataset (64×64) with ResNet-110.

Figure 5: Tiny ImageNet validation accuracy of TRD on ResNet-110 against the reduced image size (r) and the participation rate (p). The red dotted line indicates the baseline performance.

Analysis of the reduced resolution r: We evaluate TRD for reduced resolutions r ∈ {8, 16, 24, 32, 40, 48, 56, 64} with the participation rate p set to 0.5, as shown in the left part of Figure 5. When r is set to 56, the performance improvement is not obvious: the difference between a 56×56 image and a 64×64 image is small, so the network cannot capture the invariance (the shape representation) from the comparison of two similar resolutions. As the image resolution decreases further, network performance continues to improve. Note that a performance improvement (+1.68%) is still achieved when the image is reduced to one sixteenth of its area (16×16), which indicates that shape semantics can be preserved in very small images and that the choices for the parameter r are very broad. However, when r is set to 8, the performance is lower than the baseline: an 8×8 image is too small to carry basic shape semantics, which results in invalid training.

Effect of the participation rate p: We refer to the ratio of epochs using reduced images to all epochs as the participation rate p. We set the reduced resolution r to 32×32 and show the effect of p in the right part of Figure 5. As p increases, the accuracy first rises and then decreases after reaching its peak at p = 0.4, at which the network is trained with 60% normal images and 40% reduced images. Besides, reducing the images in 80% of epochs still matches the baseline performance while saving 60% of the computation. When training with 90% reduced images, performance decreases, which shows that training with only 10% normal images does not let the network learn enough information on Tiny ImageNet.

Influence of different design choices: We explore different design choices for TRD. Table 6 shows the performance of several TRD variants. 'Batch-Wise TRD' randomly reduces 50% of the training images to 16×16 within every training batch, but it lowers network performance by 1.1%. This result suggests that reducing training images epoch-wise helps the network capture shape semantics, whereas reducing them batch-wise may disturb learning. 'Mix-Resolution TRD' randomly reduces the training images of half of the epochs to resolutions ranging from 32×32 to 64×64. 'Three-Resolution TRD' randomly reduces the training images of half of the epochs to 32×32 or 48×48. 'Regular TRD' reduces the training images of every even-numbered epoch to 32×32. All of these variants degrade performance compared with the original TRD, which shows that the comparison between two resolutions works better on Tiny ImageNet and that randomness helps the network learn from the resolution change.

Methods Top-1 (%)
ResNet-110 (baseline) 62.42
+Proposed TRD (r=32, p=0.5) 64.57
+Batch-Wise TRD 61.32
+Mix-Resolution TRD 64.01
+Three-Resolution TRD 64.26
+Regular TRD 63.85
Table 6: Performance of TRD variants on Tiny ImageNet.

5 Conclusion

In this paper, we observe that texture information is devastated while shape semantics maintains a relatively complete form under image zooming. This consistency of shape semantics drives us to propose a new training strategy, TRD: we randomly select some training epochs with a certain probability and reduce the training images to a smaller resolution in those epochs. TRD is very simple and requires merely a few lines of code modification. Training with different resolutions in an epoch-wise fashion makes the network capture the shape semantics of training images and, surprisingly, improves computational efficiency. The robust shape-based representation also yields generalization across varying image resolutions, conforming to the superiority of the human vision system.

References

Appendix A Code-level Description of TRD Algorithm

We present a code-level description of the TRD algorithm below. At the beginning of each epoch, we draw a random number uniformly distributed between 0 and 1; if its value is less than the threshold, i.e., the participation rate p, the training images of every batch in this epoch are reduced to a specific smaller size, i.e., the reduced resolution r. Note that in practice, when adopting a step-decay policy, we usually do not reduce the image resolution in the two epochs before and after each learning-rate decrease, to avoid probabilistic training instability. TRD is very simple and requires merely a few lines of code modification, so it can easily be used in different tasks for higher computational efficiency.
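The following sketch reflects this description; it is a minimal PyTorch rendering rather than the authors' released code, and the function names, the optimizer wiring, and the exact ±2-epoch stabilization window around learning-rate decay are illustrative assumptions.

```python
import random
import torch.nn.functional as F

DECAY_EPOCHS = {75, 150, 225}  # step-decay schedule used for ImageNet in Section 4.1

def near_lr_decay(epoch, window=2):
    # Hypothetical helper: skip resolution reduction in the epochs
    # immediately before and after a learning-rate decrease.
    return any(abs(epoch - d) <= window for d in DECAY_EPOCHS)

def train_with_trd(model, train_loader, optimizer, criterion,
                   num_epochs=300, r_reduced=112, p=0.5):
    model.train()
    for epoch in range(num_epochs):
        # One uniform draw per epoch: every batch in this epoch
        # shares the same resolution (the epoch-wise design of TRD).
        reduce_this_epoch = random.random() < p and not near_lr_decay(epoch)
        for images, labels in train_loader:
            if reduce_this_epoch:
                # Downsample each batch of this epoch to r_reduced x r_reduced.
                images = F.interpolate(images, size=r_reduced,
                                       mode='bilinear', align_corners=False)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```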

Appendix B Training and Validation Curves of TRD

We show more top-1 training and validation accuracy curves of TRD in Figure 7 and Figure 8.

First of all, we observe that all TRD curves show a jagged up-and-down fluctuation, and curves with a higher participation rate fluctuate more frequently. Secondly, the model trained with low-resolution images has lower training accuracy, and the lower the resolution, the lower the training accuracy. But TRD's performance on the validation set finally exceeds the baseline after epoch 150 and improves further after epoch 225, when the baseline suffers from overfitting with increasing validation error, with the exception of the TRD setting that is trained with too little information. These findings show that our method does no harm to the model and helps alleviate overfitting.

Appendix C TRD’s Performance on various architectures of CIFAR-10

We extensively evaluate TRD on various architectures: VGGNet-19-BN [simonyan2014very], ResNet-56, ResNet-110, DenseNet-100-12 [huang2017densely], DenseNet-190-40, and Wide ResNet-28-10 [zagoruyko2016wide]. All experiments are performed on CIFAR-10 and CIFAR-100 [krizhevsky2009learning]. For fair comparison, all experiments strictly use the same preprocessing and data augmentation strategies, such as random flipping and cropping, following common practice. For TRD, we test the performance when randomly reducing the training image resolution from 32×32 to 16×16 in 30% and 50% of epochs. Notably, the results in Table 8 show that our method consistently offers significant performance gains over the baselines across different structures, such as an impressive 1.37% top-1 error reduction on ResNet-110. This also means our method can reduce the running time of large-scale networks and speed up the training of small networks while improving accuracy. Besides, although the resolution of images in CIFAR-10 and CIFAR-100 is 32×32, the network can still gain from training on 16×16 images, which demonstrates TRD's generalization across image resolutions.

Appendix D TRD is compatible with various techniques.

In practice, scenarios in which more than one regularization or data augmentation technique is used occur frequently; compatibility with other techniques is therefore an important indicator of practical utility. Hence, we further apply TRD together with some commonly used regularization and data augmentation techniques: Label Smoothing [szegedy2016rethinking], Cutout [devries2017improved], Mixup [zhang2017mixup], and CutMix [yun2019cutmix]. As can be seen from Table 7, TRD makes further improvements while reducing the overhead, even on top of strong techniques, which shows the wide applicability of our method and its compatibility with other methods.

Methods Vanilla (%) +TRD () (%)
Label Smoothing 62.46 65.03
Cutout 64.71 65.41
Mixup 65.34 65.97
CutMix 66.13 66.55
Table 7: Top-1 accuracy (%) of ResNet-110 (the top-1 accuracy of the vanilla model is 62.42%) with the combination of TRD and some commonly used techniques on Tiny ImageNet.


Appendix E More CAM Visualizations

For better comparison, we include more CAM visualization examples in Figure 6.

Figure 6: Supplementary class activation mapping (CAM) visualizations on samples at two image resolutions (224×224 and 112×112). 'Baseline' denotes the vanilla ResNet-50 model, shown to clearly isolate the effect of our method.


Figure 7: Training and validation accuracy comparison between the baseline and TRD on ImageNet (one parameter setting; see Appendix B).
Figure 8: Training and validation accuracy comparison between the baseline and TRD on ImageNet (another parameter setting; see Appendix B).
Models | Vanilla Top-1 / mFLOPs | TRD (r=16, p=0.3) Top-1 / mFLOPs | TRD (r=16, p=0.5) Top-1 / mFLOPs
CIFAR-10 (32×32):
ResNet-56 93.84% / 90.34M 94.76% / 70.01M 94.37% / 56.45M
ResNet-110 94.32% / 173.3M 95.32% / 134.31M 95.07% / 108.32M
VGGNet-19-BN 93.55% / 399.47M 94.28% / 309.59M 93.94% / 249.67M
DenseNet-100-12 95.30% / 297.86M 95.73% / 230.84M 95.65% / 186.17M
DenseNet-190-40 96.62% / 9.4G 96.93% / 7.29G 96.84% / 5.88G
Wide ResNet-28-10 96.16% / 5.25G 96.40% / 4.07G 96.22% / 3.28G
CIFAR-100 (32×32):
ResNet-56 73.71% / 90.34M 75.10% / 70.01M 74.57% / 56.45M
ResNet-110 74.71% / 173.3M 76.54% / 134.31M 75.70% / 108.32M
VGGNet-19-BN 73.14% / 399.47M 73.52% / 309.59M 72.67% / 249.67M
DenseNet-100-12 77.25% / 297.86M 78.22% / 230.84M 78.11% / 186.17M
DenseNet-190-40 82.43% / 9.4G 84.06% / 7.29G 83.77% / 5.88G
Wide ResNet-28-10 81.27% / 5.25G 81.45% / 4.07G 80.06% / 3.28G

Table 8: Top-1 accuracy (%) and mFLOPs with and without TRD using different architectures on CIFAR-10 and CIFAR-100. The results show that TRD is suitable for CNN models with different structures.