Exploiting Features with Split-and-Share Module

08/10/2021 ∙ by Jaemin Lee, et al.

Deep convolutional neural networks (CNNs) have shown state-of-the-art performance in various computer vision tasks. Advances in CNN architectures have focused mainly on designing the convolutional blocks of the feature extractors, and less on the classifiers that exploit the extracted features. In this work, we propose the Split-and-Share Module (SSM), a classifier that splits a given feature into parts which are partially shared by multiple sub-classifiers. Our intuition is that the more a feature is shared, the more common it will become, and SSM can encourage such structural characteristics in the split features. SSM can be easily integrated into any architecture without bells and whistles. We have extensively validated the efficacy of SSM on the ImageNet-1K classification task, where SSM shows consistent and significant improvements over baseline architectures. In addition, we analyze the effect of SSM using the Grad-CAM visualization.


1 Introduction

Deep convolutional neural networks (CNNs) achieve high performance in various computer vision tasks [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei, Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick, Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele, Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre]. A general anatomy of a CNN splits the architecture into two parts: a feature extractor and a classifier [Bengio et al.(2013)Bengio, Courville, and Vincent]. A feature extractor consists of conv-blocks made of normalization layers, convolutional layers, non-linear activations [Nair and Hinton(2010)], and pooling layers. Designing a CNN architecture largely amounts to finding a good conv-block and stacking it repeatedly. ResNet [He et al.(2016)He, Zhang, Ren, and Sun] added identity-based skip connections to the conv-block to enable stable training even when the conv-block is stacked deeply. The Xception [Chollet(2017)] structure, developed from the Inception [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] structure, utilizes depthwise separable convolutions with 1x1 convolutions to significantly lower the computation of the network while even improving its performance. Accordingly, the recent trend in neural architecture search [Zoph et al.(2018)Zoph, Vasudevan, Shlens, and Le, Real et al.(2019)Real, Aggarwal, Huang, and Le, Tan and Le(2019)] focuses on designing better conv-blocks in a data-driven way. While the classifier is also a crucial part of a CNN, less attention has been paid to designing better classifiers. In this study, we focus on designing a classifier that further exploits a given feature vector. To the best of our knowledge, most CNN architectures simply adopt one or more linear combinations as the classifier.

In this work, we propose a novel classifier, named the Split-and-Share Module (SSM). SSM divides the given feature into several groups of channels, and the groups are partially shared among sub-classifiers. Each group of channels has a different degree of sharing; our intuition is that the most shared group will contain the most general features, and vice versa. This split-and-share method encourages diversity of the features by structure, and thus diversity of the sub-classifiers, leading to higher performance when ensembled.

Figure 1: An overview of SSM. The illustrated example has 2048 channels in the final feature vector, and the output is 1000-way classification.

Figure 1 shows the structure of the proposed SSM. Given a feature vector extracted from the backbone network (feature extractor), SSM splits the feature into 4 groups, and each group is fed into a designated sub-classifier. The final output is the average of the outputs from the sub-classifiers. The smallest group, illustrated as the bottom group in Fig. 2, is shared by all sub-classifiers and must also contribute to the final prediction on its own, so it is encouraged to learn more common and general features within its limited number of channels. On the other hand, the least shared channels, illustrated as the top group in Fig. 2, will learn additional features such as contextual information. The Grad-CAM [Selvaraju et al.(2017)Selvaraju, Cogswell, Das, Vedantam, Parikh, and Batra] visualization in Fig. 2 qualitatively supports this intuition: for the acoustic guitar example, the heatmaps shift from the core characteristics of the guitar in the most shared channels to additional characteristics of the scene in the least shared channels. SSM shows stable performance improvements on architectures such as ResNet [He et al.(2016)He, Zhang, Ren, and Sun] and ResNeXt [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He], and, being a simple structure consisting of BatchNorm [Ioffe and Szegedy(2015)] and ReLU, it is easy to attach to any CNN architecture.

The sub-classifiers may resemble an ensemble, which may raise the concern that SSM-augmented networks would gain less from further ensembling. In our experiments, we show that an SSM-augmented network can be further improved by ensembling without any compromise.

Figure 2: Grad-CAM visualization of channels with respect to sub-classifiers. ResNet-50+SSM is used for visualization, and the final feature size is 2048. Each column visualizes 512 channels. The first column is visualized with respect to FC1, the second column is visualized w.r.t. FC2, and so on. More details will be described in Sec. 5.

2 Related works

2.1 Deeper architectures

Starting with AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton], many CNN structures have been proposed. VGGNet [Simonyan and Zisserman(2014)] showed significant performance improvement by increasing network depth. Batch normalization [Ioffe and Szegedy(2015)] stabilized training by normalizing the input to each layer over mini-batches. Building on these developments, ResNet proposed identity-based skip connections to deepen the network, greatly improving the performance of CNNs. Since then, studies have discovered CNN structures through architecture search, such as NASNet [Zoph et al.(2018)Zoph, Vasudevan, Shlens, and Le] and EfficientNet [Tan and Le(2019)], as well as methods based on evolutionary algorithms such as AmoebaNet [Real et al.(2019)Real, Aggarwal, Huang, and Le]. Studies of these CNN structures have been continuously proposed and have attracted great attention, but throughout this progress the classifier has been left out. We study how to better exploit the features already well extracted by the feature extractor, and our module could also be incorporated into architecture search later.

2.2 Feature analysis

Various analyses of features have been proposed. Ilyas et al. [Ilyas et al.(2019)Ilyas, Santurkar, Tsipras, Engstrom, Tran, and Madry] showed experimentally that features can be divided into robust and non-robust ones, rather than simply into useful and useless ones. From their perspective, the features used for prediction include both robust and non-robust features; both are useful for prediction but carry different meanings, leaving room to exploit these characteristics.

Aflalo et al. [Aflalo et al.(2020)Aflalo, Noy, Lin, Friedman, and Zelnik] showed that using all the features entering the classifier does not necessarily improve the accuracy of a CNN, and that removing unnecessary features through pruning can improve both the performance and the computational speed of the CNN.

In this respect, our SSM is also a new way of analyzing and exploiting features. SSM assigns roles to the features used for prediction through backpropagation, and we qualitatively and quantitatively analyze the effect on the network according to the location and number of such features.

3 Split-and-Share Module

In this section, we describe how SSM is formulated. SSM is a simple classifier that splits features and shares them among multiple sub-classifiers. The overall architecture of SSM is illustrated in Fig. 1, and the pseudo-code is described in Algorithm 1.

First, SSM equally divides the input feature into 4 splits and sequentially appends the splits one by one to formulate 4 features with different numbers of channels. For example, given the feature f with C = 2048 channels, the first feature f1 contains the first 1/4 of the channels, i.e., f1 = f[1:512]. Accordingly, f2 contains the first 1/2 of the channels, and so on. In order to diversify the 4 features while keeping the feature domain with minimum overhead, we apply BatchNorm with ReLU to the first 3 features for simple scaling and non-linear activation. The resulting 4 features will have the same semantic meaning with different scales for the shared channels, and channels in the 4 features can be zeroed out by ReLU. BatchNorm and ReLU are essential in SSM, as they add extra non-linearity to the overall process; without them, SSM reduces to a simple linear combination (fully-connected) layer. After splitting, recombining, and re-scaling, the 4 features are fed to 4 sub-classifiers. Each sub-classifier is a simple fully-connected layer whose output dimension is the number of classes. The final output of SSM is the average of the 4 outputs from the sub-classifiers.
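As a concrete illustration, the nested prefix splits described above can be sketched as follows (a minimal sketch; the array names are ours):

```python
import numpy as np

# Hypothetical sketch of SSM's nested channel splits (names are ours).
C, N = 2048, 4            # feature channels, number of sub-classifiers
d = C // N                # channels per split: 512

f = np.random.randn(C)    # a feature vector from the backbone

# The i-th sub-classifier sees the first i*d channels, so the first
# d channels are shared by all four sub-classifiers.
prefixes = [f[: (i + 1) * d] for i in range(N)]

print([p.shape[0] for p in prefixes])  # → [512, 1024, 1536, 2048]
```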

The key intuition of our design is to partially share the given feature. The first 1/4 of the channels are shared among all sub-classifiers: they are forwarded 4 times and back-propagated 4 times. As they are the most frequently used channels, we expect them to be trained into the most important key features. In contrast, the last 1/4 of the channels are used only by the last sub-classifier, so they are expected to contain additional features, such as context information about the surrounding environment. We visualize the 4 splits of channels with the Grad-CAM visualization technique in Fig. 2 and Fig. 3, and more analysis is discussed in Sec. 5.

1:procedure SSM(f, C=2048, N=4)
2:     d ← INT(C / N)
3:     outs ← [ ]
4:     for i = 1 to N do
5:         f_i ← f[1 : i·d]
6:         if i < N then
7:             f_i ← BatchNorm_i(f_i)
8:             f_i ← ReLU(f_i)
9:         end if
10:         o_i ← FC_i(f_i)
11:         outs.append(o_i)
12:     end for
13:     out ← outs.sum() / N
14:     return out
15:end procedure
Algorithm 1 Split-and-Share Module
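Algorithm 1 can be sketched as a PyTorch module. This is our sketch, not the authors' code; the layer names are ours, and we apply BatchNorm+ReLU only to the first N-1 prefix features, as described in Sec. 3.

```python
import torch
import torch.nn as nn

class SSM(nn.Module):
    """Sketch of the Split-and-Share Module following Algorithm 1."""

    def __init__(self, channels=2048, num_classes=1000, n_splits=4):
        super().__init__()
        self.d = channels // n_splits   # channels per split
        self.n = n_splits
        # BatchNorm + ReLU for the first n-1 prefix features only.
        self.norms = nn.ModuleList(
            nn.Sequential(nn.BatchNorm1d((i + 1) * self.d),
                          nn.ReLU(inplace=True))
            for i in range(n_splits - 1)
        )
        # One fully-connected sub-classifier per prefix feature.
        self.fcs = nn.ModuleList(
            nn.Linear((i + 1) * self.d, num_classes)
            for i in range(n_splits)
        )

    def forward(self, f):                     # f: (batch, channels)
        outs = []
        for i in range(self.n):
            x = f[:, : (i + 1) * self.d]      # nested prefix of channels
            if i < self.n - 1:                # the last prefix is used as-is
                x = self.norms[i](x)
            outs.append(self.fcs[i](x))
        # Final output: average of the sub-classifier logits.
        return torch.stack(outs).mean(0)
```

In practice the module would replace the final fully-connected layer of a backbone such as ResNet-50, taking the pooled 2048-channel feature as input.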

4 Experiments

In this section, we validate the efficacy of the proposed SSM on various architectures and analyze its effect in several aspects. First, we apply SSM to ResNet and ResNeXt architectures on the ImageNet-1K classification dataset [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei]. SSM shows performance improvements in most cases; details are described in Sec. 4.1. In Sec. 4.2, we describe ablation studies of SSM.

4.1 ImageNet-1K classification

The ImageNet-1K dataset [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] consists of 1.28 million training images and 50k validation images. We follow the training details of ResNet [He et al.(2016)He, Zhang, Ren, and Sun]. During training, images are resized to 256x256 and randomly cropped to 224x224 patches with random horizontal flipping. During testing, images are also resized to 256x256, and a single 224x224 patch is cropped at the center. For both training and testing, images are normalized with the mean and standard deviation of all pixels in the dataset. We adopt He's method [He et al.(2015)He, Zhang, Ren, and Sun] for random weight initialization. We use the SGD optimizer with a base learning rate of 0.1 and a batch size of 256. The learning rate is reduced by one-tenth at epochs 30 and 60, and the total number of epochs is 90. The weight decay is set to 0.0001 and the momentum to 0.9.
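The optimizer and schedule above can be sketched as follows (a stand-in linear layer replaces the actual network, and the training loop body is elided):

```python
import torch
from torch import nn, optim

# Sketch of the training recipe from Sec. 4.1; `model` is a stand-in.
model = nn.Linear(2048, 1000)
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
# The learning rate is divided by 10 at epochs 30 and 60.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer,
                                           milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one epoch of training on ImageNet-1K would go here ...
    scheduler.step()

# After both decays: 0.1 -> 0.01 -> 0.001.
final_lr = optimizer.param_groups[0]["lr"]
```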

Architecture Dataset Epoch Top-1 Acc
ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun] ImageNet-1K 90 70.04%
ResNet-18 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM ImageNet-1K 90 70.05%(+0.01%)
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] ImageNet-1K 90 75.65%
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM ImageNet-1K 90 76.68%(+1.03%)
ResNet-101 [He et al.(2016)He, Zhang, Ren, and Sun] ImageNet-1K 90 76.62%
ResNet-101 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM ImageNet-1K 90 77.93%(+1.31%)
ResNeXt50 [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He] ImageNet-1K 90 77.19%
ResNeXt50 [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He] + SSM ImageNet-1K 90 77.96%(+0.77%)
ResNeXt101 [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He] ImageNet-1K 90 78.46%
ResNeXt101 [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He] + SSM ImageNet-1K 90 79.68%(+1.22%)
Table 1: Classification results on ImageNet-1K. Single-crop top-1 validation accuracies are reported.

The experimental results are summarized in Table 1. SSM consistently improves performance on all architectures except ResNet-18, which shows no meaningful change. The distinctive difference between ResNet-18 and the other architectures is that the final feature of ResNet-18 has 512 channels, while the others have 2048. We therefore assume that the number of channels in the final feature must be large enough for SSM to be effective. In all architectures except ResNet-18, the improvement is significant, and the absolute improvements on larger architectures exceed those on smaller ones: ResNet-101 improves by 1.31% in top-1 accuracy versus 1.03% for ResNet-50, and ResNeXt-101 improves by 1.22% versus 0.77% for ResNeXt-50.

4.2 Ablation studies and analysis

4.2.1 Training scheme for sub-classifiers

There are two simple ways to train the 4 sub-classifiers: apply the classification loss to each sub-classifier's output individually, or apply the loss to the average of the outputs. The former requires each sub-classifier to learn to classify independently, after which the 4 sub-classifiers are ensembled; the latter lets the sub-classifiers learn to classify jointly. The results are summarized in Table 2. When trained individually, the sub-classifiers' accuracies are much higher than those of the jointly trained ones. Interestingly, the final ensemble accuracy is significantly higher for joint training. This indicates that joint training encourages the sub-classifiers to take on different roles that create synergy, so the final ensemble performance exceeds that of independent training.
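The two loss formulations can be sketched as follows (toy random logits stand in for the real sub-classifier outputs; variable names are ours):

```python
import torch
import torch.nn.functional as F

# Sketch of the two training schemes compared in Sec. 4.2.1.
torch.manual_seed(0)
logits = [torch.randn(8, 10) for _ in range(4)]  # 4 heads, batch 8, 10 classes
targets = torch.randint(0, 10, (8,))

# (a) "SSM-individual": a separate classification loss per sub-classifier.
loss_individual = sum(F.cross_entropy(o, targets) for o in logits)

# (b) "SSM" (joint): a single loss on the averaged logits -- the scheme
# that gave the best ensemble accuracy in Table 2.
loss_joint = F.cross_entropy(torch.stack(logits).mean(0), targets)
```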

Architecture Dataset FC1 Acc FC2 Acc FC3 Acc FC4 Acc Averaging Acc
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM ImageNet-1K 65.24% 73.24% 75.09% 1.02% 76.68%
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM-individual ImageNet-1K 75.60% 75.11% 76.18% 74.77% 75.38%
Table 2: Results of ImageNet-1K classification according to two different training schemes. SSM is the result of training with the loss given to the averaged output; SSM-individual is the result of training each output independently.

4.2.2 Is SSM a new way of ensemble?

Ensembling is a simple technique to further boost performance by combining multiple models trained from different random initializations. Since the sub-classifiers of SSM may resemble an ensemble, one may worry that SSM already enjoys an ensemble-like effect and therefore would not benefit from further ensembling. We argue that SSM is not simply an ensemble method, and we validate that SSM-augmented models can still benefit from ensembling. We train two ResNet-50 models and two ResNet-50 + SSM models with different initializations and test whether SSM can further benefit from ensembles. The results are summarized in Table 3. The two ResNet-50 + SSM models' accuracies are 76.37% and 76.68%, and the ensembled accuracy is 78.04%, a 1.35% improvement. The gain is slightly smaller than for the ResNet-50 ensemble, which may simply be due to performance saturation; 1.35% is still a significant improvement from ensembling. This experiment thus shows that SSM-augmented models can further benefit from ensembling.
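The two-model ensemble can be sketched as follows; we assume the outputs are combined by averaging softmax probabilities, which the text does not specify, and use random logits as stand-ins for the two trained models:

```python
import torch

# Sketch of the two-model ensemble in Sec. 4.2.2 (stand-in outputs).
torch.manual_seed(0)
out_a = torch.randn(4, 1000)          # model A logits for a batch of 4
out_b = torch.randn(4, 1000)          # model B logits for the same batch

# Assumed combination rule: average the class probabilities.
ensemble = (out_a.softmax(dim=1) + out_b.softmax(dim=1)) / 2
pred = ensemble.argmax(dim=1)         # final ensembled prediction
```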

Architecture Dataset Epoch Top-1 Accs Ensembled Acc
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] ImageNet-1K 90 75.60%, 75.65% 77.25% (+1.60%)
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM ImageNet-1K 90 76.37%, 76.68% 78.04% (+1.35%)
Table 3: Results of ensemble classification in ImageNet-1K. All experiments are conducted in the same environment. We separately train the two models for two times each.

4.2.3 Is the improvement simply due to parameter increases?

Finally, we show that the efficacy of SSM is not simply due to the parameter increase. To verify this, we train two additional models with more parameters by adding parallel classifiers. As shown in Table 4, the base ResNet-50 has 25.55M parameters and ResNet-50 + SSM has 28.63M, so the parameter overhead is 3.08M. One fully-connected layer has 2.05M parameters, so we add one or two parallel fully-connected layers to the baseline ResNet-50. ResNet-50 (2FC) and ResNet-50 (3FC) are comparison methods that, like SSM, add parameters in the classifier part. The results are summarized in Table 4. A simple increase in parameters, as in ResNet-50 (2FC) and (3FC), does not improve performance much, but SSM brings a significant improvement. We therefore argue that the improvement comes not simply from the parameter increase, but from the feature-exploiting characteristics of SSM.
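The parameter accounting can be reproduced with back-of-the-envelope arithmetic (our approximation; it counts only the FC and BatchNorm parameters of the classifier, with biases included):

```python
# Classifier parameter counts for a 2048-channel feature, 1000 classes.
C, K, N = 2048, 1000, 4                 # channels, classes, sub-classifiers
d = C // N                              # 512 channels per split

# One fully-connected layer: weights + biases, ~2.05M parameters.
fc = C * K + K
print(round(fc / 1e6, 2))               # → 2.05

# SSM: FC layers on prefixes of 512..2048 channels, plus 3 BatchNorms
# (a weight and a bias per channel).
ssm_fcs = sum((i + 1) * d * K + K for i in range(N))
ssm_bns = sum(2 * (i + 1) * d for i in range(N - 1))

# SSM replaces the single FC, so the overhead is the difference.
overhead = ssm_fcs + ssm_bns - fc
print(round(overhead / 1e6, 2))         # → 3.08, matching 28.63M - 25.55M
```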

Architecture Dataset Epoch Top-1 Acc Parameters
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] ImageNet-1K 90 75.65% 25.55M
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] + SSM ImageNet-1K 90 76.68% 28.63M
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] (2FC) ImageNet-1K 90 75.91% 27.60M
ResNet-50 [He et al.(2016)He, Zhang, Ren, and Sun] (3FC) ImageNet-1K 90 75.67% 29.65M
Table 4: Results of the parameter-increase experiment on ImageNet-1K. In all experiments, the additional FC layers were added in parallel and their outputs were ensembled by averaging.

5 Qualitative analysis

The key intuition of SSM is to partially share the features among different sub-classifiers. As described in Sec. 3, the first 1/4 of the channels are shared among all sub-classifiers, and the last 1/4 are used only by the last sub-classifier. The first 1/4 of the channels are the most frequently feed-forwarded and back-propagated, and are expected to contribute the most to the final prediction of SSM. In short, our hypothesis is that the degree of sharing is positively correlated with the importance of the feature. Therefore, the first 1/4 of the channels are expected to contain the key features for classifying among the target classes, and the last 1/4 are expected to contain additional features such as contextual information.

Figure 3: Additional Grad-CAM visualization results.

We qualitatively analyze the channels with the Grad-CAM visualization. Fig. 2 and Fig. 3 show input images from the validation set and the overlaid Grad-CAM heatmaps with respect to the ground-truth labels. ResNet-50+SSM is the visualization target. To analyze whether the feature splits have learned differently, the visualizations are generated for each 1/4 split of channels rather than for the full feature. The column ‘Channel 0~511’ denotes the Grad-CAM of the first 1/4 of the channels with respect to the first sub-classifier; the column ‘Channel 512~1023’ denotes the visualization of the second 1/4 w.r.t. the second sub-classifier, and so on. Although the input to the second sub-classifier is the first 1/2 of the channels, we visualize only the second 1/4 to explicitly compare the semantics learned in each 1/4 of the channels.
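The per-group Grad-CAM computation can be sketched in numpy. The shapes and names here are ours; in practice the activations and their gradients would come from the last conv block of ResNet-50+SSM via backpropagation from one sub-classifier's class score:

```python
import numpy as np

# Minimal sketch of Grad-CAM restricted to one channel group (toy data).
rng = np.random.default_rng(0)
A = rng.standard_normal((2048, 7, 7))      # activations of the last conv block
dY = rng.standard_normal((2048, 7, 7))     # d(class score)/dA from backprop

def grad_cam(A, dY, lo, hi):
    """Grad-CAM heatmap using only channels [lo, hi)."""
    alpha = dY[lo:hi].mean(axis=(1, 2))            # per-channel weights
    cam = np.einsum("c,chw->hw", alpha, A[lo:hi])  # weighted sum of maps
    return np.maximum(cam, 0)                      # ReLU

cam_shared = grad_cam(A, dY, 0, 512)       # most-shared group (w.r.t. FC1)
cam_last = grad_cam(A, dY, 1536, 2048)     # least-shared group (w.r.t. FC4)
```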

The samples in Fig. 2 and Fig. 3 support our intuition. The Grad-CAMs for the first sample in Fig. 2 demonstrate that the first 1/4 of the channels focus on the ground-truth ‘guitar’ location, while the last 1/4 focus on the corresponding context, in this case the guitar player. The two intermediate Grad-CAMs gradually change from the key features of the guitar to the context of the player. The Grad-CAMs of the second sample in Fig. 2 likewise show that the first 1/4 of the channels focus on the fish and the last 1/4 on the surrounding river. We show more samples in Fig. 3. In summary, the Grad-CAM visualizations for each channel split show that the most shared channels focus on the target object, and the least shared channels focus on the corresponding context.

6 Discussion

We also measured how much performance could be gained if, for each sample, the best sub-classifier output of a ResNet-50+SSM trained on ImageNet-1K were selected for prediction. Since we have not yet developed an algorithm to select the optimal FC, we selected it as an oracle using the ImageNet-1K label data. If the FC could be chosen ideally, the top-1 accuracy would reach 82.9%, an additional improvement of about 6%. We believe further research in this direction is possible.
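The oracle-selection measurement can be sketched as follows (toy random predictions stand in for the real sub-classifier outputs; a sample counts as correct if any of the 4 heads predicts its label):

```python
import numpy as np

# Sketch of the oracle experiment in Sec. 6 (toy stand-in data).
rng = np.random.default_rng(0)
preds = rng.integers(0, 10, size=(4, 1000))   # 4 heads x 1000 samples
labels = rng.integers(0, 10, size=1000)

# The oracle picks, per sample, any head that is correct.
any_correct = (preds == labels).any(axis=0)
oracle_acc = any_correct.mean()
```

By construction the oracle accuracy is an upper bound on every individual head's accuracy, which is why it exceeds the averaged prediction.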

7 Conclusion

We propose the Split-and-Share Module (SSM), a classifier that improves the performance of CNNs. SSM applies BatchNorm and ReLU to split features extracted from the feature extractor and controls how often each group of channels is forwarded and back-propagated, which has the effect of placing more weight on important features. Through this process, each group of features, trained according to its importance, gets a sub-classifier suited to its capacity, and the multiple outputs are averaged for both training and testing. We validated SSM by applying it to CNNs of various structures on ImageNet-1K and showed significant performance improvements in almost all experiments. We also adopted Grad-CAM for a qualitative analysis of SSM; the results qualitatively show that SSM learns features according to their importance, as intended. In addition, SSM divides features into four groups that learn common or less common features, and we believe this characteristic can be useful in many areas of research.

References

  • [Aflalo et al.(2020)Aflalo, Noy, Lin, Friedman, and Zelnik] Yonathan Aflalo, Asaf Noy, Ming Lin, Itamar Friedman, and Lihi Zelnik. Knapsack pruning with inner distillation. arXiv preprint arXiv:2002.08258, 2020.
  • [Bengio et al.(2013)Bengio, Courville, and Vincent] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • [Chollet(2017)] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • [Cordts et al.(2016)Cordts, Omran, Ramos, Rehfeld, Enzweiler, Benenson, Franke, Roth, and Schiele] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • [He et al.(2015)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [Ilyas et al.(2019)Ilyas, Santurkar, Tsipras, Engstrom, Tran, and Madry] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136, 2019.
  • [Ioffe and Szegedy(2015)] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [Kuehne et al.(2011)Kuehne, Jhuang, Garrote, Poggio, and Serre] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.
  • [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [Nair and Hinton(2010)] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • [Real et al.(2019)Real, Aggarwal, Huang, and Le] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4780–4789, 2019.
  • [Selvaraju et al.(2017)Selvaraju, Cogswell, Das, Vedantam, Parikh, and Batra] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
  • [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [Tan and Le(2019)] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
  • [Xie et al.(2017)Xie, Girshick, Dollár, Tu, and He] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  • [Zoph et al.(2018)Zoph, Vasudevan, Shlens, and Le] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.