Deep neural networks have demonstrated remarkable performance in various fields such as object detection, semantic segmentation, and image classification Ren et al. (2015); Redmon et al. (2016); Long et al. (2015); Ronneberger et al. (2015); He et al. (2015); Tan and Le (2019). The performance of deep neural networks varies depending on the architecture. State-of-the-art results have been obtained by designing large neural networks with increased depth He et al. (2016), width Zagoruyko and Komodakis (2016), and cardinality Xie et al. (2017).
In architecture design, one of the factors often considered is the receptive field Araujo et al. (2019). For example, if two $3\times3$ convolutions are applied, a resulting feature covers a $5\times5$ area. The pixel-level area covered by a specific feature in this way is called the theoretical receptive field. Meanwhile, the effective receptive field was proposed by Luo et al. (2016). In contrast to the square-shaped theoretical receptive field, the effective receptive field uses gradients to reveal the pixels that actually influence the feature, and its shape appears as a 2D Gaussian.
Many studies have preferred enlarging the receptive fields to obtain performance gain Tsai et al. (2018); Fu et al. (2018); Singh and Davis (2018); Kim et al. (2016); Johnson et al. (2016); Shi et al. (2020); Plötz and Roth (2018). Further, Araujo et al. (2019) conjectured the relationship between the size of the receptive field and the classification accuracy. However, this study points out that these common practices should be reconsidered. For modern convolutional neural networks (CNNs), we measured the size of the receptive field. We observed that the large receptive field cannot guarantee the performance superiority of the neural network. For example, some neural networks exhibit high accuracy but have a smaller receptive field. This is because the size of the receptive field reflects only the depth or kernel size and does not reflect the width or cardinality.
In addition to examining the size of the effective receptive field, we investigate its shape. We further obtained the effective receptive field of the final output. Conventionally, for a CNN, every pixel, or at least every pair of adjacent pixels, is expected to contribute almost equally to the final output. In other words, it would be strange if a pixel at a specific location were partially dead, having little effect on the output for any data. Surprisingly, we found that such partially dead pixels exist. In modern CNNs such as ResNet, strong pixels and weak pixels exist for any data, and their contributions to the output differ significantly. We will show that this pixel sensitivity imbalance is significant even for adjacent pixels. It occurs when an operation with an odd-sized kernel is applied with stride 2. A solution to this problem is provided in Section 4.
Is pixel sensitivity imbalance a bug or a feature? We compared the performance of CNNs after reducing the pixel sensitivity imbalance. Interestingly, pixel sensitivity imbalance does not degrade but rather enhances the neural network’s performance. In this respect, pixel sensitivity imbalance is a feature for general vision tasks. However, when pixel sensitivity imbalance exists, it is difficult to capture small perturbations in images. In contrast, when the pixel sensitivity imbalance is reduced, the neural networks easily capture small perturbations in images. In this regard, pixel sensitivity imbalance is a bug for some special tasks.
2 Preliminaries: Receptive Field
In the theoretical receptive field, the largest pixel-level area covered by the target feature is investigated by tracing the operations in the CNN backward. For example, if three $3\times3$ convolutions are applied, one target feature has a $7\times7$ theoretical receptive field. If any of these operations has a stride greater than 1, the target feature will cover a larger area, resulting in a wider theoretical receptive field Araujo et al. (2019).
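The backward tracing described above can be sketched as a short recurrence. This is a minimal illustration, assuming a plain chain of square kernels described only by kernel size and stride; the function name `theoretical_receptive_field` is hypothetical and not taken from the original work.

```python
def theoretical_receptive_field(layers):
    """Return the side length (in pixels) of the theoretical receptive field.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    size = 1  # before any layer, one feature corresponds to one pixel
    jump = 1  # distance in input pixels between two adjacent features
    for kernel_size, stride in layers:
        # each layer widens the field by (kernel_size - 1) steps of the current jump
        size += (kernel_size - 1) * jump
        # a stride greater than 1 spreads later features further apart
        jump *= stride
    return size


print(theoretical_receptive_field([(3, 1), (3, 1)]))          # 5
print(theoretical_receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(theoretical_receptive_field([(3, 2), (3, 1), (3, 1)]))  # 11
```

The last call shows the effect of stride: the same three kernels cover a wider area once the first layer uses stride 2.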
However, as the theoretical receptive field is the theoretical maximum area covered by a target feature, it is far from the practical behavior of the neural network. In the effective receptive field, gradients are used to examine the actual pixels that affect the target feature. Contrary to the theoretical receptive field that appears as a square, the effective receptive field appears as a 2D Gaussian.
Here we provide a detailed formulation of our trick to obtain the effective receptive field. Suppose an image $x \in \mathbb{R}^{C \times H \times W}$ is given. The image is passed through a given CNN, resulting in a target feature map $y \in \mathbb{R}^{C' \times H' \times W'}$. For the effective receptive field, the goal is to represent the spatial relationship between the pixel level $(i, j)$ and the feature level $(i', j')$. Therefore, the channels of the image and the feature map should be ignored and averaged. First, we define
$$\bar{y} = \frac{1}{C'} \sum_{c'=1}^{C'} y_{c', \lfloor H'/2 \rfloor, \lfloor W'/2 \rfloor},$$
which is the feature averaged over the channel $c'$ at the spatial center of the target feature map. Then we compute the gradient with respect to the image, $\partial \bar{y} / \partial x_{c,i,j}$. By averaging the gradient over the channel $c$, we obtain
$$g_{i,j} = \frac{1}{C} \sum_{c=1}^{C} \frac{\partial \bar{y}}{\partial x_{c,i,j}},$$
which represents how pixel $(i, j)$ affects the central feature for the given image. However, the $g_{i,j}$ from a single image is sparse and depends on the image. By averaging over a sufficiently large number of data, the nature of the neural network can be obtained.
However, if some $g_{i,j}$ has a negative value, it cancels out with a positive $g_{i,j}$ from another image. As we want to obtain the accumulation of pixel contributions, we ignore negative importance Selvaraju et al. (2017); Chattopadhay et al. (2018). Thus, we pass $g_{i,j}$ through the ReLU Glorot et al. (2011):
$$E_{i,j} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{ReLU}\left(g_{i,j}^{(n)}\right),$$
where $g^{(n)}$ is computed from the $n$-th of $N$ images. Now, $E_{i,j}$ represents the general contribution property of pixel $(i, j)$ to the target feature, i.e., the effective receptive field. In summary, we first calculate $g_{i,j}$ for each image, pass it through the ReLU, and then average it over a sufficiently large dataset. In a modern deep learning environment that uses mini-batches, applying the ReLU to the gradient of each image separately can be difficult. We recommend using batch size 1 to correctly accumulate $\mathrm{ReLU}(g_{i,j})$ for each image. Accumulating over a sufficiently large amount of data yields a clean, high-quality effective receptive field that describes the internal behavior of a neural network well.
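The accumulation step can be sketched as follows. This is a minimal illustration in plain Python, assuming the per-image gradients of the central feature with respect to the input have already been computed by an automatic differentiation framework and are given as nested lists of shape [C][H][W]; the helper names are hypothetical.

```python
def relu(value):
    return value if value > 0.0 else 0.0


def channel_mean(grad):
    """Average a gradient of shape [C][H][W] over the channel axis -> [H][W]."""
    channels = len(grad)
    height, width = len(grad[0]), len(grad[0][0])
    return [[sum(grad[c][i][j] for c in range(channels)) / channels
             for j in range(width)] for i in range(height)]


def accumulate_erf(per_image_grads):
    """Rectify each image's channel-averaged gradient, then average over images."""
    count = len(per_image_grads)
    height, width = len(per_image_grads[0][0]), len(per_image_grads[0][0][0])
    erf = [[0.0] * width for _ in range(height)]
    for grad in per_image_grads:  # one image at a time, as with batch size 1
        g = channel_mean(grad)
        for i in range(height):
            for j in range(width):
                erf[i][j] += relu(g[i][j]) / count
    return erf


# Two toy "images" (1 channel, 1x2 pixels) whose gradients at pixel (0, 0)
# have opposite signs and would cancel if averaged without rectification.
grads = [
    [[[1.0, 0.5]]],
    [[[-1.0, 0.5]]],
]
print(accumulate_erf(grads))  # [[0.5, 0.5]]
```

Without the per-image rectification, the first pixel would average to zero even though it strongly affects the feature for both images, which is why the rectification is applied before the gradients of different images are mixed.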
3 Size Test on the Effective Receptive Field
| Model | Valid Acc | Test Acc | TRF |
Here, we investigate the size of the effective receptive field of modern CNNs. Target CNNs are ResNet and its variants He et al. (2016); Zagoruyko and Komodakis (2016); Xie et al. (2017), which are widely used in various vision tasks. We used torchvision.models Paszke et al. (2019, 2017) that were pre-trained on ImageNet Russakovsky et al. (2015). For each model, we summarized the reported top-1 and top-5 accuracy. We also computed the size of the theoretical receptive field for each model. Note that since ResNet differs in detailed architecture for each implementation, the sizes of the theoretical receptive field for torchvision.models are different from those of the TensorFlow models Araujo et al. (2019); Silberman and Guadarrama (2016).
For each pre-trained model, we obtained an effective receptive field using the test dataset from the CUB-200-2011 dataset Wah et al. (2011). Here, we set the target feature map as the last feature map of layer-4. For the effective receptive field using layer-3 as the target feature map, refer to the supplementary material. The obtained effective receptive field was fitted with a 2D Gaussian using the Lmfit library Newville et al. (2016). The resulting standard deviations $\sigma_x$ and $\sigma_y$ of the fitted Gaussian indicate how large the effective receptive field is. These results are summarized in Table 1. Our major observations are summarized as follows.
Observation 1. The size of the theoretical receptive field does not describe the classification accuracy.
We observed that the classification accuracy of CNNs is not proportional to the size of the theoretical receptive field. For example, Wide-ResNet-50-2 and ResNet-152 show similar classification accuracy, but the size of their theoretical receptive field is 427 and 1451 pixels, respectively. This is because the theoretical receptive field reflects only depth or kernel size and cannot reflect width or cardinality. On the other hand, ResNet-34 has a large theoretical receptive field of 899 pixels because it uses early convolution with stride 2 in residual blocks, unlike ResNet-50. As such, ResNet-34 has a wider theoretical receptive field than ResNet-50, but its classification accuracy is lower. These observations are inconsistent with the conjecture Araujo et al. (2019) that the classification accuracy tends to be proportional to the size of the theoretical receptive field.
Observation 2. The size of the effective receptive field does not describe the classification accuracy.
We observed that the classification accuracy of CNNs is also not proportional to the size of the effective receptive field. In other words, even from the viewpoint of the effective receptive field, a large receptive field does not guarantee superior performance. Meanwhile, when the depth increases within these ResNets, the size of the effective receptive field does not increase further and saturates at a certain size. These results differ from the study of Luo et al. (2016), which reported that the size of the effective receptive field tends to be proportional to the square root of the network depth.
The same experiment was performed once more. First, each pre-trained ResNet was fine-tuned on the Caltech-101 dataset Fei-Fei et al. (2004). We replaced the last fully connected layer to output 101 classes. For training, we used stochastic gradient descent with momentum 0.9 Sutskever et al. (2013), learning rate 0.01, weight decay 0.0005, batch size 64, 200 epochs, and a cosine annealing schedule with 200 iterations Loshchilov and Hutter (2016). For data augmentation, random resized crop with size 256, random rotation with degree 15, color jitter, random horizontal flip, center crop with size 224, and mean-std normalization were applied. The data were split into train/val/test sets at a ratio of 70:15:15. Within 200 epochs, the model with the best validation accuracy was selected and evaluated.
For each fine-tuned model, an effective receptive field was obtained using the test dataset, and its size was investigated (Table 2). Similarly, the size of the theoretical receptive field and the effective receptive field do not agree with the trends in classification accuracy. Therefore, we conclude that the size of the receptive field is not a representative indicator of classification accuracy, nor architectural superiority.
4 Shape Test on the Effective Receptive Field
In the previous section, we fitted each effective receptive field to a 2D Gaussian. Although the coefficient of determination $R^2$ of the fit was near 0.9, those effective receptive fields did not perfectly match the 2D Gaussian. To understand this behavior, we visualize the obtained effective receptive field.
For the ResNeXt-101-32x8d, we plotted the effective receptive field (Figure 1). Although the effective receptive field appears as 2D Gaussian, a checkboard pattern exists inside. Therefore, the effective receptive field imperfectly matched the 2D Gaussian because of the internal checkboard pattern.
Additionally, we accumulated the rectified gradient of the final output with respect to the input pixels to obtain an effective receptive field of the output. This is what we call the dead pixel test. In general, all pixels are expected to contribute almost equally to the output. However, even for the effective receptive field of the output, we discovered that the checkboard pattern exists.
Figure 1: (Left) Effective receptive field of the target feature map. (Right) Effective receptive field of the output. The top row shows the effective receptive fields of the existing ResNeXt-101-32x8d before kernel padding, which exhibit the checkboard pattern. The bottom row shows the effective receptive fields after kernel padding, which exhibit no checkboard pattern. Best viewed electronically with zoom.
The existence of this checkboard pattern implies that modern CNNs recognize images in a highly counterintuitive way. Some pixels are weak, partially dead, and hardly contribute to the output. Conversely, some pixels are strong and more sensitive to output. We call this phenomenon pixel sensitivity imbalance. As the checkboard pattern appears locally, even in adjacent pixels, the pixel sensitivity differs significantly.
Why does the checkboard pattern appear? We found that it occurs when an odd-sized kernel is applied with stride 2 (Figure 2). For example, when a $3\times3$ convolution is applied with stride 2, overlapping regions appear. Pixels within the overlapping regions are referenced more often by the operation than other pixels. As this phenomenon accumulates over layers, some pixels become more influential while others do not. When viewed in 2D, a checkboard pattern appears. This phenomenon is highly similar to the checkboard pattern that arises when using deconvolution in image generation tasks Odena et al. (2016). Extending this, we emphasize that the checkboard pattern exists from the perspective of the gradient even when using convolution.
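The overlap argument can be checked with a small one-dimensional count. This is an illustrative sketch, assuming a sliding window without padding; `reference_counts` is a hypothetical helper that counts how many output positions read each input position.

```python
def reference_counts(length, kernel_size, stride):
    """Count how many sliding windows (no padding) read each input position."""
    counts = [0] * length
    start = 0
    while start + kernel_size <= length:
        for offset in range(kernel_size):
            counts[start + offset] += 1
        start += stride
    return counts


# Odd kernel, stride 2: interior positions alternate between 2 and 1 references.
print(reference_counts(11, 3, 2))  # [1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1]
# Even kernel, stride 2: interior positions are referenced uniformly.
print(reference_counts(12, 4, 2))  # [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
# Kernel size 1, stride 2: every other position is never read at all.
print(reference_counts(9, 1, 2))   # [1, 0, 1, 0, 1, 0, 1, 0, 1]
```

Away from the borders, the odd-sized kernels produce an alternating count, while the even-sized kernel references every position equally often. Applying the same count along both image axes gives the two-dimensional checkboard pattern.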
Despite these potential problems, odd-sized kernels with stride 2 are widely used in modern CNNs Huang et al. (2017); Szegedy et al. (2015); Iandola et al. (2016); Krizhevsky et al. (2012); Ma et al. (2018); Sandler et al. (2018). For example, in the early stage of ResNets, a $7\times7$ Conv with stride 2 and a $3\times3$ Pool with stride 2 are used. Further, in the downsampling operation in the residual block, a $1\times1$ Conv is used with stride 2, which subsamples only specific input locations and shuts off the flow from the others.
Here, we would like to modify those problematic odd-sized kernels with stride 2. As ResNet and its variants have similar architectures, most can be modified with similar rules. Not all layers need to be modified. The operations to be modified are as follows: the $7\times7$ Conv with stride 2 and the $3\times3$ Pool with stride 2 in the early stage, and the $3\times3$ Conv with stride 2 and the $1\times1$ Conv with stride 2 across all residual blocks. We replace those kernels with even-sized ones such as $4\times4$ or $2\times2$.
However, when a kernel is replaced with a new one, the existing pre-trained weights are discarded. To construct an even-sized kernel while still benefiting from pre-trained weights, we propose a kernel padding method. For the target odd-sized pre-trained weight, zero-padding is applied to the bottom and right sides to obtain an even-sized kernel (Figure 3). As the kernel is zero-padded, the operation is equivalent to the previous one. Accordingly, the pre-trained weights can be reused. Moreover, as the new zero-padded weights are trainable, they are updated together with the existing weights during fine-tuning.
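The equivalence claimed for the zero-padded kernel can be verified on a toy example. This is a sketch in plain Python, not the original implementation; `conv2d` and `pad_bottom_right` are hypothetical helpers, and extending the toy input by one zero row and column so that both kernels produce the same number of outputs is an assumption made here for illustration.

```python
def conv2d(image, kernel, stride):
    """Valid (no implicit padding) 2D cross-correlation on square nested lists."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for oi in range(out_size):
        row = []
        for oj in range(out_size):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += kernel[a][b] * image[oi * stride + a][oj * stride + b]
            row.append(acc)
        out.append(row)
    return out


def pad_bottom_right(matrix, value=0.0):
    """Append one column on the right and one row at the bottom."""
    return [row + [value] for row in matrix] + [[value] * (len(matrix[0]) + 1)]


image = [[float(7 * i + j) for j in range(7)] for i in range(7)]
kernel3 = [[1.0, 2.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.5, 2.0]]  # stands in for pre-trained weights

kernel4 = pad_bottom_right(kernel3)  # kernel padding: 3x3 -> 4x4
image8 = pad_bottom_right(image)     # assumed input extension so the windows fit

out_before = conv2d(image, kernel3, stride=2)
out_after = conv2d(image8, kernel4, stride=2)
print(out_before == out_after)  # True
```

Immediately after padding, the added weights are zero, so the extra row and column of each window contribute nothing. Once fine-tuning updates them, the kernel behaves as a genuine even-sized kernel.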
We applied kernel padding to the ResNets pre-trained on ImageNet and then fine-tuned them on the Caltech-101 dataset. The training details used in fine-tuning are the same as in the experiments in Section 3. The effective receptive field of the resulting architecture has no checkboard pattern (Figure 1).
The degree of the pixel sensitivity imbalance can be measured through the smoothness of the effective receptive field of the output. Here, we define two indices, a first-order imbalance index and a second-order imbalance index. To compute them, we pass the effective receptive field through first-order and second-order difference filters, respectively, and take the spatial average, which evaluates its local variation and curvature. The smaller these values are, the more locally smooth the effective receptive field is. Conversely, the larger the value, the greater the imbalance.
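One plausible reading of these two indices is sketched below. The exact filters and normalization are not recoverable from the text, so the choices here (horizontal and vertical finite differences, absolute values, a plain mean) are assumptions, and the function names are hypothetical.

```python
def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)


def first_order_index(erf):
    """Mean absolute first difference between horizontally and vertically adjacent pixels."""
    h, w = len(erf), len(erf[0])
    horizontal = [erf[i][j + 1] - erf[i][j] for i in range(h) for j in range(w - 1)]
    vertical = [erf[i + 1][j] - erf[i][j] for i in range(h - 1) for j in range(w)]
    return mean_abs(horizontal + vertical)


def second_order_index(erf):
    """Mean absolute second difference (discrete curvature) along both axes."""
    h, w = len(erf), len(erf[0])
    horizontal = [erf[i][j + 1] - 2 * erf[i][j] + erf[i][j - 1]
                  for i in range(h) for j in range(1, w - 1)]
    vertical = [erf[i + 1][j] - 2 * erf[i][j] + erf[i - 1][j]
                for i in range(1, h - 1) for j in range(w)]
    return mean_abs(horizontal + vertical)


checkboard = [[1.0 if (i + j) % 2 == 0 else 0.0 for j in range(6)] for i in range(6)]
smooth = [[0.5] * 6 for _ in range(6)]

print(first_order_index(checkboard), second_order_index(checkboard))  # 1.0 2.0
print(first_order_index(smooth), second_order_index(smooth))          # 0.0 0.0
```

Under these assumptions, a map with a checkboard pattern scores high on both indices, while a locally smooth map scores near zero.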
Using these two indicators, we evaluated the degree of pixel sensitivity imbalance before and after applying kernel padding (Figure 4). Existing ResNets show large first-order and second-order imbalance indices, which indicates that pixel sensitivity imbalance is significant even in adjacent pixels. After applying kernel padding, the imbalance decreased across all ResNets.
Is pixel sensitivity imbalance a bug or a feature? In other words, if the pixel sensitivity imbalance is reduced, can the superiority of the architecture be guaranteed? For both perspectives, we provide some conjectures.
Pros: Pixel Sensitivity Imbalance is a Feature.
Even with pixel sensitivity imbalance, ResNets have been widely used in various vision tasks so far. Although some pixels are partially dead, they are not entirely dead. The difference between strong and weak pixels is a matter of contribution degree, and they are all involved in the output.
This behavior is analogous to Dropout Srivastava et al. (2014) and DropBlock Ghiasi et al. (2018), which improve the performance of neural networks by dropping some neurons or inputs. For understanding the global context of an image, it is fine if some trivial input is missing. Furthermore, dropping some input induces the CNN to understand the image in a different way, introducing a regularization effect.
Further, pixel sensitivity imbalance can be interpreted as rescaling a given image according to strong and weak pixels. As the image is rescaled pixel-wise, when a translated image is given, it is recognized as a completely different image. Accordingly, pixel sensitivity imbalance increases image diversity, thereby boosting the effect of data augmentation.
Here, we examined how pixel sensitivity imbalances affect the performance in a general vision task. As kernel padding reduces pixel sensitivity imbalance, we compared the performance of ResNet and its variants before and after applying kernel padding. We performed fine-tuning on the Caltech-101 dataset, and the experimental details such as the training method and data augmentation are the same as in Section 3.
| Model | Before KP | After KP | Diff |
For each model, we measured the average test accuracy over three experiments (Table 3). We observed that the performance decreased rather than improved after kernel padding. This means that, for a general image classification task, pixel sensitivity imbalance is not a bug but a feature that improves performance. Therefore, reducing pixel sensitivity imbalance does not guarantee architectural superiority.
Cons: Pixel Sensitivity Imbalance is a Bug.
Nevertheless, pixel sensitivity imbalance gives rise to several potential problems.
First, consider the saliency methods that visualize the inner behavior of a neural network. Many saliency methods have investigated important pixels based on gradients Simonyan et al. (2013); Springenberg et al. (2015); Smilkov et al. (2017); Sundararajan et al. (2017); Shrikumar et al. (2017). However, saliency methods do not reflect pixel sensitivity imbalance. In other words, the gradient-based saliency map is affected by the checkboard pattern of the neural network. Thus, the gradient-based saliency method is only suitable for examining the pixels that contribute to the output of the neural network and is unsuitable for evaluating the intrinsic importance of a pixel.
As mentioned earlier, since pixel sensitivity imbalance introduces pixel-wise rescaling, a translated image is perceived as a completely different image. This increases the data augmentation effect but worsens the translation invariance of the CNN Zhang (2019); Azulay and Weiss (2019); Cohen and Welling (2016). In a practical application, for example, if a 1-pixel translated image produces a different result, the vision system would be considered unreliable and unstable.
Moreover, pixel sensitivity imbalance implies a positional difference for capturing a perturbation. Consider one-pixel attack Su et al. (2019), which attempts an adversarial attack to invert the output by perturbing a certain pixel. Here we can additionally exploit the fact that strong pixels are generally more sensitive. If we construct an attack strategy that focuses more on strong pixels, we can attack the neural network more easily.
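Here, we provide a mathematical formulation. For a network with ReLU-like activations, we can represent the output $o$ for an image $x$ using a piece-wise linear function Srinivas and Fleuret (2018); Simonyan et al. (2013):
$$o(x) = \sum_{i,j} w_{i,j}(x)\, x_{i,j} + b(x),$$
where $w_{i,j}(x) = \partial o / \partial x_{i,j}$ and $b(x)$ are evaluated at the specific image $x$ (the channel index is omitted for brevity). Here, we approximate them using
$$\bar{w}_{i,j} = \frac{1}{N} \sum_{n=1}^{N} w_{i,j}\left(x^{(n)}\right), \qquad \bar{b} = \frac{1}{N} \sum_{n=1}^{N} b\left(x^{(n)}\right),$$
which results in a fixed linear model, obtained from the mean over images. We define $\hat{o}$,
$$\hat{o}(x) = \sum_{i,j} \bar{w}_{i,j}\, x_{i,j} + \bar{b},$$
which is the output from the fixed linear model.
Now, assume that we put a perturbation $\delta$ on pixel $x_{i,j}$. Then,
$$\Delta \hat{o} = \bar{w}_{i,j}\, \delta. \tag{12}$$
If pixel sensitivity imbalance exists, $\bar{w}_{i,j}$ differs depending on the position $(i, j)$. Thus, even if the same amount of $\delta$ is applied, $\Delta \hat{o}$ varies depending on where the perturbation is applied. For example, if we put the perturbation on a strong pixel, the output can be significantly affected.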
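The dependence of the output change on the perturbed position can be illustrated with a toy fixed linear model. This is a sketch under stated assumptions: the 2x2 mean weights below are made up to mimic strong and weak pixels and are not values from the experiments, and the helper names are hypothetical.

```python
def linear_output(weights, bias, image):
    """Output of a fixed linear model: bias plus the weighted sum of pixels."""
    return bias + sum(w * x
                      for w_row, x_row in zip(weights, image)
                      for w, x in zip(w_row, x_row))


def perturb(image, i, j, delta):
    """Return a copy of the image with `delta` added to pixel (i, j)."""
    out = [row[:] for row in image]
    out[i][j] += delta
    return out


mean_weights = [[2.0, 0.5], [0.5, 2.0]]  # hypothetical: strong pixels on the diagonal
mean_bias = 0.25
image = [[1.0, 1.0], [1.0, 1.0]]
delta = 0.5

base = linear_output(mean_weights, mean_bias, image)
change_strong = linear_output(mean_weights, mean_bias, perturb(image, 0, 0, delta)) - base
change_weak = linear_output(mean_weights, mean_bias, perturb(image, 0, 1, delta)) - base
print(change_strong, change_weak)  # 1.0 0.25
```

The same perturbation changes the output four times as much on the strong pixel as on the weak one, which is exactly the product of the mean weight at that position and the perturbation.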
In addition, Eq. 12 implies that when pixel sensitivity imbalance exists, it may be difficult to distinguish whether the change in output is due to the magnitude of the perturbation or the position of the perturbation. Then, if perturbations with different magnitudes are applied at random locations, can CNN distinguish the magnitudes of perturbations? Further, if the perturbation magnitude and the position are also randomly varied every time, and only the average of the perturbation magnitude has a difference, it will be quite a challenging problem. However, these problems are commonly encountered in practical vision tasks.
Here, we propose a micro-object classification task. The templates are images from the Caltech-101 dataset. First, we select a random region within the template. After changing the color of the selected area to RGB=(0, 0, 0), we label the image as class A. In the same way, for class B, select a random area, but replace it with RGB=(255, 0, 0). As such, we put a micro-object at a random location in the image to perform a binary classification task. In this task, not only the position of the perturbation but also the magnitude changes every time. Here, when pixel sensitivity imbalance exists, it may be difficult for CNN to capture the difference in perturbation magnitude between the two classes. In contrast, suppose the pixel sensitivity imbalance is reduced by kernel padding. In that case, as the influence of the random position decreases, the change in the magnitude of perturbation can be more easily captured.
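The construction of the two classes can be sketched as follows. This is an illustrative version in plain Python, assuming images are nested lists of RGB tuples; the region size is not recoverable from the text, so the `size` parameter, the gray toy template, and the function name `add_micro_object` are assumptions.

```python
import random

CLASS_COLORS = {"A": (0, 0, 0), "B": (255, 0, 0)}  # colors stated in the text


def add_micro_object(template, color, size, rng):
    """Recolor a random size x size region of a copy of `template`.

    `template` is an [H][W] nested list of (R, G, B) tuples.
    Returns the new image and the top-left corner of the region.
    """
    height, width = len(template), len(template[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    image = [row[:] for row in template]  # leave the template untouched
    for i in range(top, top + size):
        for j in range(left, left + size):
            image[i][j] = color
    return image, (top, left)


rng = random.Random(0)
template = [[(128, 128, 128)] * 8 for _ in range(8)]  # toy stand-in for a Caltech-101 image

image_a, position_a = add_micro_object(template, CLASS_COLORS["A"], size=2, rng=rng)
image_b, position_b = add_micro_object(template, CLASS_COLORS["B"], size=2, rng=rng)
```

Because the region is drawn independently for every sample, both the position of the micro-object and the size of the change relative to the underlying template pixels vary from image to image, while only the replacement color separates the two classes.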
Experimental details such as the training method and data augmentation are almost the same as in Section 3. Here, to better see the intrinsic architectural differences, we did not use pre-trained weights. The number of epochs was set to 50. The observed training curve is shown in Figure 5. Initially, the test accuracy stayed near chance level for this binary task, and the difference in perturbation was not captured. After a certain epoch, the test accuracy increased rapidly, and the difference in the micro-objects was captured with high accuracy. Here, the existing model without kernel padding required more epochs to capture the perturbation difference. In contrast, the model to which kernel padding is applied captured the perturbation difference faster.
To verify this more strictly, we measured the number of epochs at which the test accuracy first exceeded a fixed threshold. If the threshold was not exceeded within 50 epochs, the run was counted as 50 epochs. For ResNet-101 and its variants, five experiments were performed, and the average of the measured number of epochs is summarized in Table 4. Even in the same training environment, after kernel padding, the difference in the micro-objects was captured 13-18 epochs faster. Thus, for some special tasks, pixel sensitivity imbalance is harmful to training.
| Model | Before KP | After KP | Diff |
In this study, we investigated the behaviors of CNNs using effective receptive fields. First, we investigated the size of the receptive field. Contrary to popular belief, we found that the classification accuracy is not proportional to the size of the receptive field. In addition, we observed that the size of the effective receptive field saturates to a certain level even if the CNN becomes deeper. These observations suggest that we need to reconsider when controlling the size of the receptive field. Second, the pixels contributing to the output were investigated through the effective receptive field of the output. We discovered that in modern ResNets, the contribution to the output is different for each pixel. It was identified that the cause of this pixel sensitivity imbalance lies in the use of an odd-sized kernel with stride 2. To solve this, kernel padding was proposed. We quantitatively evaluated pixel sensitivity imbalance through two indices and found that pixel sensitivity imbalance decreases after kernel padding. We discussed that although the pixel sensitivity imbalance is a helpful feature for general vision tasks, it is a harmful bug for some tasks. These behaviors of CNNs should be understood and considered by practitioners.
- Computing receptive fields of convolutional neural networks. Distill 4 (11), pp. e21. Cited by: §1, §1, §2, §3, §3.
- Why do deep convolutional networks generalize so poorly to small image transformations? Journal of Machine Learning Research 20, pp. 1–25. Cited by: §5.
- Grad-cam++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE winter conference on applications of computer vision (WACV), pp. 839–847. Cited by: §2.
- Group equivariant convolutional networks. In International conference on machine learning, pp. 2990–2999. Cited by: §5.
- Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pp. 178–178. Cited by: §3.
- Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2002–2011. Cited by: §1.
- DropBlock: a regularization method for convolutional networks. Advances in Neural Information Processing Systems 31, pp. 10727–10737. Cited by: §5.
- Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 315–323. Cited by: §2.
- Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §1.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §3.
- Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.
- SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360. Cited by: §4.
- Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pp. 694–711. Cited by: §1.
- Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §1.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §4.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
- Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §3.
- Understanding the effective receptive field in deep convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 4905–4913. Cited by: §1, §3.
- Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: §4.
- LMFIT: non-linear least-square minimization and curve-fitting for python. Astrophysics Source Code Library, pp. ascl–1606. Cited by: §3.
- Deconvolution and checkerboard artifacts. Distill 1 (10), pp. e3. Cited by: §4.
- Automatic differentiation in pytorch. Cited by: §3.
- Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: §3.
- Neural nearest neighbors networks. Advances in Neural Information Processing Systems 31, pp. 1087–1098. Cited by: §1.
- You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
- Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28, pp. 91–99. Cited by: §1.
- U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1.
- Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §3.
- Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §4.
- Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §2.
- Pv-rcnn: point-voxel feature set abstraction for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10529–10538. Cited by: §1.
- Learning important features through propagating activation differences. In International Conference on Machine Learning, pp. 3145–3153. Cited by: §5.
- TensorFlow-slim image classification model library. URL https://github.com/tensorflow/models/tree/master/research/slim. Cited by: §3.
- Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034. Cited by: §5, §5.
- An analysis of scale invariance in object detection snip. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3578–3587. Cited by: §1.
- Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §5.
- Striving for simplicity: the all convolutional net. In ICLR (workshop track), Cited by: §5.
- Knowledge transfer with jacobian matching. In International Conference on Machine Learning, pp. 4723–4731. Cited by: §5.
- Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §5.
- One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23 (5), pp. 828–841. Cited by: §5.
- Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328. Cited by: §5.
- On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. Cited by: §3.
- Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §4.
- Efficientnet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114. Cited by: §1.
- Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7472–7481. Cited by: §1.
- The caltech-ucsd birds-200-2011 dataset. Cited by: §3.
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §1, §3.
- Wide residual networks. In British Machine Vision Conference 2016, Cited by: §1, §3.
- Making convolutional networks shift-invariant again. In International conference on machine learning, pp. 7324–7334. Cited by: §5.