Top-Down Networks: A coarse-to-fine reimagination of CNNs

by   Ioannis Lelekas, et al.
Delft University of Technology

Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given to relevant stimuli. In contrast, CNNs employ fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside-down: we propose top-down networks. Our proposed coarse-to-fine pathway, by blurring higher frequency information and restoring it only at later stages, offers a line of defence against adversarial attacks that introduce high frequency noise. Moreover, since we increase image resolution with depth, the high resolution of the feature map in the final convolutional layer contributes to the explainability of the network's decision making process. This favors object-driven decisions over context-driven ones, and thus provides better localized class activation maps. This paper offers empirical evidence for the applicability of the top-down resolution processing to various existing architectures on multiple visual tasks.



1 Introduction

In human biological vision, perceptual grouping of visual features is based on Gestalt principles, where factors such as proximity, similarity or good continuation of features generate a salient percept [42]. Salient objects are rapidly and robustly detected and segregated from the background in what is termed the “pop-out” effect [7, 22]. This initial detection and grouping of salient features into a coherent percept, leads to preferential processing by the visual system, described as stimulus-driven attention [52]. For relevant visual stimuli, the exogenously directed attention is sustained, and results in a more detailed visual evaluation of the object. This typical pipeline of perception and attention allocation in biological vision represents an efficient, coarse-to-fine processing of information [14]. In contrast, modern CNNs (Convolutional Neural Networks) do not incorporate this perspective [12, 23, 38, 40].

Figure 1: A coarse-to-fine versus fine-to-coarse processing pathway. The conventional fine-to-coarse pathway in a CNN sacrifices localization for semantically richer information. The opposite path, proposed in this paper, starts from the coarsest input and focuses on the context: given the sky, grass and building, it is clearly a landscape scene of a building. Moving to finer representations of the input, the focus shifts to local information. Architectural aspects of the building, and the cross on the top, are now the most informative for classifying the image as a church. Our proposed coarse-to-fine pathway is in line with human biological vision, where detection of global features precedes the detection of local ones, for which further processing of the stimuli is required.

Standard CNNs begin with the high resolution input, and propagate information in a fine-to-coarse pathway. Early layers learn to extract local, shareable features, whereas deeper layers learn semantically rich and increasingly invariant representations. In this paper we propose the reversal of the conventional feature extraction of standard CNNs, as depicted in Figure 1. More specifically, we suggest the adoption of a coarse-to-fine processing of the input, which can be interpreted as gradual focusing of visual attention. The top-down hierarchy first extracts the gist of a scene, starting from a holistic initial representation, and subsequently enhances it with higher frequency information.

A growing body of literature since the seminal work of [10, 41] shows that adversarial perturbations with high-frequency components may cause substantial misclassifications. Suppressing higher frequencies in the input image, as proposed in our top-down paradigm, can provide a first line of defence. At the same time, explainability of the decision making process of CNNs has recently emerged as an important research direction [36, 54]. In this context, our coarse-to-fine processing scheme, having feature maps with higher spatial resolution at deeper layers, favors object-driven decisions over context-driven ones, and provides better localized class activation maps.

We make the following contributions: (i) we propose biologically inspired top-down network architectures, obtained by reversing the resolution processing of conventional bottom-up CNNs; (ii) we analyze various methods of building top-down networks based on bottom-up counterparts, as well as the difference in resolution processing between these models, providing a versatile framework that is directly applicable to existing architectures; (iii) we compare our proposed model against the baseline on a range of adversarial attacks and demonstrate enhanced robustness against certain types of attacks; (iv) we find enhanced explainability for our top-down model, with potential for object localization tasks. Trained models and source code for our experiments are available online:

2 Related work

Coarse-to-fine processing. Coarse-to-fine processing is an integral part of efficient algorithms in computer vision. Iterative image registration [30] gradually refines registration from coarser variants of the original images, while in [16] a coarse-to-fine optical flow estimation method is proposed. Coarse-to-fine face detection is performed by processing increasingly larger edge arrangements in [8], and coarse-to-fine face alignment using stacked auto-encoders is introduced in [50]. Efficient action recognition is achieved in [44] by using coarse and fine features coming from two LSTM (Long Short-Term Memory) modules. In [34] coarse-to-fine kernel networks are proposed, where a cascade of kernel networks of increasing complexity is used. Existing coarse-to-fine methods consider both coarse input resolution and gradually refined processing. Here, we also focus on coarse-to-fine image resolution; however, we are the first to do this in a single deep neural network, trained end-to-end, rather than in an ensemble.

Bottom-up and top-down pathways. Many approaches exploit high spatial resolution for finer feature localization, which is crucial for semantic segmentation. The U-net [33] and FPN (Feature Pyramid Networks) [29] merge information from bottom-up and top-down pathways, combining semantically rich information of the bottom-up with the fine localization of the top-down stream. Similarly, combinations of a high-resolution and a low-resolution branch were proposed for efficient action recognition [5], for face hallucination [25], and depth map prediction [3]. Top-down signals are also used to model neural attention via a backpropagation algorithm [49], and to extract informative localization maps for classification tasks in Grad-CAM [36]. Similarly, we also focus on top-down pathways where we slowly integrate higher levels of detail, however our goal is biologically-inspired resolution processing, rather than feature-map activation analysis.

Multi-scale networks. Merging and modulating information extracted from multiple scales is vastly popular [15, 21, 47, 48, 46]. In [48] feature maps are resized by a factor to obtain cascades of multiple resolutions. Incremental resolution changes during GAN (Generative Adversarial Network) training are proposed in [20]. Convolutional weight sharing over multiple scales is proposed in [1, 47]. Similarly, [6] performs convolutions over multiple scales in combination with residual connections. In [21] convolutions are performed over a grid of scales, thus combining information from multiple scales in one response, and [39] combines responses over multiple scales, where filters are defined using 2D Hermite polynomials with a Gaussian envelope. Spatial pyramid pooling is proposed in [11] for aggregating information at multiple scales. In this work, we also extract multi-resolution feature maps, in order to start processing from the lowest image scale and gradually restore high frequency information at deeper layers.

Figure 2: Left: The bottom-up (BU) baseline network. Feature maps decrease in spatial resolution with network depth. Right: The proposed top-down (TD) network. The model reverses the feature extraction pathway of the baseline network. It employs three inputs from highest to lowest scale, starts processing from the lowest resolution and progressively adds high resolution information.

Beneficial effects of blurring. Suppressing high frequency information by blurring the input can lead to enhanced robustness [43, 53]. Models trained on blurred inputs exhibit increased robustness to distributional shift [19]. The work in [9] reveals the bias of CNNs towards texture, and analyzes the effect of blurring distortions on the proposed Stylized-ImageNet dataset. Anti-aliasing by blurring before downsampling contributes to preserving shift invariance in CNNs [51]. By using Gaussian kernels with learnable variance, [37] adapts the receptive field size. Rather than changing the receptive field size, works such as [27, 26, 31] use spatial smoothing for improved resistance to adversarial attacks. Similarly, we also rely on Gaussian blurring before downsampling the feature maps to avoid aliasing effects, and as a consequence we observe improved robustness to adversarial attacks.

3 Top-down networks

Top-down (TD) networks mirror the baseline bottom-up (BU) networks, and reverse their feature extraction pathway. Information flows in the opposite direction, moving from lower to higher resolution feature maps. The initial input of the network corresponds to the minimum spatial resolution occurring in the baseline BU network. Downscaling operations are replaced by upscaling, leading to the coarse-to-fine information flow. By upscaling alone, the network can merely “hallucinate” higher resolution features. To restore the high frequency information, we use resolution merges, which combine the hallucinated features with higher frequency inputs after each upscaling operation. Figure 2 depicts the difference between the BU architecture and our proposed TD architecture.

3.1 Input and feature map resizing

To avoid artifacts hampering the performance of the network [51], we blur the inputs before downsampling. For the upsampling operation we use interpolation followed by convolution. We have experimented with both nearest neighbor and bilinear interpolation, and have noticed improved robustness against adversarial attacks for nearest neighbor interpolation. We have also considered the use of transpose convolutions; however, we did not adopt these due to detrimental checkerboard artifacts.
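The two resizing operations described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the Gaussian width `sigma`, the 3-sigma kernel radius, the border clamping and the factor of 2 are all assumptions made for illustration, and the convolution that follows interpolation in the paper is omitted.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def blur_downsample(image, factor=2, sigma=1.0):
    """Separable Gaussian blur followed by subsampling (anti-aliased downscale)."""
    radius = max(1, int(math.ceil(3 * sigma)))  # assumed 3-sigma support
    kernel = gaussian_kernel_1d(sigma, radius)
    blurred = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [row[::factor] for row in blurred[::factor]]

def upsample_nearest(image, factor=2):
    """Nearest neighbor interpolation: repeat each pixel factor x factor times."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out
```

In this sketch a single-pixel spike is strongly attenuated by `blur_downsample`, which is the intuition behind downscaled inputs diminishing high frequency noise.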

3.2 Merging low and high resolution

Figure 3 depicts the considered method for merging the high resolution input with the low resolution information. We first upsample the low resolution input via a convolution and use an element-wise addition with the high-resolution branch. This information is then concatenated with the original high resolution information on the channel dimension. We subsequently use a convolution to expand the receptive field of the filters. The proposed merging of information slightly increases the number of parameters, while being effective in practice.

Figure 3: Merging low and high-frequency feature maps: we use a convolution followed by an element-wise addition; this information is concatenated with the high-resolution input and followed by a convolution that expands the receptive field size.
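As a rough illustration of the merge described in this subsection, the sketch below operates on feature maps stored as nested lists of shape [channels][height][width]. It assumes nearest neighbor upsampling by a factor of 2 and a 1x1 (pointwise) convolution to match channel counts; both choices, and all function names, are assumptions of this sketch rather than details confirmed by the text. The final convolution that expands the receptive field is left out for brevity.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest neighbor upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend([list(wide) for _ in range(factor)])
        out.append(rows)
    return out

def pointwise_conv(fmap, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[i][y][x] for i, w in enumerate(w_out))
          for x in range(width)]
         for y in range(height)]
        for w_out in weights
    ]

def add(a, b):
    """Element-wise addition of two feature maps with identical shapes."""
    return [[[va + vb for va, vb in zip(ra, rb)]
             for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def merge(low, high, weights):
    """Upsample + convolve the low-res map, add the high-res map, then concatenate."""
    up = pointwise_conv(upsample_nearest(low), weights)
    summed = add(up, high)
    return summed + high  # list concatenation acts on the channel dimension
```

The output has twice as many channels as the high resolution input, which is one reason the merged networks carry additional parameters in the convolution that follows.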

ERF (effective receptive field) size computation. Neurons in each layer of a typical bottom-up network have a single ERF size, determined by the kernel size and the cumulative stride (given the stride at each layer).

Assuming only 3×3 convolutions with stride 1, the example BU architecture in Figure 2 will have an ERF size of 3 pixels, and 18 pixels in each direction after the first and final convolutional layers, respectively. In contrast, for the TD network, considering a Gaussian blurring window of width , the lowest resolution branch will already have an ERF size of at the input level and of after the first convolutional layer (comparable to the final layer of a BU network already with pixels). Furthermore, in contrast to BU, outputs from neurons with varying ERFs are propagated through the merging points. To get a lower bound on the ERF sizes, we consider resolution merging methods which do not provide RF enlargement (e.g. as depicted in Figure 3, but without the convolution at the end). Thus, at the final merging point of the TD architecture, ERF sizes of 3 pixels and pixels are merged together. In conclusion, already from the first layer, TD has the ERF size that the BU only obtains at the last layer.
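The dependence of receptive field size on kernel size and cumulative stride can be made concrete with the standard recurrence for stacked layers: each layer adds (kernel - 1) times the product of all preceding strides. The sketch below uses that textbook recurrence, which is assumed, not confirmed, to match the formula the paper intends. The layer lists are hypothetical and do not reproduce the architecture in Figure 2.

```python
def receptive_field_sizes(layers):
    """Receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size, jump = 1, 1  # jump = cumulative stride of all preceding layers
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump
        jump *= stride
        sizes.append(size)
    return sizes

# Three 3x3 convolutions with stride 1: the first layer sees 3 pixels.
print(receptive_field_sizes([(3, 1), (3, 1), (3, 1)]))

# A blur window treated as an extra layer (hypothetical width 5, subsampling
# by 2) already enlarges the receptive field before the first convolution.
print(receptive_field_sizes([(5, 2), (3, 1)]))
```

Treating the blurring window as a layer with stride equal to the downsampling factor is a modelling choice of this sketch; it shows why a low resolution branch starts with a large receptive field.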

3.3 Filter arrangement

The feature extraction pathway of the TD network reverses that of the BU: information propagates from lower to higher spatial dimensions in a TD network, while the number of filters shrinks with increasing depth. The choice of expanding the number of filters at deeper layers in the BU network is efficiency-oriented. As the feature map resolution decreases, the number of channels increases, keeping the computational complexity roughly fixed per layer. Typically, in standard architectures the number of filters is doubled every time the spatial dimensions are halved [12, 38].

In our method we consider three options for deciding the number of filters per layer: the TD model, which is exactly the opposite of the BU in that the number of channels is reduced with depth; the uniform model (TD_uni), where the layers have a uniform number of filters; and the reversed model (TD_rev), which follows the BU filter arrangement, with the channel dimension widened with depth.
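The three filter arrangements can be written down as a small helper that derives per-stage channel counts from a bottom-up baseline. This is only a sketch: the dictionary keys are labels chosen here for the mirrored, uniform and reversed variants, the example widths are hypothetical, and the default uniform width (the rounded mean of the baseline widths) is an assumption, since the text does not say how the uniform width is picked.

```python
def filter_arrangements(bu_filters, uniform_width=None):
    """Derive per-stage filter counts for the three top-down variants.

    bu_filters: filter counts of the bottom-up baseline, input to output.
    uniform_width: width used for every stage of the uniform variant;
                   defaults to the rounded mean of bu_filters (an assumption).
    """
    if uniform_width is None:
        uniform_width = round(sum(bu_filters) / len(bu_filters))
    return {
        "TD": list(reversed(bu_filters)),        # channels shrink with depth
        "TD_uni": [uniform_width] * len(bu_filters),  # same width everywhere
        "TD_rev": list(bu_filters),              # channels grow with depth
    }

# Hypothetical baseline that doubles its filters at every stage.
print(filter_arrangements([16, 32, 64], uniform_width=32))
```

Because feature maps grow with depth in a top-down network, the reversed variant places the widest layers at the highest resolution, which is the most expensive of the three options.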

4 Experiments

In Exp 1 we evaluate the three different filter arrangement options proposed for the top-down model. We compare these model variations with the bottom-up baseline on the MNIST, Fashion-MNIST and CIFAR10 classification tasks. In Exp 2 we evaluate the robustness of our proposed model against various adversarial attacks applied on the same datasets. Finally, in Exp 3 we illustrate the explainability capabilities of our top-down model when compared to the bottom-up, and demonstrate its benefits for a small object localization task.

Experimental setup. We compare our TD proposal with its respective BU baseline on MNIST, Fashion-MNIST and CIFAR10. For the simpler MNIST tasks we consider as baselines “LeNetFC”, a fully-convolutional variant of LeNet [24], and, following [28], a lightweight version of the NIN (Network-In-Network) architecture, namely “NIN-light”, with reduced filters. The original NIN architecture was used for the CIFAR10 task, along with the ResNet32 introduced in [12], incorporating the pre-activation unit of [13]. Batch Normalization [17] is used in all the networks prior to the non-linearities. The corresponding TD networks are defined based on their BU baselines. Table 1 lists the number of parameters of the different models. For TD we consider three variants: TD, which mirrors the BU architecture also in terms of filter depth; TD_uni, using uniform filter depth; and TD_rev, where the filter depth of the TD is reversed, thus following the filter depth of BU. There is an increase in the number of parameters for the TD networks, because we need additional convolutional layers for merging the high and low resolution information.

We abide by the setup found in the initial publications for the BU models. For the TD networks we performed a linear search for learning rate, batch size, and weight decay. For all cases we train with a 90/10 train/val split, using SGD with momentum of 0.9 and a 3-stage learning rate decay scheme, dividing the learning rate by 10 at and of the total number of epochs. For the CIFAR10 dataset we test with and without augmentation, employing horizontal translations and flips. We repeat runs four times, with dataset reshuffling and extracting new training and validation splits, and report mean and standard deviation of the test accuracy.

Model | BU | TD | TD_uni | TD_rev
LeNetFC | 8k | 14k | 23k | 58k
NIN-light | 62k | 213k | 215k | 214k
ResNet32 | 468k | 528k | 320k | 563k
NIN | 970k | 3,368k | 3,397k | 3,388k
Table 1: Number of trainable parameters for the different architectures considered. Different rows correspond to the different baseline architectures and columns indicate the bottom-up model and the three top-down variants with different filter arrangements (section 3.3). There is an increase in the number of parameters for the TD networks, because they merge the high and low resolution information using additional convolutional layers.

4.1 Exp. 1: Bottom-up versus top-down

Figure 4: Exp 1: Comparison of MNIST, Fashion-MNIST, CIFAR10, and CIFAR10_aug (with augmentation) mean test accuracies between BU and the three different configurations of TD proposed in subsection 3.3. TD networks perform on par with, and at times surpass, the baseline performance of their respective BU. Regarding filter depth configurations, TD_rev displays the highest performance, at the cost of increased parameters. Considering the small gap in performance and the increased cost for TD_rev, we henceforth adopt the TD configuration.
Figure 5: Exp 2: Test accuracy when extracted adversarial perturbations are fed to either the highest, medium, or lowest scale input of the TD network (refer to Figure 2), using the NIN-light baseline on MNIST and Fashion-MNIST, and NIN on CIFAR10. The remaining two inputs are fed the original, unperturbed samples. As the dataset becomes more challenging, the highest vulnerability moves from the medium input to the highest scale input. This is attributed to the absence of information in the high frequency region for the simpler cases, i.e. MNIST. (See the appendix for additional results.)

Figure 4 shows the test accuracy of the considered models across datasets. The TD networks are on par with, and in some cases surpass, the corresponding baseline performance. When considering the different filter depth configurations, TD_rev performs best due to increased representational power at higher scales, though at the cost of increased complexity. The NIN architecture adopts a close to uniform filter arrangement, hence the three TD variants reach roughly the same performance. We adopt the TD variants henceforth, on account of the small gap in performance and reduced complexity. This experiment provides empirical evidence of the applicability of the proposed pipeline to different network architectures.

Figure 6: Exp 2: Comparison of adversarial robustness considering different datasets, models and attacks. The x-axis of each figure corresponds to the distance between the original and the perturbed image and the y-axis is the introduced loss in test accuracy. A lower curve suggests increased robustness. Green curves corresponding to TD are consistently underneath the respective red curves of the BU networks, for most attacks. The TD networks are more robust against both correlated and uncorrelated noise attacks due to the coarse-to-fine processing, suppressing high frequency information at earlier stages. Additionally, the blurred downsampling offers enhanced robustness against blurring attacks. For spatial attacks, we see no increased robustness. (See the appendix for additional results.)

4.2 Exp. 2: Adversarial robustness

We evaluate the robustness of BU versus TD against various attacks, where we attack the test set of each dataset using Foolbox [32]. For all the attacks, the default parameters were used. To make the attack bound tighter, we repeat each attack three times and keep the worst case for each sample to define the minimum required perturbation for fooling the network.
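The bookkeeping behind this protocol can be sketched without any attack library. The code below assumes that each repeat of an attack yields, per test sample, the distance of the adversarial example it found (or infinity when it failed), and that every sample was correctly classified before the attack, so the fraction of fooled samples equals the loss in accuracy. The function names and the numbers in the example are hypothetical; no Foolbox calls are shown.

```python
import math

def minimal_perturbation(distances_per_repeat):
    """Keep, per sample, the smallest perturbation found over all repeats.

    distances_per_repeat: one list per repeat, each holding a distance per
    sample (math.inf when that repeat failed to fool the network).
    """
    return [min(per_sample) for per_sample in zip(*distances_per_repeat)]

def accuracy_loss_curve(min_distances, budgets):
    """Fraction of samples fooled within each perturbation budget."""
    n = len(min_distances)
    return [sum(d <= b for d in min_distances) / n for b in budgets]

# Three repeats of one attack on three samples (made-up distances).
repeats = [
    [0.5, math.inf, 0.2],
    [0.3, 0.9, 0.4],
    [0.6, math.inf, 0.1],
]
worst_case = minimal_perturbation(repeats)
print(worst_case)
print(accuracy_loss_curve(worst_case, [0.0, 0.25, 0.5, 1.0]))
```

Taking the minimum over repeats is the worst case from the defender's point of view: it is the smallest perturbation any repeat needed to change the prediction.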

Figure 6 provides, for each attack, plots of loss in test accuracy versus the distance between the original and the perturbed input. TD networks are visibly more resilient against attacks introducing uncorrelated noise, due to the coarse-to-fine processing adopted, with downscaled inputs diminishing the noise. For attacks introducing correlated noise, such as the “Pointwise” attack [35], the perturbed pixels tend to lie in smooth regions of the image. Thus each single pixel value of 0 (or 1) in a region of 1s (or 0s) essentially acts as a Dirac delta function. Given the convolutional nature of CNNs, this type of attack “pollutes” the input with imprints of the learned filters (for an imperfect delta function, this yields blurred versions of the filters), which gradually span a greater part of the feature map as more convolutions are applied. Due to the highly correlated nature of the perturbation, the blurred downsampling cannot completely eradicate the noise, but helps decrease the introduced pollution. On the contrary, for BU networks, the noise is directly propagated down the network. Additionally, the blurred downsampling wired into the TD network architecture offers enhanced robustness against blurring attacks, as the network encounters the input image at multiple scales during training, and is thus more resilient to resolution changes. Since anti-aliasing before downsampling is suggested to better preserve shift-invariance [51], we expected our networks to also be more robust against the “Spatial” attack [4]. However, no enhanced robustness is reported for TD networks; a substantial difference in robustness is observed for ResNet32, which could be due to the performance gap measured in Exp 1 between the TD and its BU baseline. We also tested with the TD_uni and TD_rev variants of the ResNet32 architecture, with respective results provided in the appendix.

To gain better insight into robustness, we introduce the generated attacks to a single resolution branch of the TD networks, using the NIN-light architecture on MNIST and Fashion-MNIST, and NIN on CIFAR10. This is displayed in Figure 5. We feed the extracted perturbations to either the low, medium or high resolution input branch, as illustrated in the model architecture in Figure 2. For the simpler MNIST task, the medium-resolution input of the network is the most vulnerable, which is mainly attributed to the absence of information in the high frequency region of the input’s spectrum. Moving to the more challenging Fashion-MNIST and CIFAR10 tasks, the high frequency input becomes the easiest path for fooling the network. Please see the appendix for additional results when perturbing two inputs simultaneously.

4.3 Exp 3: Explainability and localization

Figure 7: Exp 3.(a): Fine-to-coarse versus coarse-to-fine processing. We show Grad-CAM heatmaps for the BU ResNet18 versus its respective TD, trained on the Imagenette dataset [18], for a random validation image. Higher layer index means increased depth in the architecture: “Layer 1” corresponds to the activation of the input to the first group of residual blocks, and “Layer 2” to “Layer 5” to the activations of the output of each of these four groups, each one corresponding to a different spatial resolution. Top: the BU network, employing fine-to-coarse processing. Bottom: the respective TD network following the opposite path, starting with a holistic representation and gradually adding higher frequency information in deeper layers.
Figure 8: Exp 3.(a): Grad-CAM heatmaps corresponding to the last convolutional layer in the network. Top: The original input image, randomly selected from the validation set. Middle: Corresponding Grad-CAM heatmaps for the BU ResNet18. Bottom: Grad-CAM heatmaps for the TD ResNet18. Contrary to the coarse output of the BU, the TD network outputs high frequency feature maps, based on which the final classification is performed. TD recognizes objects based on their fine-grained attributes, such as the spots on the dogs, the cross on the church, or shape information. (See the appendix for additional results.)
Figure 9: Exp 3.(b): Precision and recall for the MNIST and Fashion-MNIST datasets using the NIN-light architecture. The numbers are reported over four runs and we also plot standard deviations. For each run, models are trained from scratch and the set of TP (true positive), FP (false positive), FN (false negative) is computed between the Grad-CAM heatmaps and the segregated objects. The TD model has higher precision on both MNIST and Fashion-MNIST due to more accurate object localization, while having slightly lower recall than BU on Fashion-MNIST.

(a) Grad-CAM heatmap visualizations. Grad-CAM [36] provides class-discriminative localization maps, based on the feature maps of a convolutional layer, highlighting the most informative features for the classification task. Here, we use the features of the last convolutional layer. The extracted heatmap is restored to the original image scale, thus producing a coarser map in the case of the BU, whose feature map size at the final layer is smaller. On the contrary, for TD the scale of the feature maps matches the scale of the input, hence Grad-CAM outputs a finer map.

The Grad-CAM heatmaps corresponding to a BU and a TD network are provided in Figure 7. These are obtained from various layers of a ResNet18 architecture [12] trained on the Imagenette dataset [18]. For further information about the setup please refer to the appendix. “Layer 1” corresponds to the activation of the input to the first group of residual blocks, and “Layer 2” to “Layer 5” to the activations of the output of each of these four groups, each one corresponding to a different spatial resolution. The visualizations demonstrate that TD follows an opposite, coarse-to-fine path, starting from a coarser representation and gradually enriching it with higher frequency information. Hence, TD networks mirror the BU not only in their architectural design, but also in their learning process.

Additional heatmaps corresponding to correctly classified images, taken from the last convolutional layer of the networks, are visualized in Figure 8. The figures depict the coarse localization in BU versus the fine localization in TD. We intentionally selected images with multiple objects. The TD networks recognize objects based on fine-grained information, such as the spots on the dog, the cross on the church or boundary information of various objects.

(b) Weakly-supervised object localization. For a quantitative evaluation of the localization abilities of TD, we used the MNIST and Fashion-MNIST datasets and the NIN-light model as a backbone architecture. Figure 9 shows mean precision and recall scores for the BU and TD models over four runs. For each run models were trained from scratch, then TP (true positive), FP (false positive), FN (false negative) values were computed between the Grad-CAM heatmaps and the thresholded objects, corresponding to the test set of the considered task. We used a threshold empirically set to . Based on the computed values, precision and recall scores were extracted and aggregated over the four runs. For a fair comparison only the samples correctly classified by both BU and TD were considered. The TD models report higher precision for both tasks considered, suggesting finer object localization. The lower recall scores on Fashion-MNIST are attributed to the higher number of FN compared to the BU model. The larger object sizes of the Fashion-MNIST task, along with the coarse output of the BU model, which is able to capture a greater extent of them, lead to fewer FN. On the contrary, the TD models focus on finer aspects of the objects, which are informative for the classification task. Considering the fine-grained focus in the Grad-CAM outputs and the potential for weakly-supervised object localization, TD networks comprise a promising direction for future research.
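The pixel-level scoring described here can be sketched as follows, assuming the heatmap has already been resized to the image and the object mask is a binary grid obtained by thresholding the input. The function name is hypothetical, and the threshold is left as a parameter because its value is not recoverable from the text above; the 0.5 in the example is purely illustrative.

```python
def localization_scores(heatmap, object_mask, threshold):
    """Pixel-wise precision and recall of a thresholded heatmap.

    heatmap: 2D list of activation values, same size as object_mask.
    object_mask: 2D list of 0/1 values marking object pixels.
    threshold: heatmap values >= threshold count as predicted object pixels.
    """
    tp = fp = fn = 0
    for h_row, m_row in zip(heatmap, object_mask):
        for h, m in zip(h_row, m_row):
            predicted = h >= threshold
            if predicted and m:
                tp += 1      # highlighted pixel on the object
            elif predicted:
                fp += 1      # highlighted pixel off the object
            elif m:
                fn += 1      # object pixel that was not highlighted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy 2x2 example with an illustrative threshold.
print(localization_scores([[0.9, 0.1], [0.8, 0.2]], [[1, 1], [0, 0]], 0.5))
```

A finer heatmap tends to produce fewer highlighted pixels off the object (higher precision) but may miss parts of large objects (lower recall), which matches the trade-off reported for Fashion-MNIST.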

5 Discussion

The current work aims at providing a fresh perspective on the architecture of CNNs, which is currently taken for granted. The coarse-to-fine pathway is biologically inspired by how humans perceive visual information: first understanding the context and then filling in the salient details.

One downside of our proposed networks is that expanding dimensions at increased network depth leads to memory and computational bottlenecks. This is due to the feature map size being larger at higher depths. Moreover, for the same reason, adding fully-connected layers before the output layer of the TD architectures leads to a vast increase in the number of model parameters. Hence, fully convolutional networks are preferable. This increase in memory is also more visible with large-scale datasets such as ImageNet [2]. A simple workaround requiring no architectural adaptations would be to employ mixed-precision training, which would decrease the memory requirements, but would increase the computational complexity. Alternatively, instead of increasing the spatial resolution of the feature maps at later depths, we could use patches of the input of limited sizes. These informative patches could be selected using the Grad-CAM heatmaps, by taking the high-activation areas of the heatmap, or by considering self-attention mechanisms [45]. In addition to addressing the aforementioned limitations, we find the weakly-supervised setting to be a promising area of future research.

6 Conclusion

In the current work, we revisit the architecture of conventional CNNs, aiming at diverging from the manner in which resolution is typically processed in deep networks. We propose novel network architectures which reverse the resolution processing of standard CNNs. The proposed paradigm adopts a coarse-to-fine information processing pathway, starting from the low resolution information, which provides the visual context, and subsequently adding back the high frequency information. We empirically demonstrate the applicability of our proposed TD architectures when starting from a range of baseline architectures, and considering multiple visual recognition tasks. TD networks exhibit enhanced robustness against certain types of adversarial attacks. This resistance to adversarial attacks is induced directly by the network design choices. Additionally, the high spatial dimensions of the feature maps in the last layer significantly enhance the explainability of the model, and demonstrate potential for weakly-supervised object localization tasks.


  • [1] S. Aich, M. Yamazaki, Y. Taniguchi, and I. Stavness (2020) Multi-scale weight sharing network for image recognition. Pattern Recognition Letters 131, pp. 348–354. Cited by: §2.
  • [2] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, Cited by: §B.1, §5.
  • [3] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pp. 2366–2374. Cited by: §2.
  • [4] L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry (2017) Exploring the landscape of spatial robustness. CoRR. Cited by: Appendix A, §4.2.
  • [5] Q. Fan, C. R. Chen, H. Kuehne, M. Pistoia, and D. Cox (2019) More is less: learning efficient video representations by big-little network and depthwise temporal aggregation. In Advances in Neural Information Processing Systems, pp. 2261–2270. Cited by: §2.
  • [6] Y. Fan, J. Yu, D. Liu, and T. S. Huang (2020) Scale-wise convolution for image restoration. Association for the Advancement of Artificial Intelligence (AAAI). Cited by: §2.
  • [7] D. J. Field, A. Hayes, and R. F. Hess (1993) Contour integration by the human visual system: evidence for a local “association field”. Vision research 33 (2), pp. 173–193. Cited by: §1.
  • [8] F. Fleuret and D. Geman (2001) Coarse-to-fine face detection. International Journal of Computer Vision 41 (1-2), pp. 85–107. Cited by: §2.
  • [9] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. International Conference on Learning Representations. Cited by: §2.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. International Conference on Learning Representations. Cited by: §1.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916. Cited by: §2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §B.1, §1, §3.3, §4.3, §4.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on Computer Vision, pp. 630–645. Cited by: §B.1, §4.
  • [14] J. Hegdé (2008) Time course of visual perception: coarse-to-fine processing and beyond. Progress in neurobiology. Cited by: §1.
  • [15] S. Honari, J. Yosinski, P. Vincent, and C. Pal (2016) Recombinator networks: learning coarse-to-fine feature aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5743–5752. Cited by: §2.
  • [16] Y. Hu, R. Song, and Y. Li (2016) Efficient coarse-to-fine patchmatch for large displacement optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5704–5712. Cited by: §2.
  • [17] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR. Cited by: §4.
  • [18] F. Jeremy Howard The imagenette dataset. GitHub. Note: Cited by: Figure 13, Figure 14, §B.1, §B.2, Figure 7, §4.3.
  • [19] J. Jo and Y. Bengio (2017) Measuring the tendency of cnns to learn surface statistical regularities. CoRR. Cited by: §2.
  • [20] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations. Cited by: §2.
  • [21] T. Ke, M. Maire, and S. X. Yu (2017) Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6665–6673. Cited by: §2.
  • [22] I. Kovács and B. Julesz (1993) A closed curve is much more than an incomplete one: effect of closure in figure-ground segmentation.. PNAS. Cited by: §1.
  • [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, Cited by: §B.1, §1.
  • [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.
  • [25] M. Li, Y. Sun, Z. Zhang, and J. Yu (2018) A coarse-to-fine face hallucination method by exploiting facial prior knowledge. In International Conference on Image Processing (ICIP), pp. 61–65. Cited by: §2.
  • [26] X. Li and F. Li (2017) Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5764–5772. Cited by: §2.
  • [27] B. Liang, H. Li, M. Su, X. Li, W. Shi, and X. Wang (2017) Detecting adversarial examples in deep networks with adaptive noise reduction. CoRR. Cited by: §2.
  • [28] M. Lin, Q. Chen, and S. Yan (2014) Network in network. International Conference on Learning Representations. Cited by: §4.
  • [29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §2.
  • [30] B. D. Lucas, T. Kanade, et al. (1981) An iterative image registration technique with an application to stereo vision. Cited by: §2.
  • [31] R. Raju and M. Lipasti (2019) BlurNet: defense by filtering the feature maps. CoRR. Cited by: §2.
  • [32] J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. CoRR. Cited by: §4.2.
  • [33] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
  • [34] H. Sahbi (2017) Coarse-to-fine deep kernel networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1131–1139. Cited by: §2.
  • [35] L. Schott, J. Rauber, M. Bethge, and W. Brendel (2018) Towards the first adversarially robust neural network model on mnist. CoRR. Cited by: §4.2.
  • [36] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §2, §4.3.
  • [37] E. Shelhamer, D. Wang, and T. Darrell (2019) Blurring the line between structure and learning to optimize and adapt receptive fields. CoRR. Cited by: §2.
  • [38] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. International Conference on Learning Representations. Cited by: §1, §3.3.
  • [39] I. Sosnovik, M. Szmaja, and A. Smeulders (2020) Scale-equivariant steerable networks. International Conference on Learning Representations. Cited by: §2.
  • [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §1.
  • [41] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. International Conference on Learning Representations. Cited by: §1.
  • [42] J. Wagemans, J. H. Elder, M. Kubovy, S. E. Palmer, M. A. Peterson, M. Singh, and R. von der Heydt (2012) A century of gestalt psychology in visual perception: i. perceptual grouping and figure-ground organization.. Psychological bulletin. Cited by: §1.
  • [43] H. Wang, X. Wu, P. Yin, and E. P. Xing (2019) High frequency component helps explain the generalization of convolutional neural networks. CoRR. Cited by: §2.
  • [44] Z. Wu, C. Xiong, Y. Jiang, and L. S. Davis (2019) LiteEval: a coarse-to-fine framework for resource efficient video recognition. In Advances in Neural Information Processing Systems, pp. 7778–7787. Cited by: §2.
  • [45] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, Cited by: §5.
  • [46] Y. Xu, T. Xiao, J. Zhang, K. Yang, and Z. Zhang (2014) Scale-invariant convolutional neural networks. CoRR. Cited by: §2.
  • [47] T. Yang, S. Zhu, S. Yan, M. Zhang, A. Willis, and C. Chen (2019) A closer look at network resolution for efficient network design. CoRR. Cited by: §2.
  • [48] C. Ye, C. Devaraj, M. Maynord, C. Fermüller, and Y. Aloimonos (2018) Evenly cascaded convolutional networks. In 2018 IEEE International Conference on Big Data (Big Data), pp. 4640–4647. Cited by: §2.
  • [49] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2016) Top-down neural attention by excitation backprop. In European Conference on Computer Vision, Cited by: §2.
  • [50] J. Zhang, S. Shan, M. Kan, and X. Chen (2014) Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In European conference on computer vision, pp. 1–16. Cited by: §2.
  • [51] R. Zhang (2019) Making convolutional networks shift-invariant again. International Conference on Machine Learning. Cited by: §2, §3.1, §4.2.
  • [52] X. Zhang, L. Zhaoping, T. Zhou, and F. Fang (2012) Neural Activities in V1 Create a Bottom-Up Saliency Map. Neuron. Cited by: §1.
  • [53] Z. Zhang, C. Jung, and X. Liang (2019) Adversarial defense by suppressing high-frequency components. CoRR. Cited by: §2.
  • [54] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.


Appendix A Exp 2: Adversarial robustness

Figure 10: Exp 2: Complete set of results for the second experiment. Plots of test accuracy loss versus the distance between original and perturbed input, where each column corresponds to a different task. Top-down networks exhibit enhanced robustness against correlated/uncorrelated noise and blurring attacks.
Figure 11: Exp 2: Test accuracy loss versus the distance between original and perturbed input, for the CIFAR10-augmented and the ResNet32 architectures. Robustness is enhanced for the spatial attacks, but in general and variants exhibit similar behaviour to the baseline, which can be attributed to the increased filters at deeper layers.

The entire set of results for the adversarial robustness experiment is provided in figure 10. “ShiftsAttack” is a variant of the Spatial attack [4] that introduces only spatial shifts. Top-down networks exhibit enhanced robustness against attacks introducing correlated/uncorrelated noise, as well as against blurring attacks.
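To make the notion of a shifts-only attack concrete, the sketch below searches over integer translations of an image, from the smallest to the largest, and returns the first one that changes the prediction. This is a minimal sketch under assumptions and does not reproduce the Foolbox implementation used in our experiments: the helper names `shift_image` and `shifts_attack`, the zero fill value, the search order, and the toy classifier are all hypothetical.

```python
def shift_image(image, dy, dx, fill=0.0):
    """Translate a 2D image by (dy, dx) pixels, filling uncovered pixels with `fill`."""
    rows, cols = len(image), len(image[0])
    shifted = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < rows and 0 <= src_c < cols:
                shifted[r][c] = image[src_r][src_c]
    return shifted


def shifts_attack(predict, image, label, max_shift):
    """Try translations in order of increasing size; return the first (dy, dx)
    that changes the prediction, or None if no shift within range succeeds."""
    candidates = [
        (dy, dx)
        for dy in range(-max_shift, max_shift + 1)
        for dx in range(-max_shift, max_shift + 1)
        if (dy, dx) != (0, 0)
    ]
    candidates.sort(key=lambda s: (s[0] ** 2 + s[1] ** 2, s))
    for dy, dx in candidates:
        if predict(shift_image(image, dy, dx)) != label:
            return dy, dx
    return None


# Toy classifier (hypothetical): class 1 if the left half is brighter than the right half.
def toy_predict(image):
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    right = sum(sum(row[half:]) for row in image)
    return 1 if left > right else 0


image = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
adversarial_shift = shifts_attack(toy_predict, image, label=toy_predict(image), max_shift=2)
print(adversarial_shift)
```

Because only translations are searched, the size of the perturbation is naturally measured by the shift magnitude, and no rotation or pixel-level noise is introduced.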

Figure 11 presents the robustness results for the CIFAR10-augmented and the ResNet32 architecture variants. Clearly, and variants exhibit enhanced robustness against spatial attacks; however, their behaviour against the other attacks is similar to that of the baseline. This can be attributed to the increased number of filters at greater depths of the network, or equivalently at larger feature map scales, and thus to the greater contribution of the finer scales to the final output. Finer scales, however, are much more vulnerable to attacks. All in all, the reversal of the network for the extraction of the variant is not solely efficiency driven, keeping a roughly fixed computational complexity across layers, but also contributes to the network’s robustness. Finally, the respective figure for the non-augmented CIFAR10 case tells the same story.
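The remark on computational complexity can be checked with a small back-of-the-envelope calculation. The sketch below counts multiply-accumulate operations per convolution layer; the stage resolutions and filter counts are hypothetical values chosen for illustration, not the exact configurations of our networks, and `conv_macs` is an illustrative helper.

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulate count of one stride-1, same-padded convolution layer."""
    return height * width * in_channels * out_channels * kernel * kernel


# Hypothetical stages as (height, width, channels), resolution doubling with depth.
# Filters grow together with resolution: cost rises sharply with depth.
growing = [(8, 8, 16), (16, 16, 32), (32, 32, 64)]
# Filter counts reversed, so filters halve whenever resolution doubles.
reversed_filters = [(8, 8, 64), (16, 16, 32), (32, 32, 16)]

growing_cost = [conv_macs(h, w, c, c) for h, w, c in growing]
reversed_cost = [conv_macs(h, w, c, c) for h, w, c in reversed_filters]

print(growing_cost)
print(reversed_cost)
```

Doubling the height and width multiplies the cost by 4, while halving both the input and the output channels divides it by 4, so the per-layer cost stays fixed in the reversed setting. When filters and resolution grow together, each stage in this toy setting is 16 times more expensive than the previous one.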

Next, figure 12 presents the respective results for reintroducing the perturbation to two of the inputs of the network. Clearly, the highest and medium scale inputs are the most vulnerable ones, except for the simpler case of the MNIST dataset. There, the absence or scarcity of information in the high frequency region makes the medium and lowest scale inputs the ones with the highest impact.

Figure 12: Exp 2: Reintroducing perturbations to two of the inputs of the model when using a NIN-light backbone for MNIST and Fashion-MNIST, and the NIN backbone for CIFAR10. Clearly, perturbing the two highest scale inputs, “high-medium”, has the highest impact. For the simpler MNIST, where the information is concentrated in the low to mid frequency region, the medium and lowest scale inputs have the highest impact instead.

Appendix B Exp 3.(a): Explainability

B.1 Imagenette training

Imagenette [18] is a 10-class sub-problem of ImageNet [2], allowing experimentation with a more realistic task without the high training times and computational costs required for training on a large-scale dataset. A set of examples, along with their corresponding labels, is provided in figure 13. The dataset contains 9469 training and 3925 validation samples. Training samples were resized to , from which random crops were extracted; validation samples were resized to .

We utilized a lighter version of the ResNet18 architecture introduced in [12] for Imagenette training, obtained by dividing the number of filters of the original architecture by 2, as this is a 10-class sub-problem; we also incorporated the pre-activation unit of [13]. Additionally, the stride and the kernel extent of the first convolution for depth initialization were set to and respectively. Regarding training, a crop is extracted from the original image, or its horizontal flip, while subtracting the per-pixel mean [23]; the color augmentation of [23] is also used. For the bottom-up network, a batch size of 128 is used and the network is trained for a total of 50 epochs with a starting learning rate of 0.1. As for the top-down network, the increased memory footprint led to the reduction of the batch size to 64 and the adaptation of the starting learning rate and the total epochs to and 80. We trained using SGD with a momentum of 0.9 and a weight decay of 0.001; we also adopted a 3-stage learning rate decay scheme, where the learning rate is divided by 10 at and of the total epochs. Regarding performance, outperformed the variant by roughly . Finally, Grad-CAM is utilized for generating class-discriminative localization maps of the most informative features.
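The 3-stage decay scheme can be sketched as a simple step function of the epoch. Note that the milestone fractions used below (0.5 and 0.75 of the total epochs) are placeholder assumptions chosen only so that the example runs, and are not the values used in our experiments; the function name `step_decay_lr` is likewise hypothetical. The starting learning rate of 0.1 and the 50 epochs correspond to the first configuration described above.

```python
def step_decay_lr(epoch, total_epochs, base_lr, milestones=(0.5, 0.75), factor=10.0):
    """Learning rate at `epoch` (0-indexed): divided by `factor` once for every
    milestone (given as a fraction of `total_epochs`) that has been reached."""
    drops = sum(1 for fraction in milestones if epoch >= int(fraction * total_epochs))
    return base_lr / (factor ** drops)


# Two milestones give three stages of constant learning rate.
schedule = [step_decay_lr(epoch, total_epochs=50, base_lr=0.1) for epoch in range(50)]
stages = sorted(set(schedule), reverse=True)
print(len(stages))
```

Expressing the milestones as fractions of the total epochs lets the same scheme be reused when the number of epochs changes, as it does between the two training configurations.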

B.2 Grad-CAM heatmap visualizations

Figure 14 displays some additional Grad-CAM visualizations. The visualizations are obtained by using a ResNet18 architecture for the networks and its corresponding variant. The original images are taken from the Imagenette dataset [18].

The top-down model provides localized activations, focusing on certain informative aspects of the image, while the bottom-up model focuses on large connected areas. Because of this difference, we believe the top-down model may be more precise than the bottom-up model for tasks such as weakly-supervised object detection.
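For reference, the core Grad-CAM computation behind such heatmaps weights each feature map by the spatial average of its gradient with respect to the class score, sums over channels, and applies a ReLU. The sketch below is a minimal pure-Python version, assuming the activations and gradients of the last convolutional layer are already available as nested lists of shape [channels][height][width]. The helper `active_fraction` is a hypothetical measure, introduced here only to illustrate how localized and spread-out heatmaps could be told apart; it is not a metric used in our experiments.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from [K][H][W] activations and gradients (nested lists)."""
    channels = len(activations)
    rows, cols = len(activations[0]), len(activations[0][0])
    # Channel weights: global average pooling of the gradients.
    weights = [
        sum(sum(row) for row in gradients[k]) / (rows * cols)
        for k in range(channels)
    ]
    # Weighted sum over channels, followed by a ReLU.
    return [
        [
            max(0.0, sum(weights[k] * activations[k][r][c] for k in range(channels)))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def active_fraction(heatmap, threshold=0.5):
    """Fraction of pixels whose value exceeds `threshold` times the heatmap maximum."""
    peak = max(max(row) for row in heatmap)
    if peak <= 0.0:
        return 0.0
    flat = [value for row in heatmap for value in row]
    return sum(1 for value in flat if value > threshold * peak) / len(flat)


# Toy example (hypothetical values): two channels on a 2x2 feature map.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
gradients = [
    [[2.0, 2.0], [2.0, 2.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
cam = grad_cam(activations, gradients)
print(cam, active_fraction(cam))
```

A small active fraction indicates a heatmap concentrated on a few pixels, whereas a value close to 1 indicates activation spread over large connected areas. With higher-resolution final feature maps, the heatmap needs less upsampling to reach the input size.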

Figure 13: Exp 3.(a): Validation samples from the Imagenette dataset [18], along with their corresponding ground truth labels. Samples are resized to .
Figure 14: Exp 3.(a): Grad-CAM heatmap visualizations on validation images from the Imagenette dataset [18], using a ResNet18 architecture for and its corresponding variant. All images are correctly classified. Rows 1 and 4 show the original Imagenette images; rows 2 and 5 show the heatmaps, while rows 3 and 6 visualize the heatmaps. Focusing on local rather than global information may help the top-down model to be more precise for object detection than the bottom-up model.