Biological vision adopts a coarse-to-fine information processing pathway, from initial visual detection and binding of salient features of a visual scene, to the enhanced and preferential processing given relevant stimuli. In contrast, CNNs employ fine-to-coarse processing, moving from local, edge-detecting filters to more global ones extracting abstract representations of the input. In this paper we reverse the feature extraction part of standard bottom-up architectures and turn them upside-down: We propose top-down networks. Our proposed coarse-to-fine pathway, by blurring higher frequency information and restoring it only at later stages, offers a line of defence against adversarial attacks that introduce high frequency noise. Moreover, since we increase image resolution with depth, the high resolution of the feature map in the final convolutional layer contributes to the explainability of the network's decision making process. This favors object-driven decisions over context-driven ones, and thus provides better localized class activation maps. This paper offers empirical evidence for the applicability of the top-down resolution processing to various existing architectures on multiple visual tasks.
In human biological vision, perceptual grouping of visual features is based on Gestalt principles, where factors such as proximity, similarity or good continuation of features generate a salient percept. Salient objects are rapidly and robustly detected and segregated from the background in what is termed the “pop-out” effect [7, 22]. This initial detection and grouping of salient features into a coherent percept leads to preferential processing by the visual system, described as stimulus-driven attention. For relevant visual stimuli, the exogenously directed attention is sustained, and results in a more detailed visual evaluation of the object. This typical pipeline of perception and attention allocation in biological vision represents an efficient, coarse-to-fine processing of information. In contrast, modern CNNs (Convolutional Neural Networks) do not incorporate this perspective [12, 23, 38, 40].
Standard CNNs begin with the high resolution input, and propagate information in a fine-to-coarse pathway. Early layers learn to extract local, shareable features, whereas deeper layers learn semantically rich and increasingly invariant representations. In this paper we propose the reversal of the conventional feature extraction of standard CNNs, as depicted in Figure 1. More specifically, we suggest the adoption of a coarse-to-fine processing of the input, which can be interpreted as gradual focusing of visual attention. The top-down hierarchy first extracts the gist of a scene, starting from a holistic initial representation, and subsequently enhances it with higher frequency information.
A growing body of literature since the seminal work of [10, 41] shows that adversarial perturbations with high-frequency components may cause substantial misclassifications. Suppressing higher frequencies in the input image, as proposed in our top-down paradigm, can provide a first line of defence. At the same time, explainability of the decision making process of CNNs has recently emerged as an important research direction [36, 54]. In this context, our coarse-to-fine processing scheme, having feature maps with higher spatial resolution at deeper layers, favors object-driven decisions over context-driven ones, and provides better localized class activation maps.
We make the following contributions: (i) We propose biologically inspired top-down network architectures, obtained by reversing the resolution processing of conventional bottom-up CNNs; (ii) We analyze various methods of building top-down networks based on bottom-up counterparts as well as the difference in resolution-processing between these models, providing a versatile framework that is directly applicable to existing architectures; (iii) We compare our proposed model against the baseline on a range of adversarial attacks and demonstrate enhanced robustness against certain types of attacks. (iv) We find enhanced explainability for our top-down model, with potential for object localization tasks. Trained models and source code for our experiments are available online: https://github.com/giannislelekas/topdown.
Coarse-to-fine processing. Coarse-to-fine processing is an integral part of efficient algorithms in computer vision. Iterative image registration gradually refines registration from coarser variants of the original images (see also [8]), and coarse-to-fine face alignment using stacked auto-encoders has been introduced. Efficient action recognition is achieved by using coarse and fine features coming from two LSTM (Long Short-Term Memory) modules. Coarse-to-fine kernel networks have also been proposed, where a cascade of kernel networks of increasing complexity is used. Existing coarse-to-fine methods consider both coarse input resolution and gradually refined processing. Here, we also focus on coarse-to-fine image resolution; however, we are the first to do this in a single deep neural network, trained end-to-end, rather than in an ensemble.
Bottom-up and top-down pathways. Many approaches exploit high spatial resolution for finer feature localization, which is crucial for semantic segmentation. The U-net and FPN (Feature Pyramid Networks) merge information from bottom-up and top-down pathways, combining semantically rich information of the bottom-up with the fine localization of the top-down stream. Similarly, combinations of a high-resolution and a low-resolution branch were proposed for efficient action recognition, for face hallucination, and for depth map prediction. Top-down signals are also used to model neural attention via a backpropagation algorithm, and to extract informative localization maps for classification tasks in Grad-CAM. Similarly, we also focus on top-down pathways where we slowly integrate higher levels of detail; however, our goal is biologically-inspired resolution processing, rather than feature-map activation analysis.
Multi-scale networks. Merging and modulating information extracted from multiple scales is vastly popular [15, 21, 47, 48, 46]. Feature maps have been resized by a factor to obtain cascades of multiple resolutions, and incremental resolution changes have been proposed during GAN (Generative Adversarial Network) training. Convolutional weight sharing over multiple scales is proposed in [1, 47]. Similarly, convolutions over multiple scales have been combined with residual connections. Convolutions have also been performed over a grid of scales, thus combining information from multiple scales in one response, and responses over multiple scales have been combined using filters defined by 2D Hermite polynomials with a Gaussian envelope. Spatial pyramid pooling has been proposed for aggregating information at multiple scales. In this work, we also extract multi-resolution feature maps, in order to start processing from the lowest image scale and gradually restore high frequency information at deeper layers.
Beneficial effects of blurring. Suppressing high frequency information by blurring the input can lead to enhanced robustness [43, 53]. Models trained on blurred inputs exhibit increased robustness to distributional shift. Prior work reveals the bias of CNNs towards texture, and analyzes the effect of blurring distortions on the proposed Stylized-ImageNet dataset. Anti-aliasing by blurring before downsampling contributes to preserving shift invariance in CNNs. Gaussian kernels with learnable variance have been used to adapt the receptive field size. Rather than changing the receptive field size, works such as [27, 26, 31] use spatial smoothing for improved resistance to adversarial attacks. Similarly, we also rely on Gaussian blurring before downsampling the feature maps to avoid aliasing effects, and as a consequence we observe improved robustness to adversarial attacks.
Top-down (TD) networks mirror the baseline bottom-up (BU) networks, and reverse their feature extraction pathway. Information flows in the opposite direction, moving from lower to higher resolution feature maps. The initial input of the network corresponds to the minimum spatial resolution occurring in the baseline network. Downscaling operations are replaced by upscaling, leading to the coarse-to-fine information flow. By upscaling, the network can merely “hallucinate” higher resolution features. To restore the high frequency information, we use resolution merges, which combine the hallucinated features with higher frequency inputs, after each upscaling operation. Figure 2 depicts the difference between the BU architecture and our proposed TD architecture.
To avoid artifacts hampering the performance of the network, we blur the inputs before downsampling. For the upsampling operation we use interpolation followed by convolution. We have experimented with both nearest neighbor and bilinear interpolation, and have noticed improved robustness against adversarial attacks for nearest neighbor interpolation. We have also considered the use of transpose convolutions; however, we did not adopt these due to detrimental checkerboard artifacts.
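The blur-then-downsample and nearest neighbor upsampling steps described above can be sketched as follows. This is a minimal, dependency-free illustration on 2D nested lists, not the authors' implementation: the function names, the default `sigma`, the kernel radius of three standard deviations and the replicate padding are all assumptions, and the convolution that follows the interpolation in the paper is omitted.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Blur every row of a 2D list, replicating the border pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = min(max(x + k - radius, 0), width - 1)
                acc += w * row[src]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    return [list(column) for column in zip(*image)]

def blur_downsample(image, sigma=1.0, factor=2):
    """Separable Gaussian blur followed by subsampling (anti-aliasing)."""
    kernel = gaussian_kernel(sigma, radius=max(1, math.ceil(3 * sigma)))
    blurred = blur_rows(image, kernel)                          # horizontal pass
    blurred = transpose(blur_rows(transpose(blurred), kernel))  # vertical pass
    return [row[::factor] for row in blurred[::factor]]

def upsample_nearest(image, factor=2):
    """Nearest neighbor upsampling: repeat each pixel factor x factor times."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out
```

Because the kernel is normalized and the borders are replicated, a constant image stays constant after `blur_downsample`, which is a quick sanity check for the blur.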
Figure 3 depicts the considered method for merging the high resolution input with the low resolution information. We first upsample the low resolution input via a convolution and use an element-wise addition with the high-resolution branch. This information is then concatenated with the original high resolution information on the channel dimension. We subsequently use a convolution to expand the receptive field of the filters. The proposed merging of information slightly increases the number of parameters, while being effective in practice.
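The merge described above (upsample and convolve the low resolution branch, add it element-wise to the high resolution branch, concatenate with the original high resolution information, then convolve) might be sketched as below. The sketch assumes `[C][H][W]` nested lists and uses pointwise (1x1) convolutions as a stand-in for both convolutions, since the kernel sizes are not recoverable from the text; a 1x1 kernel does not expand the receptive field, so the final step only illustrates the data flow. All function and parameter names are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest neighbor upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        out.append(rows)
    return out

def conv1x1(fmap, weights):
    """Pointwise convolution; weights[o][c] maps input channel c to output o."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(w_o[c] * fmap[c][i][j] for c in range(len(fmap)))
              for j in range(width)]
             for i in range(height)]
            for w_o in weights]

def add(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]

def resolution_merge(low, high, w_up, w_out):
    """Merge a low resolution feature map into a high resolution one."""
    up = conv1x1(upsample_nearest(low), w_up)  # upsample, then convolve
    summed = add(up, high)                     # element-wise addition
    stacked = summed + high                    # concatenate along channels
    return conv1x1(stacked, w_out)             # final convolution (1x1 here)
```

Here `w_up` must map the channels of `low` to the channels of `high`, and `w_out` takes twice the channels of `high` as input because of the concatenation.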
ERF (effective receptive field) size computation. Neurons in each layer of a typical bottom-up network have a single ERF size, determined by the kernel size and the cumulative stride (given the stride at each layer).
Assuming only convolutions with stride 1, the example architecture in Figure 2 will have an ERF size of 3 pixels and 18 pixels in each direction after the first and final convolutional layers, respectively. In contrast, for the TD network, because of the width of the Gaussian blurring window, the lowest resolution branch already has a large ERF size at the input level, which grows further after the first convolutional layer (comparable to that of the final layer of a BU network). Furthermore, in contrast to BU, outputs from neurons with varying ERFs are propagated through the merging points. To get a lower bound on the ERF sizes, we consider resolution merging methods which do not provide RF enlargement (e.g. as depicted in Figure 3, but without the convolution at the end). Thus, at the final merging point of the TD architecture, an ERF size of 3 pixels is merged with the larger ERF sizes of the lower resolution branches. In conclusion, already from the first layer, TD has the ERF size that the BU only obtains at the last layer.
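For a plain feed-forward stack, the dependence of the receptive field on kernel size and cumulative stride can be illustrated with the standard recurrence below. This is a sketch of the usual theoretical receptive field computation, not necessarily the exact formula used in the paper; the symbols `r`, `k`, `m`, `s` and the function name are assumptions introduced here.

```python
def receptive_field(layers):
    """Return the receptive field size after each layer.

    layers: list of (kernel_size, stride) pairs, ordered input to output.
    Recurrence: r_i = r_{i-1} + (k_i - 1) * m_{i-1},  m_i = m_{i-1} * s_i,
    with r_0 = 1 and m_0 = 1 (a single input pixel, no striding yet).
    """
    size, cumulative_stride = 1, 1
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * cumulative_stride
        cumulative_stride *= stride
        sizes.append(size)
    return sizes
```

For example, `receptive_field([(3, 1), (2, 2), (3, 1)])` models a 3x3 convolution, a 2x2 pooling with stride 2 and another 3x3 convolution: the last convolution adds twice as many input pixels per kernel tap because the cumulative stride has doubled.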
The feature extraction pathway of the TD network reverses that of the BU: information propagates from lower to higher spatial dimensions in a TD network, while the number of filters shrinks with increasing depth. The choice of expanding the number of filters at deeper layers in the BU network is efficiency-oriented. As the feature map resolution decreases, the number of channels increases, keeping the computational complexity roughly fixed per layer. Typically, in standard architectures the number of filters is doubled every time the spatial dimensions are halved [12, 38].
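The efficiency argument can be checked with a small count of multiply-accumulate operations. The sketch below assumes stride-1 convolutions with equal input and output channel counts per stage and a 3x3 kernel; these choices and the function names are illustrative assumptions, not details taken from the paper.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """Multiply-accumulate count of one stride-1 'same' convolution."""
    return height * width * c_in * c_out * kernel * kernel

def per_layer_macs(height, width, channels, stages):
    """Cost per stage when resolution halves and channels double each stage."""
    costs = []
    for _ in range(stages):
        costs.append(conv_macs(height, width, channels, channels))
        height, width, channels = height // 2, width // 2, channels * 2
    return costs
```

Halving both spatial dimensions divides the number of positions by four, while doubling the channels multiplies the per-position cost by four, so every entry of `per_layer_macs(32, 32, 16, 3)` is the same.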
In our method we consider three options for deciding the number of filters per layer: the TD model, which is exactly the opposite of the BU in that the number of channels is reduced with depth; the uniform model, where the layers have a uniform number of filters; and the reversed model, which follows the BU filter arrangement, with the channel dimension widened with depth.
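As a small illustration of the three options, the helper below derives per-layer filter counts from a bottom-up baseline. The dictionary keys, the function name and the `uniform_width` parameter are hypothetical; in particular, the text does not say which width the uniform model uses, so it is left as an explicit argument.

```python
def filter_arrangements(bu_filters, uniform_width):
    """Per-layer filter counts for three top-down variants.

    bu_filters: filter counts of the bottom-up baseline, input to output.
    uniform_width: assumed width shared by every layer of the uniform model.
    """
    return {
        # Mirror of the baseline: channels shrink with depth.
        "mirrored": list(reversed(bu_filters)),
        # Same number of filters in every layer.
        "uniform": [uniform_width] * len(bu_filters),
        # Follows the bottom-up arrangement: channels widen with depth.
        "reversed": list(bu_filters),
    }
```

For a baseline with `[32, 64, 128]` filters, the mirrored variant uses `[128, 64, 32]`, while the reversed variant keeps `[32, 64, 128]` even though the resolution now grows with depth.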
In Exp 1 we evaluate the three different filter arrangement options proposed for the top-down model. We compare these model variations with the bottom-up baseline on the MNIST, Fashion-MNIST and CIFAR10 classification tasks. In Exp 2 we evaluate the robustness of our proposed model against various adversarial attacks applied on the same datasets. Finally, in Exp 3 we illustrate the explainability capabilities of our top-down model when compared to the bottom-up, and demonstrate its benefits for a small object localization task.
Experimental setup. We compare our proposal with its respective baseline on MNIST, Fashion-MNIST and CIFAR10. For the simpler MNIST tasks we consider as baselines the “LeNetFC”, a fully-convolutional variant of LeNet, and, following prior work, a lightweight version of the NIN (Network-In-Network) architecture, namely “NIN-light”, with reduced filters. The original NIN architecture was used for the CIFAR10 task, along with the ResNet32 incorporating the pre-activation unit. [17] is used in all the networks prior to the non-linearities. The corresponding TD networks are defined based on their BU baselines. Table 1 depicts the number of parameters of different models. For TD we consider three variants: the TD variant, which mirrors the BU architecture also in terms of filter depth; the uniform variant, using uniform filter depth; and the reversed variant, where the filter depth of the TD is reversed, thus following the filter depth of the BU. There is an increase in the number of parameters for the TD networks, because we need additional convolutional layers for merging the high and low resolution information.
We abide by the setup found in the initial publications for the BU models. For the TD networks we performed a linear search for learning rate, batch size, and weight decay. For all cases we train with a 90/10 train/val split, using SGD with momentum of 0.9 and a 3-stage learning rate decay scheme, dividing the learning rate by 10 at two fixed fractions of the total number of epochs. For the CIFAR10 dataset we test with and without augmentation, employing horizontal translations and flips. We repeat runs four times, with dataset reshuffling and extracting new training and validation splits, and report mean and standard deviation of the test accuracy.
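A staged decay of this kind can be written as a small schedule function. The sketch below is an assumption about the general shape of the scheme only: the milestone fractions are not recoverable from the text, so they are a required argument, and the function name is hypothetical.

```python
def step_decay(base_lr, epoch, total_epochs, milestones, factor=10.0):
    """Learning rate at `epoch` (0-based) under a staged decay.

    milestones: fractions of total_epochs at which the rate is divided by
    `factor`; two milestones give a 3-stage schedule.
    """
    drops = sum(1 for m in milestones if epoch >= m * total_epochs)
    return base_lr / (factor ** drops)
```

For instance, with placeholder milestones `(0.5, 0.75)` and 100 epochs, the rate stays at its base value until epoch 49, drops by a factor of 10 at epoch 50 and by another factor of 10 at epoch 75.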
Figure 4 shows the test accuracy of the considered models across datasets. The TD networks are on par with, and in some cases surpass, the corresponding baseline performance. When considering the different filter depth configurations, the reversed variant performs best due to increased representational power at higher scales, though at the cost of increased complexity. The NIN architecture adopts a close to uniform filter arrangement, hence the three variants reach roughly the same performance. We adopt the plain TD variants henceforth, on account of the small gap in performance and reduced complexity. This experiment provides empirical evidence of the applicability of the proposed pipeline to different network architectures.
We evaluate the robustness of BU versus TD against various attacks, where we attack the test set of each dataset using Foolbox. For all the attacks, the default parameters were used. To make the attack bound tighter, we repeat each attack three times and keep the worst case for each sample to define the minimum required perturbation for fooling the network.
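The worst-case aggregation over repeated attack runs can be sketched independently of any attack library. The code below assumes that each run yields one perturbation distance per test sample, with `math.inf` marking a sample the run failed to fool, and that all samples were correctly classified before the attack; these conventions and the function names are assumptions made for illustration and do not reflect the Foolbox API.

```python
import math

def minimal_perturbation(distances_per_repeat):
    """Per-sample smallest adversarial distance across repeated attack runs.

    distances_per_repeat: one list per run; entry i is the distance of the
    adversarial example found for sample i, or math.inf if the run failed.
    """
    return [min(per_sample) for per_sample in zip(*distances_per_repeat)]

def accuracy_loss_curve(min_distances, budgets):
    """Fraction of samples fooled within each perturbation budget."""
    total = len(min_distances)
    return [sum(1 for d in min_distances if d <= b) / total for b in budgets]
```

Plotting the second function's output against the budgets gives a curve of accuracy loss versus perturbation distance of the kind discussed in the robustness experiment.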
Figure 6 provides, for each attack, plots of the loss in test accuracy versus the distance between the original and the perturbed input. TD networks are visibly more resilient against attacks introducing uncorrelated noise, due to the coarse-to-fine processing adopted, with downscaled inputs diminishing the noise. For attacks introducing correlated noise such as the “Pointwise” attack, the perturbed pixels tend to lie in smooth regions of the image. Thus each single pixel value of 0 (or 1) in a region of 1s (or 0s) essentially acts as a Dirac delta function. Based on the convolutional nature of CNNs, this type of attack “pollutes” the input with imprints of the learned filters (for an imperfect delta function, this yields blurred versions of the filters), which gradually span a greater part of the feature map as more convolutions are applied. Due to the highly correlated nature of the perturbation, the blurred downsampling cannot completely eradicate the noise, but helps decrease the introduced pollution. In contrast, for BU networks, the noise is directly propagated down the network. Additionally, the blurred downsampling wired into the TD network architecture offers enhanced robustness against blurring attacks, as the network encounters the input image at multiple scales during training, and is thus more resilient to resolution changes. Since anti-aliasing before downsampling is suggested to better preserve shift-invariance, we expected our networks to also be more robust against the “Spatial” attack. However, no enhanced robustness is reported for TD networks; a substantial difference in robustness is observed for ResNet32, which could be due to the performance gap measured in Exp 1 between the TD and its BU baseline. We also tested with the uniform and reversed variants of the ResNet32 architecture, with respective results provided in the appendix.
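The delta-function argument can be made concrete with a toy example: a single perturbed pixel on a flat background, passed through one convolutional filter, leaves a copy of that filter in the output. The sketch below uses a stride-1 cross-correlation with zero padding, as commonly implemented in CNN layers, so the imprint appears spatially flipped; the kernel values and names are arbitrary illustrations.

```python
def correlate2d_same(image, kernel):
    """Stride-1 cross-correlation with zero padding, as used in CNN layers."""
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for a, kernel_row in enumerate(kernel):
                for b, weight in enumerate(kernel_row):
                    y, x = i + a - radius, j + b - radius
                    if 0 <= y < height and 0 <= x < width:
                        acc += weight * image[y][x]
            out[i][j] = acc
    return out

# A single perturbed pixel on a flat background acts as a discrete delta.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
kernel = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]
response = correlate2d_same(impulse, kernel)
imprint = [row[1:4] for row in response[1:4]]  # 3x3 patch around the impulse
```

Here `imprint` contains the kernel rotated by 180 degrees, and every output value outside that patch is zero. Stacking further convolutions spreads this imprint over a growing region, which is the "pollution" effect described above.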
To get a better insight on robustness, we introduce the generated attacks to a single resolution branch of the TD networks using the NIN-light architecture on MNIST and Fashion-MNIST, and NIN on CIFAR10. This is displayed in Figure 5. We feed the extracted perturbations to either the low, medium or high resolution input branch, as illustrated in the TD model architecture in Figure 2. For the simpler MNIST task, the medium-resolution input of the TD network is the most vulnerable, which is mainly attributed to the absence of information in the high frequency region of the input’s spectrum. Moving to the more challenging Fashion-MNIST and CIFAR10 tasks, the high frequency input becomes the easiest path for fooling the network. Please see the appendix for additional results when perturbing two inputs simultaneously.
(a) Grad-CAM heatmap visualizations. Grad-CAM provides class-discriminative localization maps, based on the feature maps of a convolutional layer, highlighting the most informative features for the classification task. Here, we use the features of the last convolutional layer. The extracted heatmap is restored to the original image scale, thus producing a coarser map in the case of the BU, whose feature map size at the final layer is smaller. In contrast, for the TD the corresponding scale of the feature maps matches the scale of the input, hence Grad-CAM outputs a finer map.
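For reference, the core Grad-CAM computation can be sketched as follows: each feature map is weighted by the spatial average of the gradient of the class score with respect to it, the weighted maps are summed, and a ReLU is applied. The sketch assumes the activations and gradients have already been extracted from a trained network as `[K][H][W]` nested lists; obtaining them depends on the framework and is not shown, and the function name is an assumption.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from one layer's feature maps.

    activations, gradients: [K][H][W] nested lists holding the feature maps
    and the gradient of the class score with respect to them.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in grad) / (height * width)
               for grad in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * fmap[i][j]
    # ReLU keeps only positive evidence for the class.
    return [[max(value, 0.0) for value in row] for row in heatmap]
```

The heatmap has the spatial size of the chosen layer, which is why a layer at input resolution yields a finer map than a heavily downsampled one that must be upscaled afterwards.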
The Grad-CAM heatmaps corresponding to a BU and a TD network are provided in Figure 7. These are obtained from various layers of a ResNet18 architecture trained on the Imagenette dataset. For further information about the setup please refer to the appendix. “Layer 1” corresponds to the activation of the input to the first group of residual blocks, and “Layer 2” to “Layer 5” to the activations of the output of each of these four groups, each one corresponding to a different spatial resolution. The visualizations demonstrate that TD follows an opposite, coarse-to-fine path, starting from a coarser representation and gradually enriching it with higher frequency information. Hence, TD networks mirror the BU not only in the architectural design, but also in their learning process.
Additional heatmaps corresponding to correctly classified images, taken from the last convolutional layer of the networks, are visualized in Figure 8. The figures depict the coarse localization in BU versus the fine localization in TD. We intentionally selected images with multiple objects. The TD networks recognize objects based on fine-grained information, such as the spots on the dog, the cross on the church or boundary information of various objects.
(b) Weakly-supervised object localization. For a quantitative evaluation of the localization abilities of TD, we used the MNIST and Fashion-MNIST datasets and the NIN-light model as a backbone architecture. Figure 9 shows mean precision and recall scores for the BU and TD models over four runs. For each run, models were trained from scratch, then TP (true positive), FP (false positive) and FN (false negative) values were computed between the Grad-CAM heatmaps and the thresholded objects, corresponding to the test set of the considered task. We used an empirically set threshold. Based on the computed values, precision and recall scores were extracted and aggregated over the four runs. For a fair comparison, only the samples correctly classified by both BU and TD were considered. The TD models report higher precision for both tasks considered, suggesting finer object localization. The lower recall scores for Fashion-MNIST are attributed to the higher number of FN compared to the BU model. The larger object sizes of the Fashion-MNIST task, along with the coarse output of the BU model, which is able to capture a greater extent of them, lead to fewer FN. In contrast, the TD models focus on finer aspects of the objects, which are informative for the classification task. Considering the fine-grained focus in the Grad-CAM outputs and the potential for weakly-supervised object localization, TD networks comprise a promising direction for future research.
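A pixel-wise version of this precision and recall computation is sketched below. It assumes that both the Grad-CAM heatmap and the input image are binarized by thresholding and compared pixel by pixel; the two threshold parameters and the function name are assumptions, since the threshold value used in the experiments is not recoverable from the text.

```python
def precision_recall(heatmap, image, heat_threshold, object_threshold):
    """Pixel-wise precision and recall of a heatmap against an object mask."""
    tp = fp = fn = 0
    for heat_row, image_row in zip(heatmap, image):
        for heat, pixel in zip(heat_row, image_row):
            predicted = heat >= heat_threshold   # heatmap marks this pixel
            actual = pixel >= object_threshold   # pixel belongs to the object
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A heatmap that highlights only a small, correct part of a large object scores high precision but low recall, which matches the trade-off discussed for Fashion-MNIST.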
The current work aims at providing a fresh perspective on the architecture of CNNs, which is currently taken for granted. The coarse-to-fine pathway is biologically inspired by how humans perceive visual information: first understanding the context and then filling in the salient details.
One downside of our proposed networks is that expanding dimensions at increased network depth leads to memory and computational bottlenecks. This is due to the feature map size being larger at higher depths. Moreover, for the same reason, adding fully-connected layers before the output layer of the architectures leads to a vast increase in the number of model parameters. Hence, fully convolutional networks are preferable. This increase in memory is also more visible with large-scale datasets such as ImageNet. A simple workaround requiring no architectural adaptations would be to employ mixed-precision training, which would decrease the memory requirements, but would increase the computational complexity. Instead of increasing the spatial resolution of the feature maps at later depths, we could use patches of the input of limited sizes. The selection of these informative patches could be defined using the Grad-CAM heatmaps by selecting the high-activation areas of the heatmap, or considering self-attention mechanisms. In addition to addressing the aforementioned limitations, we find the weakly-supervised setting to be a promising area of future research.
In the current work, we revisit the architecture of conventional CNNs, aiming at diverging from the manner in which resolution is typically processed in deep networks. We propose novel network architectures which reverse the resolution processing of standard CNNs. The proposed paradigm adopts a coarse-to-fine information processing pathway, starting from the low resolution information, providing the visual context, and subsequently adding back the high frequency information. We empirically demonstrate the applicability of our proposed architectures when starting from a range of baseline architectures, and considering multiple visual recognition tasks. TD networks exhibit enhanced robustness against certain types of adversarial attacks. This resistance to adversarial attacks is induced directly by the network design choices. Additionally, the high spatial dimensions of the feature maps in the last layer significantly enhance the explainability of the model, and demonstrate potential for weakly-supervised object localization tasks.
Association for the Advancement of Artificial Intelligence (AAAI).
Foolbox: a Python toolbox to benchmark the robustness of machine learning models. CoRR.
Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition.
The entire set of results for the adversarial robustness experiment is provided in Figure 10. “ShiftsAttack” is a variant of the Spatial attack, introducing only spatial shifts. TD networks exhibit enhanced robustness against attacks introducing correlated/uncorrelated noise, as well as against blurring attacks.
Figure 11 presents the robustness results for the CIFAR10-augmented and the ResNet32 architecture variants. Clearly, the uniform and reversed variants exhibit enhanced robustness against spatial attacks; however, they also behave similarly to the BU against other attacks. This can be attributed to the increased number of filters at greater depth of the network, or equivalently at increased scale of the feature maps, and thus to the greater contribution of the finer scales to the final output. However, finer scales are much more vulnerable against attacks. All in all, the reversal of the BU network for the extraction of the TD variant is not solely efficiency driven, keeping a roughly fixed computational complexity across layers, but also contributes to the network’s robustness. Finally, we need to mention that the respective figure for the non-augmented CIFAR10 case tells the same story.
Next, Figure 12 presents the respective results for reintroducing the perturbation to two of the inputs of the TD network. Clearly, the highest and medium scale inputs are the most vulnerable ones, except for the simpler case of the MNIST dataset. There, the absence or scarcity of information in the high frequency region makes the medium and lowest scale inputs the ones with the highest impact.
Imagenette is a 10-class sub-problem of ImageNet, allowing experimentation with a more realistic task, without the high training times and computational costs required for training on a large scale dataset. A set of examples, along with their corresponding labels, is provided in Figure 13. The dataset contains a total of 9469 training and 3925 validation samples. Training samples were resized, after which random crops were extracted; validation samples were also resized.
We utilized a lighter version of the ResNet18 architecture (dividing the filters of the original architecture by 2) for Imagenette training, as this is a 10-class sub-problem, incorporating the pre-activation unit. Additionally, the stride and the kernel extent of the first convolution for depth initialization were set to and respectively. Regarding training, a crop is extracted from the original image, or its horizontal flip, while subtracting the per-pixel mean; a color augmentation is also used. For the BU network a batch size of 128 is used and the network is trained for a total of 50 epochs with a starting learning rate of 0.1. As for the TD, the increased memory footprint led to the reduction of the batch size to 64 and the adaptation of the starting learning rate, with the total number of epochs set to 80. We trained with SGD with momentum of 0.9 and a weight decay of 0.001; we also adopted a 3-stage learning rate decay scheme, where the learning rate is divided by 10 at two fixed fractions of the total epochs. Regarding performance, outperformed the variant by roughly . Grad-CAM is finally utilized for generating class-discriminative localization maps of the most informative features.
Figure 14 displays some additional Grad-CAM visualizations. The visualizations are obtained by using a ResNet18 architecture for the BU networks and its corresponding TD variant. The original images are taken from the Imagenette dataset.
The TD model provides localized activations, focusing on certain informative aspects of the image, while the BU model focuses on large connected areas. Because of this difference we believe the TD model may be more precise than the BU model for tasks such as weakly-supervised object detection.