Spatially Adaptive Computation Time for Residual Networks

12/07/2016 ∙ by Michael Figurnov, et al. ∙ Google ∙ Carnegie Mellon University

This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the visual saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.




1 Introduction

Deep convolutional networks have gained wide adoption in image classification [24, 39, 40] due to their exceptional accuracy. In recent years deep convolutional networks have become an integral part of state-of-the-art systems for a diverse set of computer vision problems such as object detection [35], image segmentation [33], image-to-text [23, 43], visual question answering [11] and image generation [9]. They have also been shown to be surprisingly effective in non-vision domains, e.g. natural language processing [45] and analyzing the board in the game of Go [38].

A major drawback of deep convolutional networks is their huge computational cost. A natural way to tackle this issue is to use attention to guide the computation, which is similar to how biological vision systems operate [36]. Glimpse-based attention models [27, 34, 2, 21] assume that the problem at hand can be solved by carefully processing a small number of typically rectangular sub-regions of the image. This makes such models unsuitable for multi-output problems (generating box proposals in object detection) and per-pixel prediction problems (image segmentation, image generation). Additionally, choosing the glimpse positions requires designing a separate prediction network or a heuristic procedure [1]. On the other hand, soft spatial attention models [43, 37] do not save computation, since they require evaluating the model at all spatial positions to choose per-position attention weights.

Figure 1: Left: object detections. Right: feature extractor SACT ponder cost (computation time) map for a COCO validation image. The proposed method learns to allocate more computation for the object-like regions of the image.

We build upon the Adaptive Computation Time (ACT) [12] mechanism, which was recently proposed for Recurrent Neural Networks (RNNs). We show that ACT can be applied to dynamically choose the number of evaluated layers in a Residual Network [16, 17] (the similarity between Residual Networks and RNNs was explored in [30, 13]). Next, we propose Spatially Adaptive Computation Time (SACT), which adapts the amount of computation between spatial positions. While we use the SACT mechanism for Residual Networks, it can potentially be used for convolutional LSTM [42] models for video processing [29].

SACT is an end-to-end trainable architecture that incorporates attention into Residual Networks. It learns a deterministic policy that stops computation in a spatial position as soon as the features become “good enough”. Since SACT maintains the alignment between the image and the feature maps, it is well-suited for a wide range of computer vision problems, including multi-output and per-pixel prediction problems.

We evaluate the proposed models on the ImageNet classification problem [8] and find that SACT outperforms both ACT and non-adaptive baselines. Then, we use SACT as a feature extractor in the Faster R-CNN object detection pipeline [35] and demonstrate results on the challenging COCO dataset [32]. Example detections and a ponder cost (computation time) map are presented in fig. 1. SACT achieves a significantly better FLOPs-quality trade-off than the non-adaptive ResNet model. Finally, we demonstrate that the obtained computation time maps are well-correlated with human eye fixation positions, suggesting that a reasonable attention model arises in the model automatically without any explicit supervision.

2 Method

We begin by outlining the recently proposed deep convolutional model Residual Network (ResNet) [16, 17]. Then, we present Adaptive Computation Time, a model which adaptively chooses the number of residual units in ResNet. Finally, we show how this idea can be applied at the spatial position level to obtain the Spatially Adaptive Computation Time model.

2.1 Residual Network

Figure 2: Residual Network (ResNet) with 101 convolutional layers. Each residual unit contains three convolutional layers. We apply Adaptive Computation Time to each block of ResNet to learn an image-dependent policy of stopping the computation.

We first describe the ResNet-101 ImageNet classification architecture (fig. 2). It has been extended to object detection [16, 7] and image segmentation [6] problems. The models we propose are general and can be applied to any ResNet architecture. The first two layers of ResNet-101 are a convolution and a max-pooling layer, which together have a total stride of four. Then, a sequence of four blocks is stacked together, each block consisting of multiple stacked residual units. The four blocks of ResNet-101 contain 3, 4, 23 and 3 units, respectively. A residual unit has the form $F(\mathbf{x}) = \mathbf{x} + f(\mathbf{x})$, where the first term is called a shortcut connection and the second term is a residual function. A residual function consists of three convolutional layers: a $1 \times 1$ layer that reduces the number of channels, a $3 \times 3$ layer that has an equal number of input and output channels, and a $1 \times 1$ layer that restores the number of channels. We use pre-activation ResNet [17], in which each convolutional layer is preceded by batch normalization and a ReLU non-linearity. The first units in blocks 2-4 have a stride of 2 and increase the number of output channels by a factor of 2. All other units have equal input and output dimensions. This design choice follows Very Deep Networks [39] and ensures that all units in the network have an equal computational cost (except for the first units of blocks 2-4, which have a slightly higher cost).
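As a quick sanity check of the architecture description above, the following sketch counts the layers and the total stride implied by the text. The counting convention (one initial convolution, three convolutions per residual unit, one final fully-connected layer) is an assumption made here for illustration; the text itself only gives the per-block unit counts and the strides.

```python
# Units per block for ResNet-101, as stated in the text.
UNITS_PER_BLOCK = [3, 4, 23, 3]
CONVS_PER_UNIT = 3  # 1x1 reduce -> 3x3 -> 1x1 restore


def count_weight_layers(units_per_block, convs_per_unit=CONVS_PER_UNIT):
    """Count layers under the assumed convention:
    initial convolution + residual-unit convolutions + final FC layer."""
    initial_conv = 1
    final_fc = 1
    return initial_conv + convs_per_unit * sum(units_per_block) + final_fc


total_units = sum(UNITS_PER_BLOCK)
total_layers = count_weight_layers(UNITS_PER_BLOCK)

# The first two layers have a total stride of 4; the first units of
# blocks 2, 3 and 4 each add a stride of 2.
total_stride = 4 * 2 ** 3

print(total_units, total_layers, total_stride)
```

Under this convention the 33 residual units account for the "101" in the model name, and the final feature map is 32 times smaller than the input along each spatial dimension.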

Finally, the obtained feature map is passed through a global average pooling layer [31] and a fully-connected layer that outputs the logits of class probabilities. The global average pooling ensures that the network is fully convolutional, meaning that it can be applied to images of varying resolutions without changing the network’s parameters.

2.2 Adaptive Computation Time

Figure 3: Adaptive Computation Time (ACT) for one block of residual units. The computation halts as soon as the cumulative sum of the halting score reaches 1. The remainder $R$, the number of evaluated units $N$ and the ponder cost $\rho$ are indicated for the example shown. See alg. 1. ACT provides a deterministic and end-to-end learnable policy of choosing the amount of computation.

Let us first informally explain Adaptive Computation Time (ACT) before describing it in more detail and providing an algorithm. We add a branch to the outputs of each residual unit which predicts a halting score, a scalar value in the range $[0, 1]$. The residual units and the halting scores are evaluated sequentially, as shown in fig. 3. As soon as the cumulative sum of the halting score reaches one, all following residual units in this block will be skipped. We set the halting distribution to be the evaluated halting scores with the last value replaced by a remainder. This ensures that the distribution over the values of the halting scores sums to one. The output of the block is then re-defined as a weighted sum of the outputs of residual units, where the weight of each unit is given by the corresponding probability value. Finally, a ponder cost is introduced that is the number of evaluated residual units plus the remainder value. Minimizing the ponder cost increases the halting scores of the non-last residual units, making it more likely that the computation stops earlier. The ponder cost is then multiplied by a constant $\tau$ and added to the original loss function. ACT is applied to each block of ResNet independently, with the ponder costs summed.

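Formally, we consider a block of $L$ residual units (boldface denotes tensors of shape Height $\times$ Width $\times$ Channels):

$$\mathbf{x}^0 = \text{input}, \tag{1}$$
$$\mathbf{x}^l = F^l(\mathbf{x}^{l-1}) = \mathbf{x}^{l-1} + f^l(\mathbf{x}^{l-1}), \quad l = 1 \dots L, \tag{2}$$
$$\text{output} = \mathbf{x}^L. \tag{3}$$

We introduce a halting score $h^l \in [0, 1]$ for each residual unit. We define $h^L = 1$ to enforce stopping after the last unit.

$$h^l = H^l(\mathbf{x}^l), \quad l = 1 \dots (L-1), \tag{4}$$
$$h^L = 1. \tag{5}$$

We choose the halting score function to be a simple linear model on top of the pooled features:

$$h^l = H^l(\mathbf{x}^l) = \sigma(W^l \, \text{pool}(\mathbf{x}^l) + b^l), \tag{6}$$

where $\text{pool}$ is a global average pooling and $\sigma(t) = \frac{1}{1 + \exp(-t)}$.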

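Next, we determine $N$, the number of residual units to evaluate, as the index of the first unit where the cumulative halting score exceeds $1 - \varepsilon$:

$$N = \min \Big\{ n \in \{1 \dots L\} : \sum_{l=1}^{n} h^l \geq 1 - \varepsilon \Big\}, \tag{7}$$

where $\varepsilon$ is a small constant (e.g., 0.01) that ensures that $N$ can be equal to 1 (the computation stops after the first unit) even though $h^1$ is an output of a sigmoid function, meaning that $h^1 < 1$.

Additionally, we define the remainder $R$:

$$R = 1 - \sum_{l=1}^{N-1} h^l. \tag{8}$$

Due to the definition of $N$ in eqn. (7), we have $0 \leq R \leq 1$.

We next transform the halting scores into a halting distribution, which is a discrete distribution over the residual units. Its property is that all the units starting from the $(N+1)$-st have zero probability:

$$p^l = \begin{cases} h^l & \text{if } l < N, \\ R & \text{if } l = N, \\ 0 & \text{if } l > N. \end{cases} \tag{9}$$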
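The definitions of the number of evaluated units, the remainder and the halting distribution can be checked on a small example. The sketch below is illustrative only: the function name `halting_distribution` and the example halting scores are made up, and the last score is overwritten with 1 as the text prescribes.

```python
def halting_distribution(h, eps=0.01):
    """Return (N, R, p) for the halting scores of one block.

    h[l - 1] plays the role of the l-th halting score. The last score is
    forced to 1, which guarantees that the computation stops in the block.
    """
    scores = list(h[:-1]) + [1.0]
    cumulative = 0.0
    for n, score in enumerate(scores, start=1):
        cumulative += score
        if cumulative >= 1.0 - eps:
            break  # n is the first index where the threshold is reached
    remainder = 1.0 - sum(scores[:n - 1])
    # Halting distribution: scores before N, remainder at N, zeros after N.
    p = scores[:n - 1] + [remainder] + [0.0] * (len(scores) - n)
    return n, remainder, p


# Hypothetical halting scores for a block of five units.
N, R, p = halting_distribution([0.1, 0.3, 0.5, 0.4, 0.2])
print(N, R, p, sum(p))
```

Here the cumulative score is 0.9 after three units and crosses the threshold at the fourth, so four units are evaluated with a remainder of about 0.1, and the distribution sums to one up to floating-point error.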

Input: 3D tensor input
Input: number of residual units in the block L
Input: 0 < ε < 1
Output: 3D tensor output
Output: ponder cost ρ
x = input
c = 0    ▷ Cumulative halting score
R = 1    ▷ Remainder value
output = 0    ▷ Output of the block
ρ = 0
for l = 1 ... L do
     x = F^l(x)
     if l < L then h = H^l(x)
     else h = 1
     end if
     c += h
     ρ += 1
     if c < 1 − ε then
          output += h · x
          R −= h
     else
          output += R · x
          ρ += R
          break
     end if
end for
return output, ρ
Algorithm 1 Adaptive Computation Time for one block of residual units. ACT does not require storing the intermediate residual unit outputs.
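A minimal Python sketch of ACT for one block may help to see how the output is accumulated on the fly. It is not the authors' implementation: features are a flat list of floats instead of a 3D tensor, and the toy residual and halting functions (`residual_fns`, `halting_fns`) are hypothetical stand-ins for the learned layers.

```python
def act_block(x, residual_fns, halting_fns, eps=0.01):
    """Run one ACT block on a feature vector x (a list of floats).

    residual_fns[l](x) returns the residual f(x) of unit l;
    halting_fns[l](x) returns a halting score in [0, 1].
    Returns the weighted block output and the ponder cost.
    """
    num_units = len(residual_fns)
    cumulative = 0.0   # cumulative halting score
    remainder = 1.0    # remainder value
    output = [0.0] * len(x)
    ponder = 0.0
    for l in range(num_units):
        x = [xi + fi for xi, fi in zip(x, residual_fns[l](x))]
        # The last unit always halts.
        h = halting_fns[l](x) if l < num_units - 1 else 1.0
        cumulative += h
        ponder += 1.0
        if cumulative < 1.0 - eps:
            output = [o + h * xi for o, xi in zip(output, x)]
            remainder -= h
        else:
            output = [o + remainder * xi for o, xi in zip(output, x)]
            ponder += remainder
            break
    return output, ponder


# Toy block: every unit adds 1 to each feature, every halting score is 0.4.
residual_fns = [lambda x: [1.0 for _ in x]] * 5
halting_fns = [lambda x: 0.4] * 5
out, rho = act_block([0.0, 10.0], residual_fns, halting_fns)
print(out, rho)
```

In this toy run the block halts after three of the five units, and only the running output, the cumulative score and the remainder are kept in memory between units.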

The output of the block is now defined as the outputs of residual units weighted by the halting distribution. Since representations of residual units are compatible with each other [19, 13], the weighted average also produces a feature representation of the same type. The values of $\mathbf{x}^{N+1}, \dots, \mathbf{x}^L$ have zero weight and therefore their evaluation can be skipped:

$$\text{output} = \sum_{l=1}^{L} p^l \mathbf{x}^l = \sum_{l=1}^{N} p^l \mathbf{x}^l. \tag{10}$$

Ideally, we would like to directly minimize the number of evaluated units $N$. However, $N$ is a piecewise constant function of the halting scores that cannot be optimized with gradient descent. Instead, we introduce the ponder cost $\rho$, an almost everywhere differentiable upper bound on the number of evaluated units $N$ (recall that $R \geq 0$):

$$\rho = N + R. \tag{11}$$

When differentiating $\rho$, we ignore the gradient of $N$. Also, note that $R$ is not a continuous function of the halting scores [26]. The discontinuities happen in the configurations of halting scores where $N$ changes value. Following [12], we ignore these discontinuities and find that they do not impede training. Algorithm 1 shows the description of ACT.

The partial derivative of the ponder cost w.r.t. a halting score $h^l$ is

$$\frac{\partial \rho}{\partial h^l} = \begin{cases} -1 & \text{if } l < N, \\ 0 & \text{if } l \geq N. \end{cases} \tag{12}$$

Therefore, minimizing the ponder cost increases $h^1, \dots, h^{N-1}$, making the computation stop earlier. This effect is balanced by the original loss function $\mathcal{L}$, which also depends on the halting scores via the block output, eqn. (10). Intuitively, the more residual units are used, the better the output, so minimizing $\mathcal{L}$ usually increases the weight $R$ of the last used unit’s output $\mathbf{x}^N$, which in turn decreases $h^1, \dots, h^{N-1}$.
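The sign pattern of this derivative can be checked numerically. The sketch below uses finite differences on made-up halting scores chosen away from the discontinuities, so the number of evaluated units does not change under a small perturbation; the function names are illustrative.

```python
def ponder_cost(h, eps=0.01):
    """Ponder cost (evaluated units plus remainder) for halting scores h.

    The last score is treated as 1, as in the definition of ACT.
    """
    scores = list(h[:-1]) + [1.0]
    cumulative = 0.0
    for n, score in enumerate(scores, start=1):
        cumulative += score
        if cumulative >= 1.0 - eps:
            break
    remainder = 1.0 - sum(scores[:n - 1])
    return n + remainder


def finite_difference_grad(h, delta=1e-6):
    """Approximate the partial derivatives of the ponder cost."""
    base = ponder_cost(h)
    grads = []
    for l in range(len(h)):
        bumped = list(h)
        bumped[l] += delta
        grads.append((ponder_cost(bumped) - base) / delta)
    return grads


# Hypothetical scores: the block halts at the fourth unit.
h = [0.1, 0.3, 0.5, 0.4, 0.2]
grads = finite_difference_grad(h)
print([round(g, 3) for g in grads])
```

The first three entries come out close to -1 and the remaining ones are 0, so a gradient step that lowers the ponder cost raises the scores of the units before the halting unit.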

ACT has several important advantages. First, it adds very few parameters and computation to the base model. Second, it allows the output of the block to be calculated “on the fly” without storing all the intermediate residual unit outputs and halting scores in memory. For example, this would not be possible if the halting distribution were a softmax of halting scores, as done in soft attention [43]. Third, we can recover a block with any constant number of units $l \leq L$ by setting $h^1 = \dots = h^{l-1} = 0$, $h^l = 1$. Therefore, ACT is a strict generalization of standard ResNet.

We apply ACT to each block independently and then stack the obtained blocks as in the original ResNet. The input of the next block becomes the weighted average of the residual units from the previous block, eqn. (10). A similar connectivity pattern has been explored in [18]. We add the sum of the ponder costs $\rho_k$, $k = 1 \dots K$, from the $K$ blocks to the original loss function $\mathcal{L}$:

$$\mathcal{L}' = \mathcal{L} + \tau \sum_{k=1}^{K} \rho_k. \tag{13}$$

The resulting loss function $\mathcal{L}'$ is differentiable and can be optimized using conventional backpropagation. $\tau \geq 0$ is a regularization coefficient which controls the trade-off between optimizing the original loss function and the ponder cost.

2.3 Spatially Adaptive Computation Time

In this section, we present Spatially Adaptive Computation Time (SACT). We adjust the per-position amount of computation by applying ACT to each spatial position of the block, as shown in fig. 4. As we show in the experiments, SACT can learn to focus the computation on the regions of interest.

Figure 4: Spatially Adaptive Computation Time (SACT) for one block of residual units. We apply ACT to each spatial position of the block. As soon as a position’s cumulative halting score reaches 1, we mark it as inactive. See alg. 2. SACT learns to choose the appropriate amount of computation for each spatial position in the block.
Figure 5: Residual unit with active and inactive positions in SACT. This transformation can be implemented efficiently using the perforated convolutional layer [10].
Figure 6: SACT halting scores. Halting scores are evaluated fully convolutionally, making SACT applicable to images of arbitrary resolution. SACT becomes ACT if the $3 \times 3$ conv weights are set to zero.

We define the active positions as the spatial locations where the cumulative halting score is less than one. Because an active position might have inactive neighbors, the values for the inactive positions need to be imputed to evaluate the residual unit in the active positions. We simply copy the previous value for the inactive spatial positions, which is equivalent to setting the residual function value to zero, as displayed in fig. 5. The evaluation of a block can be stopped completely as soon as all the positions become inactive. Also, the ponder cost is averaged across the spatial positions to make it comparable with the ACT ponder cost. The full algorithm is described in alg. 2.

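Input: 3D tensor input
Input: number of residual units in the block L
Input: 0 < ε < 1
     ▷ input and output have different shapes
Output: 3D tensor output of shape H × W × C
Output: ponder cost ρ
x̂ = input
X = {1 ... H} × {1 ... W}
for all (i, j) ∈ X do
      a_ij = true    ▷ Active flag
      c_ij = 0    ▷ Cumulative halting score
      R_ij = 1    ▷ Remainder value
      output_ij = 0    ▷ Output of the block
      ρ_ij = 0    ▷ Per-position ponder cost
end for
for l = 1 ... L do
     if not a_ij for all (i, j) ∈ X then break
     end if
     for all (i, j) ∈ X do
          if a_ij then x_ij = F^l(x̂)_ij
          else x_ij = x̂_ij
          end if
     end for
     for all (i, j) ∈ X do
          if not a_ij then continue
          end if
          if l < L then h_ij = H^l(x)_ij
          else h_ij = 1
          end if
          c_ij += h_ij
          ρ_ij += 1
          if c_ij < 1 − ε then
               output_ij += h_ij · x_ij
               R_ij −= h_ij
          else
               output_ij += R_ij · x_ij
               ρ_ij += R_ij
               a_ij = false
          end if
     end for
     x̂ = x
end for
ρ = Σ_{(i,j) ∈ X} ρ_ij / (H · W)
return output, ρ
Algorithm 2 Spatially Adaptive Computation Time for one block of residual units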
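The per-position bookkeeping of SACT can be sketched in plain Python on a single-channel feature map. This is an illustration under simplifying assumptions, not the authors' code: the residual and halting functions are hypothetical callables evaluated densely (a real implementation would only evaluate them in the active positions), and features are scalars per position.

```python
def sact_block(x, residual_fns, halting_fns, eps=0.01):
    """Run one SACT block on a 2D map x (a list of rows of floats).

    residual_fns[l](x) returns a map of residual values;
    halting_fns[l](x) returns a map of halting scores in [0, 1].
    Returns the output map, the per-position ponder cost and its mean.
    """
    height, width = len(x), len(x[0])
    num_units = len(residual_fns)
    positions = [(i, j) for i in range(height) for j in range(width)]
    active = [[True] * width for _ in range(height)]
    cumulative = [[0.0] * width for _ in range(height)]
    remainder = [[1.0] * width for _ in range(height)]
    output = [[0.0] * width for _ in range(height)]
    ponder = [[0.0] * width for _ in range(height)]
    prev = [row[:] for row in x]
    for l in range(num_units):
        if not any(active[i][j] for i, j in positions):
            break
        res = residual_fns[l](prev)
        # Inactive positions copy the previous value (zero residual).
        cur = [[prev[i][j] + (res[i][j] if active[i][j] else 0.0)
                for j in range(width)] for i in range(height)]
        scores = halting_fns[l](cur) if l < num_units - 1 else None
        for i, j in positions:
            if not active[i][j]:
                continue
            h = scores[i][j] if scores is not None else 1.0
            cumulative[i][j] += h
            ponder[i][j] += 1.0
            if cumulative[i][j] < 1.0 - eps:
                output[i][j] += h * cur[i][j]
                remainder[i][j] -= h
            else:
                output[i][j] += remainder[i][j] * cur[i][j]
                ponder[i][j] += remainder[i][j]
                active[i][j] = False
        prev = cur
    mean_ponder = sum(ponder[i][j] for i, j in positions) / (height * width)
    return output, ponder, mean_ponder


# Toy block of four units on a 2x2 map: each unit adds 1 everywhere;
# the left column halts quickly (score 0.5), the right column slowly (0.25).
residual_fns = [lambda m: [[1.0] * len(m[0]) for _ in m]] * 4
halting_fns = [lambda m: [[0.5, 0.25] for _ in m]] * 4
output, ponder, mean_ponder = sact_block(
    [[0.0, 0.0], [0.0, 0.0]], residual_fns, halting_fns)
print(output, ponder, mean_ponder)
```

In the toy run the left column stops after two units while the right column uses all four, so the ponder cost map is higher on the right and the mean ponder cost lies between the two.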

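We define the halting scores for SACT as

$$h^l(\mathbf{x}) = H^l(\mathbf{x}) = \sigma(\widetilde{W}^l \ast \mathbf{x} + W^l \, \text{pool}(\mathbf{x}) + b^l), \tag{14}$$

where $\ast$ denotes a $3 \times 3$ convolution with a single output channel and $\text{pool}$ is a global average-pooling (see fig. 6). SACT is fully convolutional and can be applied to images of any size.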

Note that SACT is a more general model than ACT, and, consequently, than standard ResNet. If we choose $\widetilde{W}^l = 0$, then the halting scores for all spatial positions coincide. In this case the computation for all the positions halts simultaneously and we recover the ACT model.
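The relation between the SACT and ACT halting scores can be illustrated on a single-channel map. The sketch below is a simplification with assumed details: one input channel, zero padding at the borders, and made-up weights; the names `conv3x3_same` and `sact_halting_scores` are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def conv3x3_same(x, kernel):
    """3x3 cross-correlation on a single-channel map with zero padding."""
    height, width = len(x), len(x[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        total += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = total
    return out


def sact_halting_scores(x, conv_kernel, pool_weight, bias):
    """Per-position halting scores: local 3x3 term + global pooled term."""
    height, width = len(x), len(x[0])
    pooled = sum(sum(row) for row in x) / (height * width)
    local = conv3x3_same(x, conv_kernel)
    return [[sigmoid(local[i][j] + pool_weight * pooled + bias)
             for j in range(width)] for i in range(height)]


x = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
zero_kernel = [[0.0] * 3 for _ in range(3)]
center_kernel = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]

act_like = sact_halting_scores(x, zero_kernel, pool_weight=0.1, bias=-1.0)
spatial = sact_halting_scores(x, center_kernel, pool_weight=0.1, bias=-1.0)
print(act_like[0], spatial[0])
```

With the zero kernel every position receives the same score, which is the ACT case; with a non-zero kernel the scores vary across positions.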

SACT requires evaluation of the residual function in just the active spatial positions. This can be performed efficiently using the perforated convolutional layer proposed in [10] (with skipped values replaced by zeros instead of the nearest neighbor’s values). Recall that the residual function consists of a stack of $1 \times 1$, $3 \times 3$ and $1 \times 1$ convolutional layers. The first convolutional layer has to be evaluated in the positions obtained by dilating the active positions set with a $3 \times 3$ kernel. The second and third layers need to be evaluated just in the active positions.
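The set of positions needed by the first layer can be computed with a binary dilation of the active mask. The following sketch uses a made-up 5x5 mask; `dilate3x3` is an illustrative helper, not part of any library.

```python
def dilate3x3(mask):
    """Binary dilation of a 2D boolean mask with a 3x3 structuring element."""
    height, width = len(mask), len(mask[0])
    out = [[False] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if mask[i][j]:
                for ii in range(max(0, i - 1), min(height, i + 2)):
                    for jj in range(max(0, j - 1), min(width, j + 2)):
                        out[ii][jj] = True
    return out


def count_true(mask):
    return sum(1 for row in mask for v in row if v)


# Hypothetical active mask with a single active position in the middle.
active = [[False] * 5 for _ in range(5)]
active[2][2] = True

first_layer_positions = dilate3x3(active)
print(count_true(active), count_true(first_layer_positions))
```

For a single active position, the first layer is evaluated in 9 positions (fewer at the image border), while the second and third layers are evaluated in 1 position.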

An alternative approach to using the perforated convolutional layer is to tile the halting scores map. Suppose that we share the values of the halting scores $h^l$ within $k \times k$ tiles. For example, we can perform pooling of $h^l$ with a kernel size $k \times k$ and stride $k$ and then upscale the results by a factor of $k$. Then, all positions in a tile have the same active flag, and we can apply the residual unit densely to just the active tiles, reusing the commonly available convolution routines. $k$ should be sufficiently high to mitigate the overhead of the additional kernel calls and the overlapping computations of the first $1 \times 1$ convolution. Therefore, tiling is advisable when SACT is applied to high-resolution images.
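Tiling the halting scores can be sketched as pooling followed by nearest-neighbour upscaling. The text does not say which pooling is used, so average pooling is an assumption here, as are the function name `tile_scores` and the example values.

```python
def tile_scores(h, k):
    """Share halting scores within k x k tiles.

    Average-pools h with kernel size k and stride k (assumed pooling type),
    then upscales the pooled map by a factor of k with nearest neighbour.
    """
    height, width = len(h), len(h[0])
    if height % k != 0 or width % k != 0:
        raise ValueError("map size must be divisible by the tile size")
    tiled = [[0.0] * width for _ in range(height)]
    for ti in range(0, height, k):
        for tj in range(0, width, k):
            values = [h[i][j]
                      for i in range(ti, ti + k)
                      for j in range(tj, tj + k)]
            mean = sum(values) / len(values)
            for i in range(ti, ti + k):
                for j in range(tj, tj + k):
                    tiled[i][j] = mean
    return tiled


# Hypothetical 4x4 halting score map, tiled with k = 2.
h = [[0.0, 0.5, 1.0, 1.0],
     [0.5, 1.0, 1.0, 1.0],
     [0.0, 0.0, 0.25, 0.25],
     [0.0, 0.0, 0.25, 0.25]]
tiled = tile_scores(h, 2)
for row in tiled:
    print(row)
```

After tiling, all positions of a tile accumulate the same halting score and therefore become inactive at the same unit, so each active tile can be processed as a small dense feature map.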

3 Related work

The majority of the work on increasing the computational efficiency of deep convolutional networks focuses on static techniques. These include decompositions of convolutional kernels [22] and pruning of connections [14]. Many of these techniques made their way into the design of the standard deep architectures. For example, Inception [40] and ResNet [16, 17] use factorized convolutional kernels.

Recently, several works have considered the problem of varying the amount of computation in computer vision. Cascaded classifiers [28, 44] are used in object detection to quickly reject “easy” negative proposals. Dynamic Capacity Networks [1] use the same amount of computation for all images and rely on an image classification-specific heuristic. PerforatedCNNs [10] vary the amount of computation spatially but not between images. [3] proposes to tune the amount of computation in a fully-connected network using a REINFORCE-trained policy, which makes the optimization problem significantly more challenging.

BranchyNet [41] is the most similar approach to ours, although it is only applicable to classification problems. It adds classification branches to the intermediate layers of the network. As soon as the entropy of the intermediate classifications is below some threshold, the network’s evaluation halts. Our preliminary experiments with a similar procedure based on ACT (using ACT to choose the number of blocks to evaluate) show that it is inferior to using fewer units per block.

4 Experiments

We first apply ACT and SACT models to the image classification task for the ImageNet dataset [8]. We show that SACT achieves a better FLOPs-accuracy trade-off than ACT by directing computation to the regions of interest. Additionally, SACT improves the accuracy on high-resolution images compared to the ResNet model. Next, we use the obtained SACT model as a feature extractor in the Faster R-CNN object detection pipeline [35] on the COCO dataset [32]. Again we show that we obtain significantly improved FLOPs-mAP trade-off compared to basic ResNet models. Finally, we demonstrate that SACT ponder cost maps correlate well with the position of human eye fixations by evaluating them as a visual saliency model on the cat2000 dataset [4] without any training on this dataset.

4.1 Image classification (ImageNet dataset)

First, we train the basic ResNet-50 and ResNet-101 models from scratch using asynchronous SGD with momentum (see the supplementary text for the hyperparameters). Our models achieve similar performance to the reference implementation (for a single center crop, the reference ResNet-101 model achieves 76.4% accuracy and 92.9% recall@5, while our implementation achieves 76% and 93.1%, respectively; note that our model is the newer pre-activation ResNet [17] and the reference implementation is the post-activation ResNet [16]).

We use ResNet-101 as the basic architecture for ACT and SACT models. Thanks to the end-to-end differentiability and deterministic behaviour, we find the same optimization hyperparameters are applicable for training of ACT and SACT as for the ResNet models. However, special care needs to be taken to address the dead residual unit problem in ACT and SACT models. Since ACT and SACT are deterministic, the last units in the blocks do not get enough training signal and their parameters become obsolete. As a result, the ponder cost saved by not using these units overwhelms the possible initial gains in the original loss function and the units are never used. We observe that while the dead residual units can be recovered during training, this process is very slow. Note that ACT-RNN [12] is not affected by this problem since the parameters for all timesteps are shared.

We find two techniques helpful for alleviating the dead residual unit problem. First, we initialize the bias of the halting scores units to a negative value to force the model to use the last units during the initial stages of learning. We use in the experiments which corresponds to initially using units. Second, we use a two-stage training procedure by initializing the ACT/SACT network’s weights from the pretrained ResNet-101 model. The halting score weights are still initialized randomly. This greatly simplifies learning of a reasonable halting policy in the beginning of training.
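The effect of the negative bias can be estimated with a short calculation. Assuming the halting score weights are close to zero at initialization, every halting score is approximately the sigmoid of the bias, so the cumulative score reaches one after roughly the reciprocal of that value (capped by the number of units in the block). The bias values below are illustrative only and are not taken from the experiments.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def initial_units(bias):
    """Approximate number of units evaluated at initialization when every
    halting score equals sigmoid(bias) (assumes near-zero halting weights)."""
    return 1.0 / sigmoid(bias)


# Illustrative bias values, not the setting used in the paper.
for bias in (0.0, -1.0, -2.0, -3.0, -4.0):
    print(bias, round(initial_units(bias), 1))
```

The more negative the bias, the more units are used at the start of training, which gives the last units of a block a chance to receive a training signal.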

As a baseline for ACT and SACT, we consider a non-adaptive ResNet model with a similar number of floating point operations. We take the average numbers of units used in each block in the ACT or SACT model (for SACT we also average over the spatial dimensions) and round them to the nearest integers. Then, we train a ResNet model with that number of units per block. We follow the two-stage training procedure by initializing the network’s parameters with the first residual units of the full ResNet-101 in each block. This slightly improves the performance compared to using random initialization.

Figure 7: ImageNet validation set. Comparison of ResNet, ACT, SACT and the respective baselines. Panels: (a) and (b) results at two fixed test resolutions; (c) resolution vs. accuracy; (d) FLOPs vs. accuracy for varying resolution. Error bars denote one standard deviation across images. All models are trained at the same image resolution. SACT outperforms ACT and baselines when applied to images whose resolutions are higher than the training images. The advantage margin grows as the resolution difference increases.
Figure 8: Ponder cost maps for each block (SACT , ImageNet validation image). Note that the first block reacts to the low-level features while the last two blocks attempt to localize the object.
Figure 9: ImageNet validation set. SACT () ponder cost maps. Top: low ponder cost (19.8-20.55), middle: average ponder cost (23.4-23.6), bottom: high ponder cost (24.9-26.0). SACT typically focuses the computation on the region of interest.

We compare ACT and SACT to ResNet-50, ResNet-101 and the baselines in fig. 7. We measure the average per-image number of floating point operations (FLOPs) required for evaluation of the validation set. We treat multiply-add as two floating point operations. The FLOPs are calculated just for the convolution operations (perforated convolution for SACT) since all other operations (non-linearities, pooling and output averaging in ACT/SACT) have minimal impact on this metric. The ACT models use and SACT models use . If we increase the image resolution at the test time, as suggested in [17], we observe that SACT outperforms ACT and the baselines. Surprisingly, in this setting SACT has higher accuracy than the ResNet-101 model while being computationally cheaper. Such accuracy improvement does not happen for the baseline models or ACT models. We attribute this to the improved scale tolerance provided by the SACT mechanism. The extended results of fig. 7(a,b), including the average number of residual units per block, are presented in the supplementary.
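The FLOPs accounting described above can be sketched as follows, counting one multiply-add as two operations and only the convolutions of a residual unit. The feature map size, channel counts and the numbers of active and dilated positions are hypothetical values chosen for illustration; biases and the halting score branch are ignored.

```python
def conv_flops(num_positions, kernel_size, in_channels, out_channels):
    """FLOPs of a convolution evaluated at num_positions output locations.

    One multiply-add is counted as two floating point operations.
    """
    return 2 * num_positions * kernel_size ** 2 * in_channels * out_channels


def residual_unit_flops(first_positions, other_positions, channels, bottleneck):
    """FLOPs of a 1x1 -> 3x3 -> 1x1 residual function.

    The first layer runs at first_positions locations (the dilated active
    set), the second and third layers at other_positions (the active set).
    """
    reduce = conv_flops(first_positions, 1, channels, bottleneck)
    middle = conv_flops(other_positions, 3, bottleneck, bottleneck)
    restore = conv_flops(other_positions, 1, bottleneck, channels)
    return reduce + middle + restore


# Hypothetical unit: 14x14 feature map, 1024 channels, 256 bottleneck channels.
height, width, channels, bottleneck = 14, 14, 1024, 256
dense = residual_unit_flops(height * width, height * width, channels, bottleneck)

# Hypothetical SACT state: a 7x7 square of active positions (49 positions),
# whose 3x3 dilation covers a 9x9 square (81 positions).
sparse = residual_unit_flops(81, 49, channels, bottleneck)

print(dense, sparse, sparse / dense)
```

Because the first layer is a cheap pointwise convolution, the extra positions introduced by the dilation add little cost compared with the savings in the other two layers.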

We visualize the ponder cost for each block of SACT as heat maps (which we call ponder cost maps henceforth) in fig. 8. More examples of the total SACT ponder cost maps are shown in fig. 9.

4.2 Object detection (COCO dataset)

Motivated by the success of SACT in classification of high-resolution images and ignoring uninformative background, we now turn to a harder problem of object detection. Object detection is typically performed for high-resolution images (such as , compared to for ImageNet classification) to allow detection of small objects. Computational redundancy becomes a big issue in this setting since a large image area is often occupied by the background.

We use the Faster R-CNN object detection pipeline [35], which consists of three stages. First, the image is processed with a feature extractor. This is the most computationally expensive part. Second, a Region Proposal Network predicts a number of class-agnostic rectangular proposals (typically 300). Third, each proposal box’s features are cropped from the feature map and passed through a box classifier which predicts whether the proposal corresponds to an object, the class of this object and refines the boundaries. We train the model end-to-end using asynchronous SGD with momentum, employing TensorFlow’s crop_and_resize operation, which is similar to the Spatial Transformer Network [21], to perform cropping of the region proposals. The training hyperparameters are provided in the supplementary.

We use ResNet blocks 1-3 as a feature extractor and block 4 as a box classifier, as suggested in [16]. We reuse the models pretrained on the ImageNet classification task and fine-tune them for COCO detection. For SACT, the ponder cost penalty is only applied to the feature extractor (we use the same value as for ImageNet classification). We use COCO train for training and COCO val for evaluation (instead of the combined train+val set which is sometimes used in the literature). We do not employ multiscale inference, iterative box refinement or global context.

We find that SACT achieves superior speed-mAP trade-off compared to the baseline of using non-adaptive ResNet as a feature extractor (see table 1). SACT model has slightly higher FLOPs count than ResNet-50 and points better mAP. Note that this SACT model outperforms the originally reported result for ResNet-101, mAP [16]. Several examples are presented in fig. 10.

Feature extractor FLOPs (%) mAP @ (%)
ResNet-101 [16]
ResNet-50 (our impl.)
ResNet-101 (our impl.)
Table 1: COCO val set. Faster R-CNN with SACT results. FLOPs are average ( one standard deviation) feature extractor floating point operations relative to ResNet-101 (that does 1.42E+11 operations). SACT improves the FLOPs-mAP trade-off compared to using ResNet without adaptive computation.
Figure 10: COCO testdev set. Detections and feature extractor ponder cost maps (). SACT allocates much more computation to the object-like regions of the image.

4.3 Visual saliency (cat2000 dataset)

We now show that SACT ponder cost maps correlate well with human attention. To do that, we use a large dataset of visual saliency: the cat2000 dataset [4]. The dataset is obtained by showing 4,000 images of 20 scene categories to 24 human subjects and recording their eye fixation positions. The ground-truth saliency map is a heat map of the eye fixation positions. We do not train the SACT models on this dataset and simply reuse the ImageNet- and COCO-trained models. Cat2000 saliency maps exhibit a strong center bias. Most images contain a blob of saliency in the center even when there is no object of interest located there. Since our model is fully convolutional, we cannot learn such bias even if we trained on the saliency data. Therefore, we combine our ponder cost maps with a constant center-biased map.

We resize the cat2000 images to for ImageNet model and to for COCO model and pass them through the SACT model. Following [4], we consider a linear combination of the Gaussian blurred ponder cost map normalized to range and a “center baseline,” a Gaussian centered at the middle of the image. Full description of the combination scheme is provided in the supplementary. The first half of the training set images for every scene category is used for determining the optimal values of the Gaussian blur kernel size and the center baseline multiplier, while the second half is used for validation.

Table 2 presents the AUC-Judd [5] metric, the area under the ROC-curve for the saliency map as a predictor for eye fixation positions. SACT outperforms the naïve center baseline and is competitive with DeepFix [25], the state-of-the-art deep model. Examples are shown in fig. 11.

Figure 11: cat2000 saliency dataset. Left to right: image, human saliency, SACT ponder cost map (COCO model, ) with postprocessing (see text) and softmax with temperature . Note the center bias of the dataset. SACT model performs surprisingly well on out-of-domain images such as art and fractals.
Model AUC-Judd (%)
Center baseline [4]
DeepFix [25]
“Infinite humans” [4]
ImageNet SACT
Table 2: cat2000 validation set. - results for the test set. SACT ponder cost maps work as a visual saliency model even without explicit supervision.

5 Conclusion

We present a Residual Network based model with a spatially varying computation time. This model is end-to-end trainable, deterministic and can be viewed as a black-box feature extractor. We show its effectiveness in image classification and object detection problems. The amount of per-position computation in this model correlates well with the human eye fixation positions, suggesting that this model captures the important parts of the image. We hope that this paper will lead to a wider adoption of attention and adaptive computation time in large-scale computer vision systems. The source code is available at

Acknowledgments. D. Vetrov is supported by Russian Academic Excellence Project ‘5-100’. R. Salakhutdinov is supported in part by ONR grants N00014-13-1-0721, N00014-14-1-0232, and the ADeLAIDE grant FA8750-16C-0130-001.


Appendix A Implementation details

a.1 Image classification (ImageNet)

Optimization hyperparameters. ResNet, ACT and SACT use the same hyperparameters. We train the networks with 50 workers running asynchronous SGD with momentum , weight decay and batch size . The training is halted upon convergence after epochs. Learning rate is initially set to and lowered by a factor of after every epochs. Batch normalization parameters are: epsilon 1e-5, moving average decay

. The parameters of the network are initialized with a variance scaling initializer


Data augmentation. We use the Inception v3 data augmentation procedure, which includes horizontal flipping, scale, aspect ratio, color augmentation. For the ImageNet images with a provided bounding box, we perform cropping based on the distorted bounding box. For evaluation, we take a single central crop of of the original image’s area and then resize this crop to the target resolution.

a.2 Object detection (COCO)

ResNet and SACT models use the same hyperparameters. The images are upscaled (preserving aspect ratio) so that the smaller side is at least 600 pixels. For data augmentation, we use random horizontal flipping as described in [35]. We do not employ the atrous convolution algorithm.

Optimization hyperparameters. We use distributed training with 9 workers running asynchronous SGD with momentum and a batch size of . The learning rate is initially set to and lowered by a factor of 10 after the 800 thousandth and 1 millionth training iterations (batches). The training proceeds for a total of 1.2 million iterations. Batch normalization parameters are fixed to the values obtained on ImageNet during training.

Faster R-CNN hyperparameters. Other than the training method, our hyperparameters for the Faster R-CNN model closely follow those recommended by the original paper [35]. The anchors are generated in the same way, sampled from a regular grid of stride 16. One change relative to the original paper is the addition of an additional anchor size, so full set of anchor box sizes are , with the height and width also varying for each choice of the aspect ratios . We use 300 object proposals per image. Non-maximum suppression is performed with IoU threshold. For each proposal, the features are cropped into a box with crop_and_resize TensorFlow operation, then pooled to .

a.3 Visual saliency (cat2000)

Here we describe the postprocessing procedure used in the visual saliency experiments. Consider a ponder cost map . Let be a Gaussian filter with a standard deviation .

We first normalize this map to :




and similarly for .

Then, we blur the ponder cost map by convolving the Gaussian filter with :


We obtain a baseline map by rescaling the reference centered Gaussian to resolution. We use this map with a weight .

Finally, the postprocessed ponder cost map is defined as a normalized blurred ponder cost map plus the weighted center baseline map:


depends on two hyperparameters: the Gaussian filter standard deviation and baseline map weight . The values are tuned via grid search. In the experiments in the paper, we use , for both models.
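The postprocessing steps can be sketched in plain Python as follows. Several details are assumptions made for illustration because they are not recoverable from the text: the map is min-max normalized both before and after blurring, borders are handled by replicating edge values, and the center baseline is a generic centered Gaussian rather than the reference map used in the paper. The names `sigma`, `weight` and `sigma_frac`, and the example values, are hypothetical.

```python
import math


def normalize(m):
    """Min-max normalize a 2D map to the range [0, 1]."""
    lo = min(min(row) for row in m)
    hi = max(max(row) for row in m)
    if hi == lo:
        return [[0.0] * len(m[0]) for _ in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-0.5 * (d / sigma) ** 2)
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(m, kernel):
    """Convolve every row with the kernel, replicating the border values."""
    radius = len(kernel) // 2
    width = len(m[0])
    return [[sum(w * row[min(max(j + d - radius, 0), width - 1)]
                 for d, w in enumerate(kernel))
             for j in range(width)] for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def gaussian_blur(m, sigma):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(m, kernel)), kernel))


def center_baseline(height, width, sigma_frac=0.25):
    """Generic centered Gaussian map (a stand-in for the reference baseline)."""
    ci, cj = (height - 1) / 2.0, (width - 1) / 2.0
    return [[math.exp(-0.5 * (((i - ci) / (sigma_frac * height)) ** 2
                              + ((j - cj) / (sigma_frac * width)) ** 2))
             for j in range(width)] for i in range(height)]


def postprocess(ponder_map, sigma, weight):
    """Normalized blurred ponder cost map plus weighted center baseline."""
    blurred = normalize(gaussian_blur(normalize(ponder_map), sigma))
    baseline = center_baseline(len(ponder_map), len(ponder_map[0]))
    return [[b + weight * c for b, c in zip(row_b, row_c)]
            for row_b, row_c in zip(blurred, baseline)]


# Hypothetical 5x5 ponder cost map with a peak in the top-left corner.
ponder_map = [[20.0] * 5 for _ in range(5)]
ponder_map[0][0] = 26.0
saliency = postprocess(ponder_map, sigma=1.0, weight=0.5)
print([round(v, 3) for v in saliency[0]])
```

The two hyperparameters trade off the model's own ponder cost map against the center bias of the dataset: a weight of zero keeps only the blurred ponder cost map, while a large weight approaches the center baseline.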

Appendix B Additional ImageNet results

We present the extended results of ACT, SACT and ResNet models on ImageNet validation set, including the number of residual units per block, in table 3.

The ImageNet models in the paper are trained with images. Even though all the models are fully convolutional and can be applied to images of any resolution during test time, increasing the training resolution can improve the quality of the model at the cost of longer training and higher GPU memory requirements. We have explored training of the SACT model with a resolution of . This resolution is the highest we could fit into GPU memory for a batch size of 32 (decreasing the batch size deteriorates the accuracy). Comparison of SACT models trained with resolutions and is presented in fig. 12. Interestingly, both considered models achieve the highest accuracy at test resolution . This accuracy is for the training resolution of and for the training resolution of . In the second case, the FLOPs are higher. When the training resolution is increased, the accuracy at lower resolutions is predictably worse, while the accuracy at higher resolutions is better. The results suggest that training at higher resolutions is beneficial, and that the improved scale tolerance is not diminished when the training resolution is increased.

Network FLOPs Residual units Accuracy Recall@5
(a) Test resolution
Network FLOPs Residual units Accuracy Recall@5
(b) Test resolution
Table 3: ImageNet validation set. Comparison of ResNet, ACT, SACT and the respective baselines. All models are trained with resolution images. denotes mean value and one standard deviation .
(a) Resolution vs. accuracy
(b) FLOPs vs. accuracy for varying test resolution
Figure 12: ImageNet validation set. Comparison of SACT with different training resolutions.