This paper proposes a deep learning architecture based on Residual Network that dynamically adjusts the number of executed layers for the regions of the image. This architecture is end-to-end trainable, deterministic and problem-agnostic. It is therefore applicable without any modifications to a wide range of computer vision problems such as image classification, object detection and image segmentation. We present experimental results showing that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets. Additionally, we evaluate the computation time maps on the visual saliency dataset cat2000 and find that they correlate surprisingly well with human eye fixation positions.
Deep convolutional networks have gained wide adoption in image classification [24, 39, 40] due to their exceptional accuracy. In recent years, deep convolutional networks have become an integral part of state-of-the-art systems for a diverse set of computer vision problems such as object detection, image segmentation, image-to-text [23, 43], visual question answering and image generation. They have also been shown to be surprisingly effective in non-vision domains, e.g., [45] and analyzing the board in the game of Go.
A major drawback of deep convolutional networks is their huge computational cost. A natural way to tackle this issue is by using attention to guide the computation, which is similar to how biological vision systems operate. Glimpse-based attention models [27, 34, 2, 21] assume that the problem at hand can be solved by carefully processing a small number of typically rectangular sub-regions of the image. This makes such models unsuitable for multi-output problems (generating box proposals in object detection) and per-pixel prediction problems (image segmentation, image generation). Additionally, choosing the glimpse positions requires designing a separate prediction network or a heuristic procedure. On the other hand, soft spatial attention models [43, 37] do not save computation, since they require evaluating the model at all spatial positions to choose per-position attention weights.
We build upon the Adaptive Computation Time (ACT) mechanism, which was recently proposed for Recurrent Neural Networks (RNNs). We show that ACT can be applied to dynamically choose the number of evaluated layers in a Residual Network [16, 17] (the similarity between Residual Networks and RNNs was explored in [30, 13]). Next, we propose Spatially Adaptive Computation Time (SACT), which adapts the amount of computation between spatial positions. While we use the SACT mechanism for Residual Networks, it can potentially be used for convolutional LSTM models for video processing.
SACT is an end-to-end trainable architecture that incorporates attention into Residual Networks. It learns a deterministic policy that stops computation in a spatial position as soon as the features become “good enough”. Since SACT maintains the alignment between the image and the feature maps, it is well-suited for a wide range of computer vision problems, including multi-output and per-pixel prediction problems.
We evaluate the proposed models on the ImageNet classification problem and find that SACT outperforms both ACT and non-adaptive baselines. Then, we use SACT as a feature extractor in the Faster R-CNN object detection pipeline and demonstrate results on the challenging COCO dataset. Example detections and a ponder cost (computation time) map are presented in fig. 1. SACT achieves a significantly better FLOPs-quality trade-off than the non-adaptive ResNet model. Finally, we demonstrate that the obtained computation time maps are well-correlated with human eye fixation positions, suggesting that a reasonable attention model arises in the model automatically without any explicit supervision.
We begin by outlining the recently proposed deep convolutional model Residual Network (ResNet) [16, 17]. Then, we present Adaptive Computation Time, a model which adaptively chooses the number of residual units in ResNet. Finally, we show how this idea can be applied at the spatial position level to obtain the Spatially Adaptive Computation Time model.
The models we propose are general and can be applied to any ResNet architecture. The first two layers of ResNet-101 are a convolution and a max-pooling layer which together have a total stride of four. Then, a sequence of four blocks is stacked together, each block consisting of multiple stacked residual units. ResNet-101 contains four blocks with 3, 4, 23 and 3 units, respectively. A residual unit has the form $F(\mathbf{x}) = \mathbf{x} + f(\mathbf{x})$, where the first term is called a shortcut connection and the second term is a residual function. A residual function consists of three convolutional layers: a $1 \times 1$ layer that reduces the number of channels, a $3 \times 3$ layer that has an equal number of input and output channels, and a $1 \times 1$ layer that restores the number of channels. We use pre-activation ResNet, in which each convolutional layer is preceded by batch normalization and a ReLU non-linearity. The first units in blocks 2-4 have a stride of 2 and increase the number of output channels by a factor of 2. All other units have equal input and output dimensions. This design choice follows Very Deep Networks and ensures that all units in the network have an equal computational cost (except for the first units of blocks 2-4, which have a slightly higher cost).
Finally, the obtained feature map is passed through a global average pooling layer. The network is fully convolutional, meaning that it can be applied to images of varying resolutions without changing the network's parameters.
Let us first informally explain Adaptive Computation Time (ACT) before describing it in more detail and providing an algorithm. We add a branch to the outputs of each residual unit which predicts a halting score, a scalar value in the range $[0, 1]$. The residual units and the halting scores are evaluated sequentially, as shown in fig. 3. As soon as the cumulative sum of the halting scores reaches one, all following residual units in this block are skipped. We set the halting distribution to be the evaluated halting scores with the last value replaced by a remainder. This ensures that the distribution over the values of the halting scores sums to one. The output of the block is then re-defined as a weighted sum of the outputs of residual units, where the weight of each unit is given by the corresponding probability value. Finally, a ponder cost is introduced that is the number of evaluated residual units plus the remainder value. Minimizing the ponder cost increases the halting scores of the non-last residual units, making it more likely that the computation stops earlier. The ponder cost is then multiplied by a constant $\tau$ and added to the original loss function. ACT is applied to each block of ResNet independently with the ponder costs summed.
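Formally, we consider a block of $L$ residual units (boldface denotes tensors of shape Height $\times$ Width $\times$ Channels):

$$\mathbf{x}^0 = \mathbf{input},$$
$$\mathbf{x}^l = F^l(\mathbf{x}^{l-1}) = \mathbf{x}^{l-1} + f^l(\mathbf{x}^{l-1}), \quad l = 1, \dots, L,$$
$$\mathbf{output} = \mathbf{x}^L.$$

We introduce a halting score $h^l \in [0, 1]$ for each residual unit. We define $h^L = 1$ to enforce stopping after the last unit.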
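We choose the halting score function to be a simple linear model on top of the pooled features:

$$h^l = H^l(\mathbf{x}^l) = \sigma(W^l \, \mathrm{pool}(\mathbf{x}^l) + b^l), \quad l = 1, \dots, L - 1,$$

where $\mathrm{pool}$ is a global average pooling and $\sigma(t) = 1 / (1 + \exp(-t))$.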
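Next, we determine $N$, the number of residual units to evaluate, as the index of the first unit where the cumulative halting score exceeds $1 - \varepsilon$:

$$N = \min \Big\{ n \in \{1, \dots, L\} : \sum_{l=1}^{n} h^l \ge 1 - \varepsilon \Big\}, \tag{7}$$

where $\varepsilon$ is a small constant (e.g., 0.01) that ensures that $N$ can be equal to 1 (the computation stops after the first unit) even though $h^1$ is an output of a sigmoid function, meaning that $h^1 < 1$.

Additionally, we define the remainder $R$:

$$R = 1 - \sum_{l=1}^{N-1} h^l.$$

Due to the definition of $N$ in eqn. (7), we have $0 \le R \le 1$.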
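We next transform the halting scores into a halting distribution, which is a discrete distribution over the residual units. Its property is that all the units starting from the $(N+1)$-st have zero probability:

$$p^l = \begin{cases} h^l & \text{if } l < N, \\ R & \text{if } l = N, \\ 0 & \text{if } l > N. \end{cases}$$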
The output of the block is now defined as the outputs of residual units weighted by the halting distribution. Since representations of residual units are compatible with each other [19, 13], the weighted average also produces a feature representation of the same type. The values of $\mathbf{x}^{N+1}, \dots, \mathbf{x}^L$ have zero weight and therefore their evaluation can be skipped:

$$\mathbf{output} = \sum_{l=1}^{L} p^l \mathbf{x}^l = \sum_{l=1}^{N} p^l \mathbf{x}^l. \tag{10}$$
Ideally, we would like to directly minimize the number of evaluated units $N$. However, $N$ is a piecewise constant function of the halting scores that cannot be optimized with gradient descent. Instead, we introduce the ponder cost $\rho$, an almost everywhere differentiable upper bound on the number of evaluated units (recall that $R \ge 0$):

$$\rho = N + R.$$

When differentiating $\rho$, we ignore the gradient of $N$. Also, note that $\rho$ is not a continuous function of the halting scores $h^l$. The discontinuities happen in the configurations of halting scores where $N$ changes value. Following the original ACT formulation, we ignore these discontinuities and find that they do not impede training. Algorithm 1 shows the description of ACT.
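The ACT procedure described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `act_block` and its arguments are hypothetical, the residual units and halting score functions are passed in as plain callables, and `eps` plays the role of the small constant (e.g., 0.01) in the stopping rule.

```python
import numpy as np

def act_block(x, units, halting_fns, eps=0.01):
    """Evaluate one ACT block (hypothetical helper, for illustration only).

    x           -- input tensor of shape (H, W, C)
    units       -- list of L callables, each a full residual unit
    halting_fns -- list of at least L-1 callables, tensor -> scalar in [0, 1]
    Returns (output, ponder_cost, num_units_evaluated).
    """
    num_units = len(units)
    x = np.asarray(x, dtype=float)
    output = np.zeros_like(x)
    cumulative = 0.0  # running sum of halting scores
    for l in range(num_units):
        x = units[l](x)
        # The last unit always halts: its halting score is defined as 1.
        h = 1.0 if l == num_units - 1 else float(halting_fns[l](x))
        if cumulative + h >= 1.0 - eps:
            remainder = 1.0 - cumulative       # weight of the last used unit
            output += remainder * x
            n_evaluated = l + 1
            # Ponder cost: number of evaluated units plus the remainder.
            return output, n_evaluated + remainder, n_evaluated
        output += h * x                         # weight of a non-last unit
        cumulative += h
```

Note that the output is accumulated on the fly, so the intermediate unit outputs never need to be stored, and units after the halting point are never called.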
The partial derivative of the ponder cost w.r.t. a halting score $h^l$ is

$$\frac{\partial \rho}{\partial h^l} = \begin{cases} -1 & \text{if } l < N, \\ 0 & \text{if } l \ge N. \end{cases}$$

Therefore, minimizing the ponder cost increases $h^1, \dots, h^{N-1}$, making the computation stop earlier. This effect is balanced by the original loss function, which also depends on the halting scores via the block output, eqn. (10). Intuitively, the more residual units are used, the better the output, so minimizing the original loss usually increases the weight $R$ of the last used unit's output $\mathbf{x}^N$, which in turn decreases $h^1, \dots, h^{N-1}$.
ACT has several important advantages. First, it adds very few parameters and computation to the base model. Second, it allows the output of the block to be calculated "on the fly" without storing all the intermediate residual unit outputs and halting scores in memory. For example, this would not be possible if the halting distribution were a softmax of halting scores, as done in soft attention. Third, we can recover a block with any constant number of units $N \le L$ by setting $h^1 = \dots = h^{N-1} = 0$, $h^N = 1$. Therefore, ACT is a strict generalization of standard ResNet.
We apply ACT to each block independently and then stack the obtained blocks as in the original ResNet. The input of the next block becomes the weighted average of the residual units from the previous block, eqn. (10). A similar connectivity pattern has been explored previously. We add the sum of the ponder costs $\rho_k$, $k = 1, \dots, K$, from the $K$ blocks to the original loss function $\mathcal{L}$:

$$\mathcal{L}' = \mathcal{L} + \tau \sum_{k=1}^{K} \rho_k.$$

The resulting loss function $\mathcal{L}'$ is differentiable and can be optimized using conventional backpropagation. $\tau \ge 0$ is a regularization coefficient which controls the trade-off between optimizing the original loss function and the ponder cost.
In this section, we present Spatially Adaptive Computation Time (SACT). We adjust the per-position amount of computation by applying ACT to each spatial position of the block, as shown in fig. 4. As we show in the experiments, SACT can learn to focus the computation on the regions of interest.
We define the active positions as the spatial locations where the cumulative halting score is less than one. Because an active position might have inactive neighbors, the values for the inactive positions need to be imputed to evaluate the residual unit in the active positions. We simply copy the previous value for the inactive spatial positions, which is equivalent to setting the residual function value to zero, as displayed in fig. 5. The evaluation of a block can be stopped completely as soon as all the positions become inactive. Also, the ponder cost is averaged across the spatial positions to make it comparable with the ACT ponder cost. The full algorithm is described in alg. 2.
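The per-position bookkeeping can be illustrated with a small NumPy reference sketch. It is written under stated assumptions and is not the authors' code: `sact_block`, `residual_fns` and `halting_fns` are hypothetical names, the residual functions are evaluated densely and then masked (so this sketch reproduces the semantics but not the computational savings), and `eps` is the small stopping constant.

```python
import numpy as np

def sact_block(x, residual_fns, halting_fns, eps=0.01):
    """Reference semantics of one SACT block (illustrative sketch).

    x            -- features of shape (H, W, C)
    residual_fns -- list of L callables, (H, W, C) -> (H, W, C) residual only
    halting_fns  -- list of at least L-1 callables, (H, W, C) -> (H, W) in [0, 1]
    Returns (output, ponder_map); the block ponder cost is ponder_map.mean().
    """
    num_units = len(residual_fns)
    height, width, _ = x.shape
    x = x.astype(float)
    active = np.ones((height, width), dtype=bool)
    cumulative = np.zeros((height, width))   # per-position sum of halting scores
    remainder = np.ones((height, width))     # per-position remainder
    ponder = np.zeros((height, width))
    output = np.zeros_like(x)
    for l in range(num_units):
        if not active.any():
            break  # every position has halted; skip the remaining units
        # Inactive positions copy their previous value (zero residual).
        x = np.where(active[..., None], x + residual_fns[l](x), x)
        if l < num_units - 1:
            h = halting_fns[l](x)
        else:
            h = np.ones((height, width))     # the last unit always halts
        cumulative = np.where(active, cumulative + h, cumulative)
        ponder = ponder + active             # one more unit at active positions
        halting_now = active & (cumulative >= 1.0 - eps)
        continuing = active & ~halting_now
        weight = np.where(continuing, h, 0.0) + np.where(halting_now, remainder, 0.0)
        output = output + weight[..., None] * x
        ponder = ponder + np.where(halting_now, remainder, 0.0)
        remainder = np.where(continuing, remainder - h, remainder)
        active = continuing
    return output, ponder
```

A practical implementation would evaluate the residual function only at the active positions instead of masking a dense result.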
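We define the halting scores for SACT as

$$\mathbf{H}^l(\mathbf{x}) = \sigma(\widetilde{W}^l * \mathbf{x} + W^l \, \mathrm{pool}(\mathbf{x}) + b^l),$$

where $*$ denotes a $3 \times 3$ convolution with a single output channel and $\mathrm{pool}$ is a global average-pooling (see fig. 6). SACT is fully convolutional and can be applied to images of any size.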
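A halting score map of this kind, combining a local single-channel convolution with a global average-pooled term followed by a sigmoid, might be computed as in the NumPy sketch below. The function and parameter names are hypothetical, zero padding at the borders is an assumption, and the kernel size is taken from the shape of `conv_kernel` rather than fixed.

```python
import numpy as np

def sact_halting_scores(x, conv_kernel, pool_weights, bias):
    """Per-position halting scores (illustrative sketch).

    x            -- features of shape (H, W, C)
    conv_kernel  -- weights of shape (k, k, C), k odd, single output channel
    pool_weights -- weights of shape (C,) applied to globally pooled features
    bias         -- scalar bias
    Returns an (H, W) map with values in (0, 1).
    """
    height, width, _ = x.shape
    k = conv_kernel.shape[0]
    pad = k // 2
    # Zero padding keeps the output the same size as the input (assumption).
    padded = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    local = np.zeros((height, width))
    for di in range(k):
        for dj in range(k):
            # Cross-correlation, i.e. "convolution" in the deep learning sense.
            local += padded[di:di + height, dj:dj + width, :] @ conv_kernel[di, dj, :]
    global_term = x.mean(axis=(0, 1)) @ pool_weights  # same value everywhere
    logits = local + global_term + bias
    return 1.0 / (1.0 + np.exp(-logits))
```

With an all-zero `conv_kernel`, every position receives the same score, so all positions halt at the same unit and the behaviour reduces to that of ACT.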
Note that SACT is a more general model than ACT and, consequently, than standard ResNet. If we choose $\widetilde{W}^l = 0$, then the halting scores for all spatial positions coincide. In this case the computation for all the positions halts simultaneously and we recover the ACT model.
SACT requires evaluation of the residual function in just the active spatial positions. This can be performed efficiently using the perforated convolutional layer (with skipped values replaced by zeros instead of the nearest neighbor's values). Recall that the residual function consists of a stack of $1 \times 1$, $3 \times 3$ and $1 \times 1$ convolutional layers. The first convolutional layer has to be evaluated in the positions obtained by dilating the active positions set with a $3 \times 3$ kernel. The second and third layers need to be evaluated just in the active positions.
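The set of positions where the first convolutional layer must be evaluated can be obtained by a binary dilation of the active mask. The following NumPy sketch shows one way to do this; `dilate_mask` is a hypothetical helper, and the default kernel size of 3 is an assumption that matches the footprint of a 3x3 convolution in the middle of the residual function.

```python
import numpy as np

def dilate_mask(active, k=3):
    """Binary dilation of a boolean (H, W) mask with a k x k square kernel."""
    height, width = active.shape
    pad = k // 2
    padded = np.pad(active, pad)  # pads with False
    dilated = np.zeros((height, width), dtype=bool)
    for di in range(k):
        for dj in range(k):
            # A position is needed if any position in its window is active.
            dilated |= padded[di:di + height, dj:dj + width]
    return dilated
```

For a single active position in the interior of the map, the dilated mask covers the nine positions of its 3x3 neighbourhood.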
An alternative approach to using the perforated convolutional layer is to tile the halting scores map. Suppose that we share the values of the halting scores within $k \times k$ tiles. For example, we can perform pooling of the halting scores map with a kernel size $k \times k$ and stride $k$ and then upscale the results by a factor of $k$. Then, all positions in a tile have the same active flag, and we can apply the residual unit densely to just the active tiles, reusing the commonly available convolution routines. $k$ should be sufficiently high to mitigate the overhead of the additional kernel calls and the overlapping computations of the first convolution. Therefore, tiling is advisable when SACT is applied to high-resolution images.
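The pool-then-upscale step might look as follows in NumPy. This is a sketch under assumptions: the pooling type is not specified in the text, so average pooling is used here; nearest-neighbour upscaling is likewise assumed; and `tile_halting_scores` is a hypothetical name.

```python
import numpy as np

def tile_halting_scores(h, k):
    """Share halting scores within non-overlapping k x k tiles.

    h -- halting score map of shape (H, W), with H and W divisible by k
    k -- tile size (pooling kernel size and stride)
    """
    height, width = h.shape
    if height % k or width % k:
        raise ValueError("this sketch assumes H and W are divisible by k")
    # Average pooling with kernel k x k and stride k (assumed pooling type).
    pooled = h.reshape(height // k, k, width // k, k).mean(axis=(1, 3))
    # Nearest-neighbour upscaling by a factor of k in both dimensions.
    return np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)
```

After tiling, thresholding the cumulative scores yields an active mask that is constant within each tile, so active tiles can be processed with ordinary dense convolutions.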
The majority of the work on increasing the computational efficiency of deep convolutional networks focuses on static techniques. These include decompositions of convolutional kernels and pruning of connections. Many of these techniques made their way into the design of the standard deep architectures. For example, Inception and ResNet [16, 17] use factorized convolutional kernels.
Recently, several works have considered the problem of varying the amount of computation in computer vision. Cascaded classifiers [28, 44] are used in object detection to quickly reject "easy" negative proposals. Dynamic Capacity Networks use the same amount of computation for all images and rely on a heuristic specific to image classification. PerforatedCNNs vary the amount of computation spatially but not between images. Another approach tunes the amount of computation in a fully-connected network using a REINFORCE-trained policy, which makes the optimization problem significantly more challenging.
BranchyNet is the most similar approach to ours, although it is only applicable to classification problems. It adds classification branches to the intermediate layers of the network. As soon as the entropy of the intermediate classifications is below some threshold, the network's evaluation halts. Our preliminary experiments with a similar procedure based on ACT (using ACT to choose the number of blocks to evaluate) show that it is inferior to using fewer units per block.
We first apply ACT and SACT models to the image classification task for the ImageNet dataset. We show that SACT achieves a better FLOPs-accuracy trade-off than ACT by directing computation to the regions of interest. Additionally, SACT improves the accuracy on high-resolution images compared to the ResNet model. Next, we use the obtained SACT model as a feature extractor in the Faster R-CNN object detection pipeline on the COCO dataset. Again we show that we obtain a significantly improved FLOPs-mAP trade-off compared to basic ResNet models. Finally, we demonstrate that SACT ponder cost maps correlate well with the position of human eye fixations by evaluating them as a visual saliency model on the cat2000 dataset without any training on this dataset.
First, we train the basic ResNet-50 and ResNet-101 models from scratch using asynchronous SGD with momentum (see the supplementary text for the hyperparameters). Our models achieve similar performance to the reference implementation (https://github.com/KaimingHe/deep-residual-networks). For a single center crop, the reference ResNet-101 model achieves 76.4% accuracy and 92.9% recall@5, while our implementation achieves 76% and 93.1%, respectively. Note that our model is the newer pre-activation ResNet and the reference implementation is the post-activation ResNet.
We use ResNet-101 as the basic architecture for ACT and SACT models. Thanks to the end-to-end differentiability and deterministic behaviour, we find that the same optimization hyperparameters are applicable for training ACT and SACT as for the ResNet models. However, special care needs to be taken to address the dead residual unit problem in ACT and SACT models. Since ACT and SACT are deterministic, the last units in the blocks do not get enough training signal and their parameters become obsolete. As a result, the ponder cost saved by not using these units overwhelms the possible initial gains in the original loss function and the units are never used. We observe that while the dead residual units can be recovered during training, this process is very slow. Note that ACT-RNN is not affected by this problem since the parameters for all timesteps are shared.
We find two techniques helpful for alleviating the dead residual unit problem. First, we initialize the bias of the halting scores units to a negative value to force the model to use the last units during the initial stages of learning. We use in the experiments which corresponds to initially using units. Second, we use a two-stage training procedure by initializing the ACT/SACT network’s weights from the pretrained ResNet-101 model. The halting score weights are still initialized randomly. This greatly simplifies learning of a reasonable halting policy in the beginning of training.
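The effect of a negative halting bias can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the halting score weights are zero at initialization, so every non-last unit emits the same score sigmoid(bias); the function name and the example bias values are hypothetical and are not the values used in the experiments.

```python
import math

def initial_units_used(bias, num_units, eps=0.01):
    """Number of units evaluated when every halting score equals sigmoid(bias)."""
    h = 1.0 / (1.0 + math.exp(-bias))
    cumulative = 0.0
    for n in range(1, num_units):
        cumulative += h
        if cumulative >= 1.0 - eps:
            return n       # halting threshold reached at unit n
    return num_units       # the last unit always halts

# A more negative bias keeps more units in use at the start of training.
print(initial_units_used(0.0, 23))
print(initial_units_used(-2.0, 23))
```

The more negative the bias, the smaller each halting score and the more units are evaluated before the cumulative score reaches the threshold, which gives the last units of the block a chance to receive training signal.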
As a baseline for ACT and SACT, we consider a non-adaptive ResNet model with a similar number of floating point operations. We take the average numbers of units used in each block in the ACT or SACT model (for SACT we also average over the spatial dimensions) and round them to the nearest integers. Then, we train a ResNet model with that number of units per block. We follow the two-stage training procedure by initializing the network's parameters with the first residual units of the full ResNet-101 in each block. This slightly improves the performance compared to using random initialization.
ImageNet validation set. Comparison of ResNet, ACT, SACT and the respective baselines. Error bars denote one standard deviation across images. All models are trained with images of a fixed resolution. SACT outperforms ACT and baselines when applied to images whose resolutions are higher than the training images. The advantage margin grows as the resolution difference increases.
We compare ACT and SACT to ResNet-50, ResNet-101 and the baselines in fig. 7. We measure the average per-image number of floating point operations (FLOPs) required for evaluation of the validation set. We treat multiply-add as two floating point operations. The FLOPs are calculated just for the convolution operations (perforated convolution for SACT) since all other operations (non-linearities, pooling and output averaging in ACT/SACT) have minimal impact on this metric. The ACT models use and SACT models use . If we increase the image resolution at test time, we observe that SACT outperforms ACT and the baselines. Surprisingly, in this setting SACT has higher accuracy than the ResNet-101 model while being computationally cheaper. Such an accuracy improvement does not happen for the baseline models or ACT models. We attribute this to the improved scale tolerance provided by the SACT mechanism. The extended results of fig. 7(a,b), including the average number of residual units per block, are presented in the supplementary.
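The FLOPs accounting for a single convolution can be written down directly. The helper below is a sketch under one assumption not stated in the text: for a perforated convolution, the cost is taken to be proportional to the fraction of output positions that are actually evaluated. The function name and arguments are hypothetical.

```python
def conv_flops(out_h, out_w, kernel_h, kernel_w, in_channels, out_channels,
               active_fraction=1.0):
    """FLOPs of a convolution, counting each multiply-add as two operations.

    active_fraction -- share of output positions evaluated (1.0 for a dense
                       convolution; below 1.0 for a perforated one, assumed
                       to scale the cost linearly).
    """
    multiply_adds = (out_h * out_w * kernel_h * kernel_w
                     * in_channels * out_channels)
    return 2 * multiply_adds * active_fraction
```

For example, a dense 3x3 convolution with 64 input and 64 output channels on a 56x56 output map costs `conv_flops(56, 56, 3, 3, 64, 64)` operations, and the same layer evaluated at half of the positions costs half as much under the linear assumption.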
Motivated by the success of SACT in classification of high-resolution images and ignoring uninformative background, we now turn to a harder problem of object detection. Object detection is typically performed for high-resolution images (such as , compared to for ImageNet classification) to allow detection of small objects. Computational redundancy becomes a big issue in this setting since a large image area is often occupied by the background.
We use the Faster R-CNN object detection pipeline, which consists of three stages. First, the image is processed with a feature extractor. This is the most computationally expensive part. Second, a Region Proposal Network predicts a number of class-agnostic rectangular proposals (typically 300). Third, each proposal box's features are cropped from the feature map and passed through a box classifier which predicts whether the proposal corresponds to an object, predicts the class of this object and refines the boundaries. We train the model end-to-end using asynchronous SGD with momentum, employing TensorFlow's crop_and_resize operation, which is similar to the Spatial Transformer Network, to perform cropping of the region proposals. The training hyperparameters are provided in the supplementary.
We use ResNet blocks 1-3 as a feature extractor and block 4 as a box classifier. We reuse the models pretrained on the ImageNet classification task and fine-tune them for COCO detection. For SACT, the ponder cost penalty is only applied to the feature extractor (we use the same value as for ImageNet classification). We use COCO train for training and COCO val for evaluation (instead of the combined train+val set which is sometimes used in the literature). We do not employ multiscale inference, iterative box refinement or global context.
We find that SACT achieves a superior speed-mAP trade-off compared to the baseline of using non-adaptive ResNet as a feature extractor (see table 1). The SACT model has a slightly higher FLOPs count than ResNet-50 and a better mAP. Note that this SACT model outperforms the originally reported mAP for ResNet-101. Several examples are presented in fig. 10.
| Feature extractor | FLOPs (%) | mAP @ (%) |
| ResNet-50 (our impl.) | | |
| ResNet-101 (our impl.) | | |
We now show that SACT ponder cost maps correlate well with human attention. To do that, we use a large dataset of visual saliency: the cat2000 dataset. The dataset is obtained by showing 4,000 images of 20 scene categories to 24 human subjects and recording their eye fixation positions. The ground-truth saliency map is a heat map of the eye fixation positions. We do not train the SACT models on this dataset and simply reuse the ImageNet- and COCO-trained models. Cat2000 saliency maps exhibit a strong center bias. Most images contain a blob of saliency in the center even when there is no object of interest located there. Since our model is fully convolutional, it could not learn such a bias even if we trained it on the saliency data. Therefore, we combine our ponder cost maps with a constant center-biased map.
We resize the cat2000 images to for ImageNet model and to for COCO model and pass them through the SACT model. Following , we consider a linear combination of the Gaussian blurred ponder cost map normalized to range and a “center baseline,” a Gaussian centered at the middle of the image. Full description of the combination scheme is provided in the supplementary. The first half of the training set images for every scene category is used for determining the optimal values of the Gaussian blur kernel size and the center baseline multiplier, while the second half is used for validation.
Table 2 presents the AUC-Judd metric, the area under the ROC curve for the saliency map as a predictor of eye fixation positions. SACT outperforms the naïve center baseline and is competitive with the state-of-the-art deep model DeepFix. Examples are shown in fig. 11.
We present a Residual Network based model with a spatially varying computation time. This model is end-to-end trainable, deterministic and can be viewed as a black-box feature extractor. We show its effectiveness in image classification and object detection problems. The amount of per-position computation in this model correlates well with the human eye fixation positions, suggesting that this model captures the important parts of the image. We hope that this paper will lead to a wider adoption of attention and adaptive computation time in large-scale computer vision systems. The source code is available at https://github.com/mfigurnov/sact.
Acknowledgments. D. Vetrov is supported by Russian Academic Excellence Project ‘5-100’. R. Salakhutdinov is supported in part by ONR grants N00014-13-1-0721, N00014-14-1-0232, and the ADeLAIDE grant FA8750-16C-0130-001.
Learning to generate chairs with convolutional neural networks. CVPR, 2015.
Highway and residual networks learn unrolled iterative estimation. ICLR, 2017.
Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS, 2010.
A convolutional neural network cascade for face detection. CVPR, 2015.
Convolutional LSTM network: A machine learning approach for precipitation nowcasting. NIPS, 2015.
Optimization hyperparameters. ResNet, ACT and SACT use the same hyperparameters. We train the networks with 50 workers running asynchronous SGD with momentum , weight decay and batch size . The training is halted upon convergence after epochs. Learning rate is initially set to and lowered by a factor of after every epochs. Batch normalization parameters are: epsilon 1e-5, moving average decay
. The parameters of the network are initialized with a variance scaling initializer.
Data augmentation. We use the Inception v3 data augmentation procedure (https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py), which includes horizontal flipping, scale, aspect ratio and color augmentation. For the ImageNet images with a provided bounding box, we perform cropping based on the distorted bounding box. For evaluation, we take a single central crop of the original image and then resize this crop to the target resolution.
ResNet and SACT models use the same hyperparameters. The images are upscaled (preserving aspect ratio) so that the smaller side is at least 600 pixels. For data augmentation, we use random horizontal flipping. We do not employ the atrous convolution algorithm.
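The aspect-preserving upscaling rule can be expressed as a small helper. This is a sketch: the function name is hypothetical, and rounding the scaled sides to the nearest integer is an assumption, since the text does not say how fractional sizes are handled.

```python
def upscaled_size(height, width, min_side=600):
    """Target size so that the smaller side is at least `min_side` pixels."""
    smaller = min(height, width)
    if smaller >= min_side:
        return height, width           # already large enough, no resizing
    scale = min_side / smaller         # same factor for both sides
    return round(height * scale), round(width * scale)
```

For instance, a 480x640 image is scaled by 1.25 to 600x800, while a 700x900 image is left unchanged.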
Optimization hyperparameters. We use distributed training with 9 workers running asynchronous SGD with momentum and a batch size of . The learning rate is initially set to and lowered by a factor of 10 after the 800 thousandth and 1 millionth training iterations (batches). The training proceeds for a total of 1.2 million iterations. Batch normalization parameters are fixed to the values obtained on ImageNet during training.
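The step schedule described above can be sketched as a function of the number of completed training iterations. The initial learning rate is left as a parameter (`base_lr`, a hypothetical name), and treating the drop as taking effect from iteration 800,000 and 1,000,000 onwards is an assumption about the exact boundary.

```python
def detection_learning_rate(step, base_lr):
    """Piecewise-constant learning rate with two tenfold drops.

    step    -- number of completed training iterations (batches)
    base_lr -- initial learning rate
    """
    if step < 800_000:
        return base_lr
    if step < 1_000_000:
        return base_lr / 10
    return base_lr / 100   # used until training ends at 1.2 million iterations
```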
Faster R-CNN hyperparameters. Other than the training method, our hyperparameters for the Faster R-CNN model closely follow those recommended by the original paper. The anchors are generated in the same way, sampled from a regular grid of stride 16. One change relative to the original paper is the addition of an extra anchor size, so the full set of anchor box sizes is , with the height and width also varying for each choice of the aspect ratios . We use 300 object proposals per image. Non-maximum suppression is performed with IoU threshold. For each proposal, the features are cropped into a box with the crop_and_resize TensorFlow operation, then pooled
Here we describe the postprocessing procedure used in the visual saliency experiments. Consider a ponder cost map . Let be a Gaussian filter with a standard deviation .
We first normalize this map to :
and similarly for .
Then, we blur the ponder cost map by convolving the Gaussian filter with :
We obtain a baseline map by rescaling the reference centered Gaussian (https://github.com/cvzoya/saliency/blob/master/code_forOptimization/center.mat) to resolution. We use this map with a weight .
Finally, the postprocessed ponder cost map is defined as a normalized blurred ponder cost map plus the weighted center baseline map:
depends on two hyperparameters: the Gaussian filter standard deviation and baseline map weight . The values are tuned via grid search. In the experiments in the paper, we use , for both models.
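The postprocessing pipeline (normalize, blur, normalize again, add the weighted center baseline) might be sketched in NumPy as follows. Several details are assumptions made for illustration and are not confirmed by the text: min-max normalization to the unit range, a Gaussian kernel truncated at three standard deviations, edge padding at the borders, and all function and parameter names.

```python
import numpy as np

def normalize01(m):
    """Min-max normalization to [0, 1] (assumed normalization scheme)."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def gaussian_blur(m, sigma):
    """Separable Gaussian blur with edge padding; output has the input shape."""
    radius = max(1, int(np.ceil(3 * sigma)))       # truncate at 3 sigma
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(m, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def postprocess_ponder_map(ponder, center_baseline, sigma, alpha):
    """Normalized blurred ponder cost map plus a weighted center baseline.

    ponder          -- ponder cost map of shape (H, W)
    center_baseline -- baseline map of shape (H, W), already rescaled
    sigma           -- standard deviation of the Gaussian filter
    alpha           -- weight of the baseline map
    """
    blurred = gaussian_blur(normalize01(ponder), sigma)
    return normalize01(blurred) + alpha * np.asarray(center_baseline, dtype=float)
```

The two hyperparameters `sigma` and `alpha` would then be chosen by a grid search on held-out images.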
We present the extended results of ACT, SACT and ResNet models on ImageNet validation set, including the number of residual units per block, in table 3.
The ImageNet models in the paper are trained with images. Even though all the models are fully convolutional and can be applied to images of any resolution during test time, increasing the training resolution can improve the quality of the model at the cost of longer training and higher GPU memory requirements. We have explored training of the SACT model with a resolution of . This resolution is the highest we could fit into GPU memory for a batch size of 32 (decreasing the batch size deteriorates the accuracy). A comparison of SACT models trained with resolutions and is presented in fig. 12. Interestingly, both considered models achieve the highest accuracy at test resolution . This accuracy is for the training resolution of and for the training resolution of . In the second case, the FLOPs are higher. When the training resolution is increased, the accuracy at lower resolutions is predictably worse, while the accuracy at higher resolutions is better. The results suggest that training at higher resolutions is beneficial, and that the improved scale tolerance is not diminished when the training resolution is increased.