Scaling Local Self-Attention For Parameter Efficient Visual Backbones

03/23/2021 ∙ by Ashish Vaswani, et al.

Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.




1 Introduction

Vision and natural language processing (NLP) systems divide the landscape of computational primitives. While self-attention is the primary workhorse in NLP, convolutions are ubiquitous in nearly all vision models. Convolutions embody the principle of local processing, to learn local spatial features such as edges and texture that are abundant in images. On the other hand, the Transformer [vaswani2017attention] showed that self-attention is an effective and computationally efficient mechanism for capturing global interactions between words in a sentence. The success of self-attention in NLP motivates research in understanding how self-attention can improve vision. Self-attention has several properties that make it a good fit for vision: (a) content-based interactions as opposed to content-independent interactions of convolution; (b) parameter-independent scaling of receptive field size as opposed to parameter-dependent scaling of convolution; (c) empirical ability to capture long-range dependencies for use in larger images; (d) flexibility to handle and integrate multiple types of data that appear in vision, such as pixels [wang2018non, bello2019attention, ramachandran2017searching, zhao2020exploring], point clouds [yang2019modeling], sequence conditioning information [xu2015show], and graphs [li2019relation]. Self-attention may also be regarded as an adaptive nonlinearity paralleling a long history of nonlinear processing techniques in computer vision, such as bilateral filtering [paris2009bilateral] and non-local means [buades2005non].

Several recent papers [bello2019attention, ramachandran2019standalone, dosovitskiy2020image, zhao2020exploring, srinivas2021bottleneck] have attempted using self-attention primitives to improve image classification accuracy over the strong and commonly used ResNet backbones [he2016deep, he2016identity]. Among them, the Stand-Alone Self-Attention (SASA) [ramachandran2019standalone] is a fully self-attentive model that replaces every spatial convolution with local self-attention, which improves the performance of ResNet backbones while having fewer parameters and floating point operations. While conceptually promising, these models lag behind state-of-the-art convolutional models in image classification. State-of-the-art convolutional models [tan2019efficientnet, zoph2018learning, radosavovic2020designing] use a variety of scaling techniques to achieve strong performance across a range of computation and parameter regimes. In this work, we aim to develop and understand techniques for scaling local self-attention models to outperform some of the best convolutional models. Scaling self-attention models presents a unique set of challenges. For example, convolutions have been very efficiently mapped to matrix accelerators such as TPUs and GPUs that drive most deep learning workloads, but fast implementations of local 2D self-attention do not currently exist. To bridge this gap, we introduce a non-centered version of local attention that efficiently maps to existing hardware with haloing. While our formulation breaks translational equivariance, it improves both throughput and accuracies over the centered local self-attention used in SASA. We also introduce a strided self-attentive downsampling operation for multi-scale feature extraction.

We leverage these techniques to develop a new local self-attention model family, HaloNet, which achieves state-of-the-art performance across different parameter regimes. The largest HaloNet achieves 84.9% top-1 accuracy on the ImageNet [russakovsky2015imagenet] classification benchmark (Section 4.1). We perform a detailed study to uncover how self-attention and convolutional models scale differently. Our self-attention layers also show promising results on harder tasks such as object detection and instance segmentation (Section 4.6) using the Mask R-CNN framework on the COCO benchmark. Finally, we end with a discussion of current limitations and ideas for future work in applying self-attention to vision.

2 Models and Methods

Although our models use self-attention instead of convolutions for capturing spatial interactions between pixels, they adopt some important architectural features of modern convolutional neural networks (CNNs). Like CNNs, we compute multi-scale feature hierarchies [lin2017feature] which enable detecting objects at multiple sizes in tasks such as localization and instance segmentation. For this, we develop a strided self-attention layer, a natural extension of strided convolutions (Section 2.2). To deal with the computational cost in larger resolutions where global attention is infeasible, we follow the fairly general principle of local processing, which is at the heart of convolutions and natural perceptual systems [hubel1963shape, hubel1968receptive], and use spatially restricted forms of self-attention. However, unlike the model of [ramachandran2019standalone], which also uses local self-attention, we abstain from enforcing translation equivariance in favor of better hardware utilization, which improves the speed-accuracy tradeoff (Section 2.2). Also note that while we use local attention, our receptive fields per pixel are quite large (up to ) and we show in Section 4.2.2 that larger receptive fields help with larger images. In the remainder of this section, we will motivate self-attention for vision tasks and describe how we relax translational equivariance to efficiently map local self-attention to hardware.

Figure 1: HaloNet local self-attention architecture: The different stages of blocked local attention for a image, block size , and halo . The image is first blocked into non-overlapping images from which the queries are computed. The subsequent haloing step then extracts a memory around each of the blocks, which is linearly transformed into keys and values. The spatial dimensions after attention are the same as the queries.

Figure 2: The attention downsampling layer subsamples the queries but keeps the neighborhood the same as the stride=1 case.

2.1 Self-attention can generate spatially varying convolutional filters

Recent work [cordonnier2019relationship] has shown that self-attention with a sufficient number of heads and the right geometric biases can simulate convolutions, suggesting a deeper relationship between self-attention and convolutions. Self-attention has been viewed as a method to directly capture relationships between distant pixels [ramachandran2019standalone, hu2019local, wang2020axial]. It has also been interpreted as a specific instantiation of the classic technique of non-local means [buades2005non, wang2018non]. The perspective that we discuss in this section is one that views self-attention as generating spatially varying filters, in contrast to the reuse of the same filter across every spatial location in standard convolutions [elsayed2020revisiting]. To observe this, we write self-attention and convolution as specific instances of a general spatial pooling function. Given an input , where is the height, is the width, and is the number of input channels, we define a local 2D pooling function that computes an output at location , as

where is a function that returns a weight matrix at every location in a 2D window of size centered at . Note that later in this section, we introduce non-centered windows for self-attention, but we use centering here for ease of explanation. This computation is repeated for every pixel . For a convolution, returns a different linear transformation for each relative distance in the neighborhood, and these weights are shared across all . Weight sharing significantly reduces parameters and encourages learning features that repeat spatially. In dot-product relative self-attention [shaw2018self, ramachandran2019standalone, bello2019attention] (eqs. 3 and 2), every pixel in the neighborhood shares the same linear transformation which is multiplied by a scalar probability that is a function of both content-content and content-geometry interactions, resulting in weights that can vary spatially. As an example, for a ball and an orange at two different locations in an image, pixels inside the ball and the orange are likely to generate different attention probabilities because of the different content around them, such as color or texture.


For self-attention, , , and are learned linear transformations that are shared across all spatial locations, and respectively produce queries, keys, and values when used to transform . Spatial geometry is captured by , which is a learned relative position based embedding. The component captures the content-content interaction between the query pixel and a key pixel in the window. The component is the content-geometry interaction that captures the relationship between the query and the relative position of the key pixel [shaw2018self]. Note that this formulation preserves translational equivariance. If an object translates in an image, for any pixel within the object, the content around it stays the same, generating the same , thereby producing the same output after self-attention. To increase expressivity, multi-headed attention [vaswani2017attention] is used, which repeats this computation multiple times in parallel with different parameters, analogous to group convolutions [krizhevsky2012, xie2017aggregated].
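One way to write out the pooling view described above is sketched below. The notation is an assumption chosen for illustration (x for the input, y for the output, N_k(i, j) for the k × k window around pixel (i, j), W for convolution weights, W_Q, W_K, W_V for the query, key, and value projections, and r for the relative position embedding); it follows the prose definitions but may differ from the original typesetting.

```latex
% General local 2D pooling: a weighted sum over the window around (i, j)
y_{ij} = \sum_{(a,b) \in \mathcal{N}_k(i,j)} f(i,j,a,b)\, x_{ab}

% Convolution: the weight depends only on the relative offset
f^{\mathrm{conv}}(i,j,a,b) = W_{a-i,\,b-j}

% Relative self-attention: one shared value projection, scaled by a
% content-dependent probability (content-content + content-geometry terms)
f^{\mathrm{att}}(i,j,a,b) =
  \operatorname{softmax}_{ab}\!\left(
    (W_Q x_{ij})^{\top} W_K x_{ab} + (W_Q x_{ij})^{\top} r_{a-i,\,b-j}
  \right) W_V
```

In this form the convolution weight is fixed per offset and reused at every location, while the attention weight is a single shared matrix rescaled by a probability that changes with the content at (i, j) and (a, b).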

In the SASA model of [ramachandran2019standalone], the local window is a window centered around , just like a convolution. The size of this local window is an important setting to leverage in self-attention. Unlike dense convolutions, can grow without significantly increasing the number of parameters. Since the projection parameters (, , ) are independent of , the only parameters that increase with are . However, constitutes a trivial fraction of the parameters compared to the projection parameters (see footnote 1), so increasing does not impact the number of parameters of the layer significantly. In contrast, the number of parameters in a convolution layer scales quadratically with (e.g., a convolution has times the parameters of a convolution). On the other hand, the computational cost of self-attention grows quadratically with , preventing the use of very large values for .

Footnote 1: For a window size as large as , and dimensions per attention head, would add only parameters per layer because are shared among heads. In contrast, if the dimensions of the attention layer were , , , would contribute parameters. We show details in the appendix.
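The parameter argument above can be illustrated with a small counting sketch. The function names and the channel sizes are hypothetical, biases are ignored, and the relative position embedding is assumed to store one vector of head dimension per offset in the window, shared across heads; the appendix of the paper may count it differently.

```python
def conv_params(k, c_in, c_out):
    """Weights in a dense k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def local_attention_params(k, c_in, c_out, d_head):
    """Weights in one local self-attention layer under the stated assumptions."""
    projections = 3 * c_in * c_out  # query, key, value projections: independent of k
    rel_pos = k * k * d_head        # assumed: one d_head vector per offset, shared by heads
    return projections + rel_pos


if __name__ == "__main__":
    # Hypothetical layer sizes, chosen only for illustration.
    for k in (3, 7, 15):
        print(k, conv_params(k, 256, 256), local_attention_params(k, 256, 256, 16))
```

Under these assumptions the convolution count grows with k squared times the channel product, while the attention count is dominated by the projections and barely moves as k grows.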

2.2 Improving the speed-memory tradeoff by relaxing translational equivariance

Global self-attention, in which all locations attend to each other, is too expensive for most image scales due to the quadratic computation cost with respect to . Thus, multi-scale visual backbones need to use local attention to limit the size of . We follow the intuitive form of local attention developed in [ramachandran2019standalone], which tries to mimic the square neighborhoods used by convolutions. This form of local attention requires extracting local 2D grids around each pixel. Unfortunately, while deep learning libraries automatically handle neighborhood gathering for convolutions, no such neighborhood gathering function exists for local self-attention (or any general local function). Thus, implementing local self-attention requires explicitly gathering the local neighborhoods before the actual self-attention operation can be performed. While the implementation of this local neighborhood gathering function might initially appear to be a relatively minor implementation detail, in practice, it must actually be carefully designed to reduce memory usage while avoiding unnecessary extra computation. An unoptimized implementation can prevent self-attention models from scaling up due to either out-of-memory errors or excessive slowness. The following discussion frames the design considerations of this neighborhood gathering function.

Per Pixel
Per pixel windows
SASA [ramachandran2019standalone] , where
Blocked local (ours)
Table 1: Scaling behavior of self-attention mechanisms. is the number of heads, is the size of the block, is the total number of channels, and is the size of the halo

A straightforward approach would gather sized windows separately around each pixel. As summarized in Table 1 (Row 1), this method blows up the memory used by a factor of due to replicating the pixel contents for each of the neighborhoods it participates in. This solution quickly leads to out-of-memory errors. Global attention (Row 4) is at the other end of the spectrum, where all pixels share the same neighborhood, lowering memory at the expense of considerably more FLOPs (see footnote 2). This solution slows down models significantly, while also imposing memory problems due to the massive size of the attention matrix. A solution that lies in between these two extremes should trade off memory and compute appropriately, with the recognition that a small amount of waste is required.

Footnote 2: To illustrate this, on a resolution with channels, global self-attention would incur about times more FLOPs than a convolution with input and output channels.

A compromise solution can be achieved by leveraging the idea that neighboring pixels share most of their neighborhood. For example, two pixels that are right next to each other share pixels of their neighborhoods. Thus a local neighborhood for a block of pixels can be extracted once together, instead of extracting separate neighborhoods per pixel. The FLOPs can be controlled by varying the number of pixels that form a block. We name this strategy blocked local self-attention. The two extremes discussed above are a special case of blocked local self-attention. Global attention corresponds to setting the block size to be the entire spatial extent, while the per-pixel extraction corresponds to setting the block size to be 1.
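The memory trade-off between per-pixel gathering and blocked gathering can be sketched with simple arithmetic. This is an illustration derived from the description in the text rather than from Table 1; the names `b` (block size), `h` (halo size), and `k` (window size) and both function names are assumptions, and boundary effects are ignored.

```python
def per_pixel_gather_factor(k):
    """Each pixel is copied once for every k x k window it belongs to."""
    return k * k


def blocked_gather_factor(b, h):
    """A b x b block of queries shares one (b + 2h) x (b + 2h) neighborhood."""
    return (b + 2 * h) ** 2 / (b * b)


if __name__ == "__main__":
    print(per_pixel_gather_factor(7))       # per-pixel 7 x 7 windows
    print(blocked_gather_factor(b=8, h=3))  # one shared 14 x 14 neighborhood per 8 x 8 block
    print(blocked_gather_factor(b=1, h=3))  # block size 1 recovers the per-pixel case
```

With a block size of 1 the blocked factor reduces to the per-pixel factor, which matches the observation that per-pixel extraction is a special case of blocked local self-attention.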

Figure 1 depicts the different steps involved in executing blocked local self-attention for an image with height , width , and channels with stride . Blocking chops up the image into a tensor of non-overlapping blocks. Each block behaves as a group of query pixels and a haloing operation combines a band of pixels around them (with padding at boundaries) to obtain the corresponding shared neighborhood block of shape from which the keys and values are computed. attention operations then run in parallel for each of the query blocks and their corresponding neighborhoods, illustrated with different colors in Figure 1. SASA [ramachandran2019standalone] used the same blocking strategy (see footnote 3), setting and using attention masks to emulate pixel-centered neighborhood windows of size . For example, to achieve a pixel centered window, [ramachandran2019standalone] set . The use of attention masks gives the operation translational equivariance, since each pixel only looks at a square window around it.

Footnote 3: Code for both SASA and HaloNet will be made available, along with the checkpoints for HaloNet.
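The blocking and haloing steps can be sketched in NumPy as follows. This is a minimal, unoptimized illustration and not the authors' implementation: it uses a single head, omits the relative position term, zero-pads the boundaries without masking the padded keys, applies the usual 1/sqrt(d) scaling, and loops over blocks instead of running them in parallel. The function and argument names (`blocked_local_attention`, `b`, `h`, `stride`) are assumptions.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def blocked_local_attention(x, w_q, w_k, w_v, b, h, stride=1):
    """Single-head blocked local self-attention with haloing (unmasked).

    x: [H, W, c] input; w_q, w_k, w_v: [c, d] projections.
    b: query block size, h: halo size, stride: query subsampling stride.
    Returns an array of shape [H // stride, W // stride, d].
    """
    H, W, c = x.shape
    assert H % b == 0 and W % b == 0 and b % stride == 0
    d = w_q.shape[1]
    q_side = b // stride
    # Zero-pad so that blocks at the image boundary also get a full halo.
    xp = np.pad(x, ((h, h), (h, h), (0, 0)))
    out = np.zeros((H // stride, W // stride, d))
    for i in range(H // b):
        for j in range(W // b):
            # Query block, optionally subsampled for downsampling.
            block = x[i * b:(i + 1) * b:stride, j * b:(j + 1) * b:stride]
            # Shared neighborhood: the block plus a halo of h pixels per side.
            nbhd = xp[i * b:i * b + b + 2 * h, j * b:j * b + b + 2 * h]
            q = block.reshape(-1, c) @ w_q
            mem = nbhd.reshape(-1, c)
            k, v = mem @ w_k, mem @ w_v
            # Every query in the block attends to the whole neighborhood.
            y = softmax(q @ k.T / np.sqrt(d)) @ v
            out[i * q_side:(i + 1) * q_side,
                j * q_side:(j + 1) * q_side] = y.reshape(q_side, q_side, d)
    return out
```

Setting `b` to the full spatial extent with `h=0` reduces this sketch to global attention, and a stride greater than 1 subsamples the queries while leaving each block's neighborhood unchanged.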

However, the downside of using attention masks is that it wastes computation that must happen regardless due to the implementation of this algorithm. If attention masks are not used, the receptive field increases without any additional computation, as shown in Table 1 (Rows 2 and 3). However, pixel-level translational equivariance is lost because the non-square receptive fields mean that the output of a pixel is dependent on which block it falls into. Take for example a pixel at the left edge of its block, which sees additional pixels that are to the right of its square receptive field. If the entire image is shifted one pixel to the right, the pixel now falls into the right edge of a neighboring block, and now sees additional pixels that are to the left of its square receptive field. Thus the output of the pixel is dependent on its position in a block, which can change if the image shifts. Another perspective is that blocked local self-attention is only translationally equivariant to shifts of size . While pixel-level translational equivariance is considered important for achieving good performance [zhang2019making], we find that empirically, using a non-masked block local self-attention actually improves the accuracy of the model (see Section 4.3). We suspect that the image shifting and cropping perturbations in common data augmentation strategies reduce the reliance on such inductive biases. Thus we adopt unmasked blocked local self-attention because it improves accuracy without sacrificing performance.
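The position dependence described above can be made concrete with a small sketch along one spatial axis. The function name `visible_offsets` and the symbols `b` (block size) and `h` (halo size) are assumptions used only for illustration.

```python
def visible_offsets(pos, b, h):
    """Range of relative offsets seen by a query at column `pos` of its block.

    Without attention masks, the query sees the whole block plus its halo,
    so the range depends on where the query sits inside the block.
    """
    assert 0 <= pos < b
    leftmost = -(pos + h)
    rightmost = (b - 1 - pos) + h
    return leftmost, rightmost


if __name__ == "__main__":
    b, h = 8, 3
    print(visible_offsets(0, b, h))      # left edge: extra pixels to the right
    print(visible_offsets(b - 1, b, h))  # right edge: extra pixels to the left
```

A masked, pixel-centered window would give every query the same symmetric range of plus or minus `h`; in the unmasked case the range always spans `b + 2h` positions but shifts with the query's position in the block.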

Another difference with SASA is our implementation of downsampling. We replace attention followed by post-attention strided average pooling with a single strided attention layer that subsamples queries similar to strided convolutions, as shown in Figure 2. Note that we use the same neighborhood as is extracted in the stride 1 case (Figure 1). This change does not impact accuracy while also reducing the FLOPs in the downsampling layers. We also implement some important algorithmic optimizations that improve our throughput primarily by avoiding reshapes and data formatting operations. In the interest of space, we list them in Appendix D. Taken together, the speedups produced by these improvements are significant as seen in Figure 3, with improvements of up to 2× in step time. These improvements can be leveraged to train large self-attention models that were previously too expensive. We leave additional optimizations, such as fused operations and better pipelining of memory accesses with computation, to future work.

Figure 3: Optimizations improve performance. The improvements here are a result of reducing FLOPs with our attention downsampling and improved local self-attention algorithms that avoid reshapes and data formatting. In some cases, we halve the training step time computed on TPU v3.

To conclude this section, it’s important to note that in the deeper layers of multiscale architectures, smaller spatial dimensions and larger channels would shift the compute calculus in favor of global attention. The models we introduce in Section 4 also take advantage of this, typically using local attention in the higher resolutions and global attention when the image resolutions are the smallest.


conv stride 2, 64
max pool stride 2
global average pooling
fc, 1000

HaloNet model family specification.

2.3 HaloNet

Using the implementation of local 2D self-attention with haloing detailed above, we propose a new model, HaloNet, that matches state-of-the-art convolutional models on the parameter-accuracy trade-off curve. We leverage the structure of ResNets [he2016deep] that stack multiple residual bottleneck blocks together (see Table 2.2). HaloNet uses a few minor modifications from ResNets: (a) adding a final convolution before the global average pooling for larger models, following EfficientNet [tan2019efficientnet], (b) modifying the bottleneck block width factor, which is traditionally fixed at , (c) modifying the output width multiplier of the spatial operation, which is traditionally fixed at , (d) changing the number of blocks in the third stage from to for computational reasons because attention is more expensive in the higher resolution layers. We also fix the number of heads for each of the four stages to because heads are more expensive at higher resolutions. To summarize, the scaling dimensions in HaloNet are: image size , query block size , halo size , attention output width multiplier , bottleneck output width multiplier , number of bottleneck blocks in the third group , and final conv width . Our attention neighborhoods range from () to ().

Since the ResNet structure was initially designed for convolutions, we suspect that designing architectures specifically for attention may improve HaloNet. In our work we maintained homogeneity across all layers of the model for hyperparameters such as the block and halo sizes. We also hope that using automated architecture search methods [tan2019efficientnet] to optimize these hyperparameters for specific accelerators will lead to better local attention architectures. In our work, we train with comparable image sizes as EfficientNet models to determine if attention models can scale to larger images.

3 Related Work

Attention has steadily risen in adoption in vision models in recent years. First introduced in various forms of sequence modeling [graves2013generating, bahdanau2014neural, vaswani2017attention], attention was used to attend to image features in the text generation module of image captioning models [xu2015show]. Attention is also closely related to non-local means [buades2005non], a pairwise-weighted global sum of pixels originally developed for image denoising. [wang2018non] applied non-local means on top of spatially downsampled convolutional features to improve video classification. However, since these methods scale quadratically with receptive field size, they cannot be used because the spatial size is too large. In order to apply self-attention, [parmar2018image] applies local attention on images for the task of image generation. [bello2019attention] spatially downsample the features for attention and concatenate the attention outputs to convolutional features. Instead, we directly build on top of the approach of [ramachandran2019standalone], who compute attention on local regions in order to build a fully self-attentional vision model for classification and object detection. Different forms of attention for pure self-attention vision models have also been proposed [hu2019local, zhao2020exploring], which are orthogonal and complementary to the focus on scaling in this work. In addition to attention over the spatial extent that we focus on, components that perform attention over channels have also been used to augment convolutional models [hu2018squeeze, li2019selective]. In recent and concurrent work, Vision Transformer [dosovitskiy2020image] shows that applying transformers on projections of non-overlapping image patches can achieve accuracies comparable to SOTA when pre-trained on very large (JFT-300M [sun2017revisiting]) and medium sized (ImageNet-21k [imagenet_cvpr09]) classification datasets. However, their models do not adopt a multiscale architecture and our focus in this work is training on ImageNet [russakovsky2015imagenet] from scratch. In Section 4.5, we conduct transfer experiments and compare with ViT and BiT [kolesnikov2019big].

Generally, the performance of computational primitives tends to improve over time due to algorithmic changes to the primitive and better software implementations. In particular, convolutions have improved over the last decade through changes in (a) the computation of the primitive [chellapilla2006high, jia2014caffe, mathieu2013fast, vasilache2014fast, winograd1980arithmetic, lavin2016fast]; (b) the software implementation [chetlur2014cudnn]; (c) the structure of the primitive itself, through for example, grouped convolution [xie2017aggregated] and depthwise separable convolution [sifre2014rigid]. Attention is in the beginning phases of this performance improvement trajectory, and given its importance in sequence modeling [vaswani2017attention], it will likely see sustained effort to enhance performance. Local attention could also receive performance improvements if it is adopted more widely to combat the general problem of processing large inputs. Our work introduces blocked local attention to efficiently process immediate neighbors. Other forms of non-global pixel interaction can also be implemented efficiently [child2019generating, ho2019axial, wang2020axial, bello2021lambdanetworks].

4 Experiments

Each HaloNet model (H0–H7) is designed by successively growing the values of the hyperparameters defined in Table 2.2. In the interest of space, we leave the exact configurations of our models to Appendix C.2. We also leave the training and evaluation of larger HaloNet models that compare with larger EfficientNet models for future work.

Figure 4: HaloNets can match EfficientNets on the accuracy vs. parameter trade-off. The accuracies for EfficientNets B5 and B7 were obtained using RandAugment.

4.1 HaloNets are competitive with state-of-the-art convolutional models

We train our HaloNet models on the ImageNet [russakovsky2015imagenet] (ILSVRC-2012) benchmark with a batch size of and learning rate of , which is linearly warmed up for 10 epochs and followed by cosine decay [loshchilov2016sgdr]. The models are trained for epochs with Nesterov’s Accelerated Gradient [nesterov1983, icml2013_sutskever13], and regularized with dropout [srivastava2014dropout], weight decay, RandAugment [cubuk2019randaugment] and stochastic depth [huang2016deep].
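The learning rate schedule described above (linear warmup followed by cosine decay) can be sketched as a function of the epoch. The function name and the arguments `base_lr` and `total_epochs` are placeholders, since the actual values are not recoverable here; only the 10-epoch warmup comes from the text, and decaying to exactly zero is an assumption.

```python
import math


def learning_rate(epoch, base_lr, total_epochs, warmup_epochs=10):
    """Linear warmup to base_lr, then cosine decay to zero (assumed end value)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical settings, for illustration only.
    for epoch in (0, 5, 10, 55, 100):
        print(epoch, learning_rate(epoch, base_lr=1.0, total_epochs=100))
```

In practice the schedule is usually evaluated per training step rather than per epoch; the same formula applies with fractional epochs.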

We find that HaloNets perform on par or slightly better (Figure 4) than EfficientNet models for the same parameters, outperforming other model families. Our best model, H7, achieves 84.9% top-1 ImageNet validation accuracy and 74.7% top-1 accuracy on ImageNet V2 [recht2019imagenet] (with a -0.5% gap to the linear fit in [recht2019imagenet]). For each of our HaloNet models, we use image sizes comparable to the corresponding EfficientNet model, training on image sizes up to (Table A2). For a comparison of our latencies with EfficientNet, the reader can refer to Section 5. To the best of our knowledge, these results are the first to show that self-attention based models for vision perform on par with the SOTA for image classification when trained on ImageNet from scratch. Note that for all our experiments, we report accuracies at the end of training and we tune regularization hyperparameters such as augmentation hyperparameters for the baselines and HaloNet models.

4.2 Model study 1: comparing self-attention and convolutions

In the following sections, we will focus on model studies to distinguish the advantages of self-attention over convolutions for vision and understand how to best design self-attention vision architectures. This knowledge is important since much of the progress in convolutional networks comes from improvements in architecture design while keeping the core convolution primitive the same [krizhevsky2012, szegedy2016rethinking, he2016deep]. We believe our study is the first to explicitly examine the design of optimal self-attention vision architectures.

For the remainder of the experimental section, we compare with ResNet-50 [he2016identity], the canonical vision model, because many of the components that we ablate have been well studied for ResNet-50, allowing us to use best practices for the baseline model. We tune our baseline ResNet-50 implementation to achieve a better accuracy, 77.6%, compared to commonly reported numbers in the literature. For example, [he2016deep] report 76.3%. We then create a new HaloNet architecture, HaloNet-50, that exactly matches the ResNet-50 architecture by replacing spatial convolutions with local self-attention. HaloNet-50 and ResNet-50 have about million and million parameters respectively. We train both for epochs on image size. We share other training details of the ablation set-up in the appendix.

4.2.1 Transfer of convolutional components to self-attention

Utilizing regularizations and architectural modules beyond the core primitive is critical for achieving strong results [he2019bag]. In this section, we study the effects of these additional components on self-attention models. The components we study were all designed for use in convolutional models, as they were developed through experimentation (either human or automated search) on convolutional models. We examine whether these components can successfully transfer to the new model family of self-attention networks.

We focus on 4 different components based on the design of EfficientNet [tan2019efficientnet], 2 architecture modules and 2 regularizations: Squeeze-and-Excitation (SE) [hu2018squeeze], a channel attention module used after the spatial convolution; SiLU/Swish-1 [ramachandran2017searching, elfwing2018sigmoid, hendrycks2016gaussian], an activation function with the form x · sigmoid(x); RandAugment (RA) [cubuk2019randaugment], a data augmentation scheme that simplifies AutoAugment [cubuk2019autoaugment]; and Label Smoothing (LS) [szegedy2016rethinking], a smoothing of the label distribution.
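Two of the components listed above are simple enough to sketch directly. The SiLU/Swish-1 activation is x · sigmoid(x), and label smoothing mixes the one-hot target with a uniform distribution. The function names and the smoothing strength `eps` below are illustrative assumptions; the value used in the experiments is not stated here.

```python
import math


def silu(x):
    """SiLU / Swish-1: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def smooth_labels(label, num_classes, eps):
    """Label smoothing: (1 - eps) on the one-hot target plus eps spread uniformly."""
    off = eps / num_classes
    return [1.0 - eps + off if i == label else off for i in range(num_classes)]


if __name__ == "__main__":
    print([round(silu(v), 4) for v in (-2.0, 0.0, 2.0)])
    print(smooth_labels(label=1, num_classes=4, eps=0.1))
```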

The results of adding these various components to the baseline are in Table 2. Surprisingly, regularizations of the same strength improve HaloNet accuracies significantly more than ResNet, despite HaloNet having around 30% fewer parameters than ResNet. When label smoothing and RandAugment are added, HaloNet improves by 1.3% while ResNet improves by 0.8%. This result suggests that self-attention models may require regularizations that are typical of larger convolutional models, perhaps due to the expressivity of self-attention.

However, the architecture modules that were developed for convolutional models only improve attention models by a small amount. When Squeeze-and-Excitation (SE) and SiLU/Swish-1 are added, ResNet improves by 1.3% while HaloNet only improves by 0.4%. We speculate that HaloNet models benefit from the gating and multiplicative interactions that comprise self-attention and do not need explicit gating such as SE. Further research must be conducted in order to discover architecture modules that can consistently improve a variety of self-attention models. Inspired by these findings, we decided to use label smoothing, SiLU/Swish-1, and RandAugment in our HaloNet models. We also use stochastic depth for our larger models [huang2016deep, tan2019efficientnet].

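Components | HaloNet accuracy (%) | Δ | ResNet accuracy (%) | Δ
Baseline | 78.6 | 0.0 | 77.6 | 0.0
+ LS | 79.7 | 1.1 | 78.1 | 0.5
+ LS, RA | 79.9 | 1.3 | 78.4 | 0.8
+ SE | 78.6 | 0.0 | 78.6 | 1.0
+ SE, SiLU/Sw1 | 79.0 | 0.4 | 78.9 | 1.3
+ LS, SE | 79.7 | 1.1 | 78.9 | 1.3
+ LS, SE, SiLU/Sw1 | 79.9 | 1.3 | 79.1 | 1.5
+ LS, SE, SiLU/Sw1, RA | 80.5 | 1.9 | 79.5 | 1.9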
Table 2: HaloNet improves more than ResNet with regularizations, but does not improve significantly with architectural modules that strongly benefit ResNet. Starting from a baseline model, we add label smoothing (LS), RandAugment (RA), Squeeze-and-Excitation (SE), and SiLU/Swish-1 (SiLU/Sw1).

4.2.2 Increasing image sizes improve accuracies

A beneficial property of self-attention is that the receptive field size can scale along with image size without significantly impacting the number of parameters (see Section 2.1). As shown in Figure 6, HaloNet consistently improves when using larger images. Although we also see improvements with convolutional models, the accuracy gap between HaloNets and ResNets is maintained.

4.3 Model study 2: HaloNet architecture study

Figure 5: Relaxing translational equivariance improves accuracies

In this section, we study the impact of relaxing translational equivariance and the relationship between neighborhood window and halo sizes. In the interest of space, a detailed study of scaling various components of our models can be found in Appendix B.

Figure 6: The accuracy gap between HaloNet-50 and ResNet-50 is maintained with increasing image sizes. The HaloNet experiments are annotated with block size and halo size.
Relaxing translational equivariance:

In Figure 5, we see that HaloNet-50 with , and achieves better accuracies using the same block and halo to achieve neighborhoods with attention masks  [ramachandran2019standalone] and the gap widens with more regularizations. This suggests that larger receptive fields are more important than inductive biases such as translational equivariance.

Window and halo size:

When using the blocked input format, there are two ways of changing the window size of attention: changing the query block size or the halo size. For the same window size, smaller query blocks and larger halos require more memory than larger query blocks and smaller halos, as discussed in Section 2.2.
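The trade-off can be illustrated with a small calculation. This is a rough sketch, assuming the window spans the query block plus a halo on each side (window = block + 2 × halo) and that key/value memory grows with the number of gathered neighborhood pixels per query pixel. The function name, the cost model, and the example sizes are illustrative assumptions, not the accounting from Section 2.2.

```python
def halo_overhead(block, halo):
    """Return (window size, gathered pixels per query pixel) for one block.

    Assumption: the halo is added on both sides of the query block.
    """
    window = block + 2 * halo
    # A block of block**2 queries gathers window**2 keys and values.
    overhead = (window ** 2) / (block ** 2)
    return window, overhead


# Two hypothetical ways of reaching the same window of 14 pixels.
for block, halo in [(8, 3), (12, 1)]:
    window, overhead = halo_overhead(block, halo)
    print(f"block={block} halo={halo} window={window} overhead={overhead:.2f}x")
```

Under this simple model, the smaller block with the larger halo gathers more neighborhood pixels per query pixel for the same window, which matches the direction of the memory claim above.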

We see in Figure 7 that accuracy consistently improves as the window size increases. In particular, doubling the window size from to produces a accuracy gain. These results suggest that increasing window size can be successfully used to scale models without increasing the number of parameters, which is potentially beneficial for production environments. Furthermore, for a fixed window size, the choice of query block size does not impact results, enabling the use of larger query block sizes to reduce memory. Figure 7 also shows that eschewing haloing in favor of non-overlapping attention can lower accuracy significantly unless the blocks are quite large. For example, using a block size of and a halo of results in better accuracy than using a block size of with halo, despite a smaller neighborhood size.

4.4 Convolution-Attention hybrids improve the speed-accuracy tradeoff

In our final set of ablations, we replace self-attention with convolutions to understand where attention layers are currently most beneficial. In Table 3, we show results for replacing attention layers with convolutions with squeeze-and-excitation modules in each of the stages of our best performing model (HaloNet H7). Having convolutions in all stages except the last yields the fastest model, albeit with a significant loss in top-1 accuracy (1%). Splitting the allocation between convolutions (in stages 1–2) and attention (in stages 3–4) costs very little predictive accuracy while significantly improving training and inference step times. We leave a detailed study of improved hybrid models for future work.

Conv stages | Attention stages | Acc (%) | Normalized time
-           | 1, 2, 3, 4       | 84.9    | 1.9
1           | 2, 3, 4          | 84.6    | 1.4
1, 2        | 3, 4             | 84.7    | 1.0
1, 2, 3     | 4                | 83.8    | 0.5
Table 3: Replacing attention layers with convolutions in stages 1 and 2 exhibits the best speed vs. accuracy tradeoff. All the models had about million parameters and the train and inference times are normalized to the corresponding times for EfficientNet B7. Please see Figure 8 for a comparison of step time with other HaloNet models.
Figure 7: Increasing window sizes improves accuracy up to a point. The experiments in the graph have been annotated with their block size and halo size. A halo size of zero implies attention with non-overlapping blocks.
Backbone                  | Box AP       | Box AP (S) | Box AP (M) | Box AP (L) | Mask AP      | Mask AP (S) | Mask AP (M) | Mask AP (L) | Speed (ms) | Train time
R50 baseline in lit       | 42.1         | 22.5       | 44.8       | 59.1       | 37.7         | 18.3        | 40.5        | 54.9        | 409        | 14.6
R50 + SE (our baseline)   | 44.5 (+2.4)  | 25.5       | 47.7       | 61.2       | 39.6 (+1.9)  | 20.4        | 42.6        | 57.6        | 446        | 15.2
R50 + SE + Local Att ()   | 45.2 (++0.7) | 25.4       | 48.1       | 63.3       | 40.3 (++0.7) | 20.5        | 43.1        | 59.0        | 540        | 15.8
R50 + SE + Local Att ()   | 45.4 (++0.9) | 25.9       | 48.2       | 63.0       | 40.5 (++0.9) | 21.2        | 43.5        | 58.8        | 613        | 16.5
R101 + SE (our baseline)  | 45.9 (+3.8)  | 25.8       | 49.5       | 62.9       | 40.6 (+2.9)  | 20.9        | 43.7        | 58.7        | 740        | 17.9
R101 + SE + Local Att ()  | 46.8 (++0.9) | 26.3       | 50.0       | 64.5       | 41.2 (++0.6) | 21.4        | 44.3        | 59.8        | 799        | 18.4
Table 4: Accuracies on object detection and instance segmentation. We experiment with two settings for self-attention in the last stage: a block size () of and a halo size () of , and also with () for ResNet-50. Box AP refers to detection, and Mask AP refers to segmentation. The suffixes S, M, and L refer to small, medium, and large objects respectively. Speed is measured as the milliseconds taken by only the backbone (and not the FPN) for a batch size of on TPUv3 cores. The train time is the total training time calculated from the peak images/sec of the Mask-RCNN training run on 8 TPUv3 cores with a batch size of 64.
Model                  | Params (M) | Pretraining image size (pixels) | Pretraining step time (32 per core) | Finetuning image size | Accuracy (%) | Inference speed
H4 (base 128)          | 85         | 256                             | 377 ms                              | 384/512               | 85.6/85.8    | 121.3/48.6
H4 (base 128, Conv-12) | 87         | 256                             | 213 ms                              | 384/512               | 85.5/85.8    | 257.6/120.2
ViT-L/16               | 300        | 224                             | 445 ms                              | 384/512               | 85.2/85.3    | 74.6/27.4
BiT-M                  | 928        | 224                             | 1021 ms                             | 384                   | 85.4         | 54.2
Table 5: HaloNet models pretrained on ImageNet-21k perform well when finetuned on ImageNet. For HaloNet and ViT, we finetuned on images of size 384 and 512. The pretraining step time reports the TPUv3 compute time for a batch size of 32 per core. The inference speed is also computed on a single TPUv3 core.

4.5 Transfer from ImageNet-21k

Our experiments thus far have focused on training from scratch on ImageNet-ILSVRC-2012 [russakovsky2015imagenet], where regularizations and longer training are critical for good accuracies. Papers such as [dosovitskiy2020image, kolesnikov2019big] have shown that a short finetuning step after pretraining models on larger labelled datasets such as ImageNet-21k [imagenet_cvpr09] or JFT-300M [sun2017revisiting] can achieve better accuracies without the need for regularization. To understand the transfer properties of HaloNet models, we scale up HaloNet-H4 by increasing the base width to 128 and evaluate the transfer protocol from [kolesnikov2019big], pretraining on the public ImageNet-21k dataset and finetuning on ImageNet. Following our observation in Table 3, we also train a hybrid version of this model with convolutions in the first two stages. For a fair comparison with [kolesnikov2019big], we do not use squeeze-and-excitation [hu2018squeeze] in the stages with convolutions. The details of the models can be found in Appendix E. ImageNet-21k contains million annotated images, and 21k labels, both an order of magnitude larger than ImageNet. Following [kolesnikov2019big], we pretrain for epochs with a batch size of , and a base learning rate of , which is linearly warmed up for epochs followed by cosine decay [loshchilov2016sgdr]. We also use a weight decay of , and train with Nesterov’s Accelerated Gradient [nesterov1983, icml2013_sutskever13] during pretraining and finetuning. We pretrain on images of size 256 and finetune on different image sizes, as shown in Table 5. Our wider H4 and hybrid-H4 models achieve better accuracy than the Vision Transformer and a wide ResNet-152 from [kolesnikov2019big] and are also faster at inference on larger images. We finetune for epochs on ImageNet, initializing with the parameters learned from pretraining except for the label embedding matrix, which is initialized to zeros. We train with a batch size of , a learning rate of and cosine decay after linearly warming it up for epochs. We benefit from a label smoothing of during finetuning despite pretraining on a larger dataset. We do not use Polyak averaging [polyak1992acceleration] or other regularizations during finetuning.
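The schedule shape described above (linear warmup followed by cosine decay) can be sketched as follows. This is a minimal illustration, not the training code used for these experiments: the function name, the per-step granularity, and every numeric value in the usage loop are assumptions chosen only to show the shape of the curve.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        # Ramp up linearly, reaching base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, for illustration only.
for step in (0, 9, 10, 55, 100):
    print(step, round(warmup_cosine_lr(step, 100, 10, 1.0), 4))
```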

We believe our preliminary results on transfer are promising since we achieve better parameter-accuracy and speed-accuracy tradeoffs than other models on this dataset. We leave the study of transfer with larger HaloNet and HaloNet hybrids for future work. The speed advantages of our models on larger images make them desirable for challenging structured prediction tasks on large images such as object detection and instance segmentation, which we briefly explore in the next section.

4.6 Detection and instance segmentation

To understand if our primitives will generalize to structured prediction tasks on larger images, we conduct initial investigations with the simple attention-convolutional hybrids on detection and instance segmentation, using the Mask R-CNN [he2017mask] framework. These hybrids are also faster and consume less memory than pure attention models, enabling faster experimental cycles. We only replace the last 3 convolutional layers in the ResNet-50 and ResNet-101 backbones with two halo layers with block size, and halo size (Rows 3 and 6 in Table 4). For ResNet-50, we also examine using and halo size to understand benefits from larger receptive fields. We also use squeeze-and-excitation with convolutions and pre-train them on images with the regularizations mentioned in Section 4.2.1: label smoothing, RandAugment, and stochastic depth. We train our models on the COCO dataset [lin2014microsoft] with size images for epochs, using the Cloud TPU Detection Codebase. We provide more training details in Appendix C.4.

Our ResNet-50 baseline in row 2 of Table 4 is significantly better than what is usually reported in the literature (row 1). Our attention variants achieve at least 0.7 mAP gains on bounding box detection and at least 0.6 mAP gains on instance segmentation on top of our stronger baselines (denoted by ++ in rows 3, 4 and 6 in Table 4). The gain from local attention with block size closes half of the mAP gap between the R50 and R101 baselines in detection and 70% of the gap in instance segmentation despite being less than a third of the gap in terms of wall-clock time. Local attention with and also improves on top of the deep R101 backbone. Interestingly, localization of large objects () shows the largest improvement when attention is used. Larger block sizes ( in row 4) achieve very close performance to while being slower. However, we see that does much better than on small objects (). Future work can combine the best of these two settings. Note that with , the last two attention layers do global attention since the image is downsampled to pixels in each spatial dimension. Concurrent work, BoTNet [srinivas2021bottleneck], uses global self-attention in ResNet-Attention hybrids for structured prediction tasks and classification. See [srinivas2021bottleneck] for additional details on the efficacy of global attention for localization tasks.
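The gap-closing figures quoted above follow from the Table 4 entries. A minimal sketch of the arithmetic, using the box AP, mask AP, and backbone speed values of rows 2, 3, and 5 (the helper name `fraction_of_gap` is introduced here purely for illustration):

```python
def fraction_of_gap(baseline, variant, stronger):
    """Share of the baseline-to-stronger gap that the variant covers."""
    return (variant - baseline) / (stronger - baseline)


# Values from Table 4: R50 + SE, R50 + SE + Local Att (row 3), R101 + SE.
box_ap = fraction_of_gap(44.5, 45.2, 45.9)
mask_ap = fraction_of_gap(39.6, 40.3, 40.6)
speed_ms = fraction_of_gap(446, 540, 740)

print(f"box AP gap closed:    {box_ap:.0%}")
print(f"mask AP gap closed:   {mask_ap:.0%}")
print(f"extra backbone time:  {speed_ms:.0%} of the R50-to-R101 gap")
```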

These models have only three layers of self-attention, and more layers could alter these results. We leave the study of detection and instance segmentation with pure attention models to future work.

5 Discussion

Figure 8: Pure attention based HaloNet models are currently slower to train than EfficientNet models. The times are the TPUv3 compute time needed to process a batch size of per core. The points in green with annotations C1, C12, and C123 correspond to the hybrid models with convolutions in stages 1, 1–2 and 1–3 respectively (see Table 3).

In this work, we built multiscale self-attention models that are competitive with the best convolutional models. To achieve this result, we developed two attention improvements: blocked local attention and attention downsampling. We also performed multiple ablations to understand how to improve the scaling of self-attention models.

Our results demonstrate that self-attention is competitive accuracy-wise when training on ImageNet from scratch. Figure 8 shows that pure self-attention based HaloNets (by pure attention we mean models that use self-attention in all layers except the stem, which is convolutional) are currently slower to train than the corresponding EfficientNets and require further optimizations for large batch training. However, our hybrids have the same speed-accuracy tradeoff as EfficientNets. On transfer from ImageNet-21k, our models outperform very strong models such as BiT [kolesnikov2019big] and ViT [dosovitskiy2020image], on both accuracy and speed. Model optimizations, such as using architecture search methods to find better speed-accuracy tradeoffs or more powerful and/or efficient forms of attention [zhao2020exploring, roy2020efficient], are promising directions for machine learning researchers. Implementation optimizations, such as better memory management, can improve the practicality of these models. Also, scaling up our models to larger widths might cause our operations to transition from being memory bound to compute bound, and lead to better speed-accuracy tradeoffs. We leave this study for future work. Overall, our work shows that self-attention can be competitive in regimes traditionally dominated by convolutional models. Future work can push these boundaries further, both in terms of scale and efficiency.

6 Acknowledgements

We would like to thank David Fleet for valuable discussions. We would also like to thank Irwan Bello, Barret Zoph, Mingxing Tan, and Lucas Beyer for valuable commentary on earlier drafts of the paper.



A Relative embeddings add very few parameters to the model

Our parameters grow very slowly with receptive field. In this section, we will show that the number of parameters in the relative embeddings, the only spatially dependent parameters, is quite small. As described in the paper, the output of local 2D self-attention at position is computed as:


where the queries , keys , and values are linear transformations of the pixels, and is a learned relative position based embedding. Following the Transformer [vaswani2017attention], we also use multihead attention, where we run multiple instances of the self-attention in parallel with different parameters. However, each head shares the parameters for the relative embeddings . For an attention window of size around each pixel, we factorize the relative embeddings along height and width following [ramachandran2019standalone], and we allocate half the channels within a head to each of these. Keeping the dimension per head fixed at as mentioned in the paper, this gives a constant parameters per attention layer for . In contrast, if the channels in an attention layer are , then each of the three linear transformations has parameters. Thus the ratio of parameters in the relative embeddings as compared with the linear projections is , which is small for typical values of and .
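The counting argument above can be sketched numerically. The sketch assumes one embedding vector per relative offset along each axis (so `window` offsets per axis), half of the per-head channels assigned to each axis, embeddings shared across heads, and square bias-free linear projections for queries, keys, and values. The function names and the example sizes are hypothetical and are not taken from the paper.

```python
def relative_embedding_params(window, head_dim):
    """Parameters in factorized relative embeddings shared across heads."""
    per_axis = window * (head_dim // 2)  # half of the head channels per axis
    return 2 * per_axis                  # height axis plus width axis


def projection_params(channels):
    """Parameters in the query, key, and value projections (no biases)."""
    return 3 * channels * channels


def embedding_to_projection_ratio(window, head_dim, channels):
    return relative_embedding_params(window, head_dim) / projection_params(channels)


# Hypothetical sizes, for illustration only.
print(relative_embedding_params(14, 16))
print(projection_params(256))
print(embedding_to_projection_ratio(14, 16, 256))
```

Under these assumptions the embedding count grows linearly with the window size, while the projection count grows quadratically with the channel width, which is why the ratio stays small.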

Dimension | Baseline value | Scaled value | Accuracy (%) | Δ
Layers    | 50             | 98           | 81.4         | 0.9
          | 1.0            | 3.0          | 81.0         | 0.5
          | 1.0            | 1.25         | 80.9         | 0.4
          | 4.0            | 6.5          | 80.6         | 0.1
          | 1.0            | 6.5          | 80.3         | -0.2
Table A1: Increasing the number of channels for the values and number of layers has the most impact on accuracy.
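Model | Block size | Halo size | Attn. width mult. | Bottleneck width mult. | Layers | Blocks in group 3 | Image size | Final conv width | Params (M) | EfficientNet params (M) | EfficientNet image size
H0    | 8          | 3         | 1.0               | 0.5                    | 50     | 7                 | 256        | -                | 5.5        | B0: 5.3                 | 224
H1    | 8          | 3         | 1.0               | 1.0                    | 59     | 10                | 256        | -                | 8.1        | B1: 7.8                 | 240
H2    | 8          | 3         | 1.0               | 1.25                   | 62     | 11                | 256        | -                | 9.4        | B2: 9.2                 | 260
H3    | 10         | 3         | 1.0               | 1.5                    | 65     | 12                | 320        | 1024             | 12.3       | B3: 12                  | 300
H4    | 12         | 2         | 1.0               | 3                      | 65     | 12                | 384        | 1280             | 19.1       | B4: 19                  | 380
H5    | 14         | 2         | 2.5               | 2                      | 98     | 23                | 448        | 1536             | 30.7       | B5: 30                  | 456
H6    | 8          | 4         | 3                 | 2.75                   | 101    | 24                | 512        | 1536             | 43.4       | B6: 43                  | 528
H7    | 10         | 3         | 4                 | 3.5                    | 107    | 26                | 600        | 2048             | 67         | B7: 66                  | 600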
Table A2: Configurations of HaloNet models, each of which matches a model from the EfficientNet family in terms of parameters. The number of heads in the four stages are . The notations are: image size , query block size , halo size , attention output width multiplier , bottleneck output width multiplier , number of bottleneck blocks in the third group , and final conv width

B Study of enlarging self-attention models

In Section 4.3, we presented some scaling properties of our models. In Table A1, we try to understand which other parts of our models most impact accuracy. For our study, we increase the size of HaloNet-50 by scaling different hyperparameters to reach a parameter budget of million. We find that adding more computation in the attention by increasing the number of channels for the values, and adding more layers, are the most fruitful scaling dimensions for increasing accuracy.

C Experimental details, hyperparameters

In this section, we list the experimental details and model configurations that were omitted from the main body in the interest of space.

C.1 Experimental details for model studies

In Sections 4.2 and 4.3, all the HaloNet-50 models use the same layer allocations and channel widths as the standard ResNet-50 [he2016deep] model. Both ResNet-50 and HaloNet-50 models were trained for epochs on size images with a learning rate of . We used a weight decay of for the settings that used RandAugment [cubuk2019randaugment], and otherwise. Using a weight decay of with RandAugment seemed to have a negative impact on accuracies with ResNet-50. We used a RandAugment magnitude of in these sections. For HaloNet-50, we used a block size , and halo . We fixed the number of channels per head to be . For the SASA models in section, we used a pixel centered window of size following [ramachandran2019standalone].

C.2 HaloNet Models

In Table A2, we describe the configurations of our HaloNet models, H0–H7. The hyperparameters in the HaloNet family are: image size , query block size , halo size , attention output width multiplier , bottleneck output width multiplier , number of bottleneck blocks in the third group , and final conv width . Each of our HaloNet models is trained on a comparable image size to the corresponding EfficientNet [tan2019efficientnet] model, which can be found in Table A2.

C.3 Classification hyperparameters

In this section we complete the details of our training and regularization setup. We used a weight decay of and a cosine annealing scheme [loshchilov2016sgdr] with learning rate . The largest models consistently overfit at the very end of training, which we attribute to the learning rate going to 0 at the end of training [yu2020bignas]. To combat this, we set the end of the cosine annealing to be of the original learning rate instead of 0. For RandAugment [cubuk2019randaugment], we grow our RandAugment magnitudes from the smallest to the largest models as and . Note that we have not extensively tuned the RandAugment magnitudes.
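A minimal sketch of cosine annealing that ends at a fraction of the original learning rate rather than at zero, as described above. The parameter name `end_fraction` and all numeric values are assumptions for illustration; the fraction actually used is not restated here.

```python
import math


def cosine_lr_with_floor(step, total_steps, base_lr, end_fraction):
    """Cosine annealing from base_lr down to end_fraction * base_lr."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    end_lr = base_lr * end_fraction
    return end_lr + (base_lr - end_lr) * cosine


# Hypothetical values: the final rate is 1% of the initial rate, not zero.
for step in (0, 50, 100):
    print(step, cosine_lr_with_floor(step, 100, 0.1, 0.01))
```

Setting `end_fraction` to 0 recovers the standard schedule that decays all the way to zero.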

C.4 Detection and instance segmentation hyperparameters

We use Mask-RCNN [he2017mask] for all detection and instance segmentation experiments. We pretrain the backbone on ImageNet, mostly reusing the same hyperparameters as in Section C.3. Backbones are pretrained for epochs using an image size of , which was chosen to be closer to the image size used in the detection setting. The models were regularized with RandAugment at a magnitude of and stochastic depth with probability , and use Squeeze-Excitation with a reduction factor of . The detection code and hyperparameters directly used the open-source TPU detection and segmentation framework. During the detection / instance segmentation phase, the backbone is initialized with the pretrained weights, while the other parameters are initialized from scratch. The model is trained for steps with x learning rate decays at and steps, uses a learning rate of in SGD with momentum, a warmup of steps with a fixed learning rate of , a batch size of spread across TPUv3 cores, image size, an L2 weight decay of , and multi-scale jitter with magnitudes between .

D Optimizations

We endeavor to avoid data formatting operations whenever possible, since they can slow down the model. This results in the following two key optimizations:

  • Persistent blocking: Once the image is blocked, we flatten the blocks to sequences of length , and we do not reshape it back to 4D until the end of the network, implementing operations such as batch normalization [ioffe2015batch] to handle the blocked format. The image is thus processed in 5D instead of 4D.

  • Gathers with convolutions: The haloing described in Section 2.2 is also carried out in 5D resulting in flattened neighborhoods. For speed, we implement haloing with 3D convolutions used as gathering operations instead of slices and concatenations.
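The two optimizations above operate on a blocked layout in which each query block is paired with a flattened, haloed neighborhood. The following sketch shows that layout for a single-channel 2D image in plain Python. It uses ordinary slicing rather than the convolution-based gather described above, and it assumes zero padding at the image border; the function name and the toy image are hypothetical.

```python
def block_and_halo(image, block, halo):
    """Split a 2D image into flattened query blocks and haloed neighborhoods."""
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image sides must be divisible by the block size")
    # Zero-pad by `halo` pixels on every side (padding value is an assumption).
    padded = [[0] * (width + 2 * halo) for _ in range(height + 2 * halo)]
    for r in range(height):
        padded[r + halo][halo:halo + width] = image[r]
    window = block + 2 * halo
    queries, neighborhoods = [], []
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Flatten each block to a sequence of block * block pixels.
            queries.append([image[r][c]
                            for r in range(top, top + block)
                            for c in range(left, left + block)])
            # In padded coordinates the neighborhood starts at (top, left).
            neighborhoods.append([padded[r][c]
                                  for r in range(top, top + window)
                                  for c in range(left, left + window)])
    return queries, neighborhoods


# Toy 4x4 image with pixel values 1..16, block size 2, halo size 1.
image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
queries, neighborhoods = block_and_halo(image, block=2, halo=1)
print(queries[0])
print(neighborhoods[0])
```

Each query sequence has `block * block` entries and each neighborhood has `(block + 2 * halo) ** 2` entries, so neighboring blocks share the pixels that fall inside their halos.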

E ImageNet-21k Models

For our ImageNet-21k transfer experiments (Table 5), we make 3 changes to our HaloNet H4 model (see Table A2 for the specification of the H4 model). To increase the number of parameters in the model body, we increase the base width by a factor of 2.0 (making the base width 128, twice the normal width), and we also change from to the default . We remove the final extra convolution, so that the label embeddings have a large number of filters to account for the larger number of labels. For the hybrid model, we use convolutions in the first two stages.

For pretraining on images, we set and . For finetuning on images, we use , , and for finetuning on size images, we use , . When transferring the pretrained model, we initialize all the parameters from the pretrained checkpoint at the final step of pretraining except for the label embeddings, which are initialized to zeros, and the relative embeddings, which are initialized by linearly interpolating from the ones learned at pretraining.
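One way to picture the interpolation of relative embeddings is to resample a table of per-offset vectors to a new number of offsets along one axis. The sketch below does this with endpoint-aligned linear interpolation. The function name, the endpoint alignment, and the per-axis treatment are all assumptions for illustration; the exact procedure used for these models is not specified here.

```python
def interpolate_embeddings(table, new_length):
    """Linearly resample a list of embedding vectors to new_length rows."""
    old_length = len(table)
    if old_length < 2 or new_length < 2:
        raise ValueError("both lengths must be at least 2")
    resampled = []
    for i in range(new_length):
        # Position of the new row on the old index axis (endpoints aligned).
        pos = i * (old_length - 1) / (new_length - 1)
        lo = int(pos)
        hi = min(lo + 1, old_length - 1)
        w = pos - lo
        resampled.append([(1.0 - w) * a + w * b
                          for a, b in zip(table[lo], table[hi])])
    return resampled


# Toy table with 2 offsets and 2 channels, resampled to 3 offsets.
pretrained = [[0.0, 10.0], [2.0, 20.0]]
print(interpolate_embeddings(pretrained, 3))
```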

F Impact of relative position encodings

[ramachandran2019standalone] showed that using relative position was important for achieving good accuracies. We find the same outcome with HaloNet. Using factorized absolute position encodings, which are added to the activations before local self-attention in every layer, drops accuracy from 78.6% (the first row in Table 2) to