Locally Shifted Attention With Early Global Integration

by Shelly Sheynin, et al.

Recent work has shown the potential of transformers for computer vision applications. An image is first partitioned into patches, which are then used as input tokens for the attention mechanism. Due to the expensive quadratic cost of the attention mechanism, either a large patch size is used, resulting in coarse-grained global interactions, or alternatively, attention is applied only on a local region of the image, at the expense of long-range interactions. In this work, we propose an approach that allows for both coarse global interactions and fine-grained local interactions already at early layers of a vision transformer. At the core of our method is the application of local and global attention layers. In the local attention layer, we apply attention to each patch and its local shifts, resulting in virtually located local patches, which are not bound to a single, specific location. These virtually located patches are then used in a global attention layer. The separation of the attention layer into local and global counterparts allows for a low computational cost in the number of patches, while still supporting data-dependent localization already at the first layer, as opposed to the static positioning in other visual transformers. Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet. Code is available at: https://github.com/shellysheynin/Locally-SAG-Transformer.




1 Introduction

Convolutional neural networks have dominated computer vision research and enabled significant breakthroughs in solving many visual tasks, such as image classification [krizhevsky2012imagenet, simonyan2014very] and semantic segmentation [long2015fully]. Typically, CNN architectures begin by applying convolutional layers of a small receptive field for low-level features, resulting in local dependencies between neighbouring image regions. As processing continues and features become more semantic, the effective receptive field is gradually increased, capturing longer-ranged dependencies.

Inspired by the success of Transformers [vaswani2017attention] for NLP tasks, a new set of attention-based approaches has emerged for vision-based processing. The Vision Transformer (ViT) [dosovitskiy2020image] is the first model to rely exclusively on the Transformer architecture for obtaining competitive image classification performance. ViT divides the input image into patches of a fixed size and considers each patch as a token to which the transformer model is applied. The attention mechanism between these patches results in global dependencies between pixels, already at the first transformer layer. Due to the quadratic cost of the attention mechanism [vaswani2017attention, dosovitskiy2020image] in the number of patches, fixed-size partitioning is performed. As a result, ViT does not benefit from the built-in locality bias that is present in CNNs: neighbouring pixels within a patch may be highly correlated, but this bias is not encoded into the ViT architecture. That is, ViT encodes inter-patch correlations well, but not intra-patch correlations. Further, each image may require a different patch size and location, depending on the size and location of objects in the image.
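To make the patch-size trade-off concrete, the following sketch counts tokens and pairwise attention scores for a few patch sizes. The 224×224 resolution matches the ImageNet setting used later in the paper; the patch sizes are illustrative assumptions, not values taken from the text.

```python
def attention_pairs(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (number of tokens, number of pairwise attention scores)."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide the image size")
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


for patch in (32, 16, 4):  # illustrative patch sizes (assumption)
    tokens, pairs = attention_pairs(224, 224, patch)
    print(f"patch {patch:2d}: {tokens:5d} tokens, {pairs:,} score entries")
```

Shrinking the patch side by a factor of 4 multiplies the token count by 16 and the number of attention scores by 256, which is why fine-grained patches are rarely used with plain global attention.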

| Method | Token Embedding | Hierarchy | Early Layer Attention | Late Layer Attention |
|---|---|---|---|---|
| ViT [dosovitskiy2020image] | Linear (non-overlapping) | No | Global | Global |
| DeiT [touvron2020deit] | Linear (non-overlapping) | No | Global | Global |
| PvT [wang2021pyramid] | Linear (non-overlapping) | Yes | Global | Global |
| CvT [wu2021cvt] | Partial Convolution (overlapping) | Yes | Partially Local + Global | Partially Local + Global |
| NesT [zhang2021aggregating] | Convolution (non-overlapping) | Yes | Local | Local + Global |
| Swin [ze2021swin] | Linear (non-overlapping) | Yes | Local | Local + Global |
| ViT [xiao2021early] | Convolution (overlapping) | No | Global | Global |
| Ours | Convolution (overlapping) | Yes | Local + Global | Global |
Table 1: A comparison of our method with baselines on key elements. We consider the Token Embedding in terms of the type of embedding and whether overlapping patches are considered tokens. Next, we consider whether a Hierarchy is used, in which a pyramid of attention layers is applied to decreasing input resolutions. We then consider the type of attention used. We divide this into Early Layer Attention, in which low-level features are learned, and Late Layer Attention, in which higher-level features are learned. CvT applies only partial convolution, which includes depth-wise separable convolution with a stride of 2 (as opposed to 1) for the key and value projection. As such, only part of the local neighbourhood of patches is considered for both early and late layers. In NesT and Swin, only a local neighborhood of patches is considered, but as the hierarchy progresses, coarser patches are considered, so a local neighborhood covers almost the entire spatial dimension, resulting in a global-like attention. Our method is unique in that it applies fully local and global attention already at the first layer, with full convolutional embedding over the entire local neighborhood of patches. This is possible due to the separation of the attention into local and global components, which reduces overall complexity.

Recent approaches [ze2021swin, zhang2021aggregating, wang2021pyramid, wu2021cvt] have attempted to alleviate the need for this fixed-size partition, thus enjoying some of the benefits of CNNs. PvT [wang2021pyramid], for instance, applies attention in a pyramidal fashion, with increasing patch size at each level. However, an initial fixed partition of the image into non-overlapping patches is still performed, so finer sub-patch correlations are still not captured. In another line of work, NesT [zhang2021aggregating] and Swin [ze2021swin] apply attention in a localized fashion, over local regions in the image. As a result, they cannot capture global correlations between distant patches in the image. CvT [wu2021cvt] considers overlapping patches, thus capturing both inter-patch correlations and intra-patch correlations. As a result, a large number of patches is considered, and CvT does not scale well to large images due to the quadratic cost of the attention mechanism in the number of patches.

Our approach combines the locality bias of CNNs, for both coarser and finer details, with the ability to attend globally to all patches in the image. This is done already at the first layer, for low-level features. The method scales well to large images, since it does not incur the prohibitive quadratic cost of considering all overlapping patches. It is based on the observation that the optimal location for each patch varies from image to image, depending on object locations and sizes. Therefore, instead of considering a single patch at a given location, we consider an ensemble of patches. This ensemble consists of the conventional fixed patch location and of the patches obtained by small horizontal and vertical shifts of each patch. By employing this shift property, the ensemble can capture more precisely the finer details of the object patches, which are necessary for the downstream task.

To avoid the expensive quadratic cost of computing self-attention over all ensembles of patches, we split each attention layer into two consecutive attention operations, which accumulate both local and global information for each patch. In the local attention layer, we apply self-attention to each patch with its local shifts. This step allows the fixed patches to gain information from a rich collection of variations, where each variant represents an alternative to the location of the patch. This way, we construct a virtually located patch as a weighted sum of all possible shifts of the fixed patch. In the global attention layers, we take the virtually located patches and apply the standard global self-attention between them. This step allows each patch to gain global information from all other patches, where each patch was optimized in the previous local layer by considering all of its local shifts. Global attention is applied in a number of layers, in a pyramidal fashion, where at each layer a coarser resolution is considered.

Our method obtains state-of-the-art performance in image classification across a variety of model sizes on CIFAR10, CIFAR100 [krizhevsky2009learning], and ImageNet [deng2009imagenet]. At the same time, our method is efficient and can scale well to large image resolutions.

2 Related Work

Transformers for Vision Models The Transformer, first introduced in [vaswani2017attention], revolutionized the field of NLP. Multiple attempts have been made to incorporate attention-based techniques for image classification. The Vision Transformer (ViT) [dosovitskiy2020image] marked a turning point for Transformers in vision. ViT showed excellent results compared to existing CNN-based networks, while requiring fewer computational resources. Later attempts incorporated the locality bias of CNNs within a transformer architecture. DeiT [touvron2020deit] introduced a teacher-student strategy specifically for Transformers, using an additional distillation token, in which the teacher is a CNN. This enabled training vision Transformers with the standard ImageNet dataset, removing ViT's need for a larger-scale pre-training dataset. ViT [xiao2021early] showed that applying a convolutional stem on input patches, instead of a linear one, and only then applying standard ViT-like attention blocks, boosts performance. These approaches do not use a hierarchy of attention blocks with decreasing input resolution, and assume a fixed partition of the image into patches.

Subsequent approaches introduced such a hierarchy into their design. PvT [wang2021pyramid] adopts a pyramidal structure, where the input spatial resolution is decreased, similarly to CNNs, and the patch size is gradually increased. However, an initial fixed partition of the image into non-overlapping patches limits the ability of the attention mechanism to capture finer sub-patch details.

Local and Global Attention Approaches such as the Swin Transformer [ze2021swin] and NesT [zhang2021aggregating] attempt to alleviate this issue by focusing on localized self-attention in non-overlapping blocks, and aggregating the blocks. This partition considers very small, even pixel-sized patches, so correlation between surrounding pixels can be considered, thus reinforcing the locality bias of CNNs. However, this comes at the expense of considering only a part of the image at a given scale, while ignoring global dependencies between all patches in the image.

CvT is a recent approach by [wu2021cvt] that applies the attention mechanism over overlapping patches of the image at different scales, thus capturing finer details along with global dependencies between distant patches. Different overlapping patches are encoded using convolution. However, due to the quadratic cost of the attention mechanism, not all local overlapping patches can be considered. Hence, a stride of 2 (as opposed to 1) is used for the key and value projection. Further, the convolution operation is restricted to be a depth-wise separable convolution, restricting the space of possible solutions.

Unlike previous methods, our method separates the attention mechanism into a local stage and a global stage, and applies both the local and global components already at the first layer. At the local stage, finer details can be aggregated over the local region of each patch, producing new patches, which incorporate fine details. At the global stage, a standard attention computation is applied between all newly aggregated patches. Following the first layer, subsequent aggregation is done in a pyramidal fashion over a number of scales, resulting in an efficient transformer architecture, which can scale well to large images, and which incorporates the locality bias at both fine and coarse scales.

Tab. 1 shows the key differences of our method from previous work. Crucially, our method applies non-partial local and global attention at the very first layer. As discussed in Sec. 3, assuming that tokens result from the patch embedding (non-overlapping patches), we can consider a large number of patch variants (at most ), while the overall complexity remains . This results in a much larger number of tokens being used in our attention mechanism, while also considering the entire image region globally.

3 Method

We first describe the formation of the patch embeddings. Then, we discuss the two types of attention layers: the local attention layer operates locally on each patch and its shifts, while the global attention is applied globally, using a hierarchy. Finally, we describe the implementation details.

3.1 Local Shift Embeddings

Figure 1: (a) Illustration of six image shifting variants out of the total , as defined in Sec. 3.1. A grid is shown over each image shift variant. In each variant, the grid represents a different division of the image into patches. Patches shown in light blue and red indicate two different patch-shifting variants. The color of the grid is indicative of the variant index. (b) The same six variants are shown for the light blue (top) and red (bottom) patches in (a). The figure illustrates the sub-optimality of the fixed partition used in previous work. For example, the dog's left eye is fully captured only in the "orange partition" and the nose only in the "yellow partition". A fixed partition would only capture one of those concepts.

Given an image , we consider a partition of the image into patches of size , resulting in a map of patches, similarly to ViT [dosovitskiy2020image]. For each patch of this map, we consider shifted patches. A patch is created by shifting patch , pixels horizontally and pixels vertically, where and . Each such shift is identified by the pair .

This process results in patches of size . The subset of chosen shift variants among all possible shifts is a hyperparameter described in Sec. 4; in Tab. 5, we report our results with subsets of varying size.

For each of the variants of patch , we construct a dimensional embedding. These embeddings are obtained by shifting the entire image by and , and applying circular padding at the edges. Each resulting image, referred to as an image shifting variant, is then passed through a convolutional layer with kernel, stride and output channels. This results in feature maps of size . One can view each feature map as tokens of size . We define to be the token corresponding to patch and shift index , which corresponds to shift . An illustration of our shifting variants is provided in Fig. 1.
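The construction of image shifting variants can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the convolution is modelled with kernel size equal to stride equal to the patch size (non-overlapping tiles), whereas Table 1 describes the paper's embedding as an overlapping convolution; one projection per variant is used, following the Table 2 caption; and the function name, argument names, and shift sign convention are hypothetical.

```python
import numpy as np


def shifted_patch_tokens(image, shifts, patch, weights):
    """Embed every patch of every circularly shifted copy of `image`.

    image:   (H, W, C) array
    shifts:  list of (dx, dy) pixel offsets; (0, 0) first (non-shifted variant)
    weights: (num_variants, patch * patch * C, d), one projection per variant
    returns: (num_variants, num_patches, d)
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("patch size must divide the image size")
    tokens = []
    for idx, (dx, dy) in enumerate(shifts):
        # np.roll wraps pixels around the border, i.e. circular padding.
        shifted = np.roll(image, shift=(dy, dx), axis=(0, 1))
        # Cut into non-overlapping patch x patch tiles and flatten each tile.
        tiles = shifted.reshape(h // patch, patch, w // patch, patch, c)
        tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
        tokens.append(tiles @ weights[idx])
    return np.stack(tokens)


rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))
shifts = [(0, 0), (1, 0), (0, 1)]
weights = rng.normal(size=(len(shifts), 4 * 4 * 3, 8))
tokens = shifted_patch_tokens(image, shifts, patch=4, weights=weights)
print(tokens.shape)
```

Shifting the whole image once per variant, rather than cropping each shifted patch separately, lets a single strided pass produce all tokens of that variant.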

Positional embedding A learned positional embedding is added for each patch and variant convolutional embedding . That is, , where is a learned positional embedding of size . In Sec. 4.1, we consider a different strategy, whereby a single dimensional positional embedding is learned for all variants of a given patch, which results in worse performance.

, for , is then the set of embeddings for the shift-variants associated with each patch of size . We assign indexes such that the first variant, , is the embedding of the non-shifted (identity) patch, where .

3.2 Local-global Attention

We first apply the local attention layer. This attention layer considers each patch independently. For each , we aggregate the embeddings of the local shifted variants, using an attention mechanism. This results in an updated embedding of size for each patch. Next, global attention layers are applied using a pyramidal structure.

Figure 2: Illustration of our local attention layer. Patch is shown in blue; variants of are shown in different colors. All patches are first encoded using a convolutional layer, which, together with the positional encoding, results in a dimensional embedding for each patch and variant . Given , a query embedding is generated as in Eq. 1. For each patch variant, a key embedding is generated as in Eq. 2. Both the query and key embeddings are used to create using Eq. 3, a softmax probability vector indicating the importance of each patch variant. Finally, is multiplied by the value embeddings (generated using Eq. 2 for all patch variants) as in Eq. 4, to generate the final output .

Local Attention An illustration of our local attention layer is shown in Fig. 2. Denote by the number of patches for each variant.

The first step is to calculate a query value for each non-shifted patch embedding :


where is a matrix constructed from all non-shifted patch embeddings for and is a learned query matrix. Next, we calculate keys and values for each patch embedding and variant:


That is, is a matrix constructed from all patch embeddings over all variants. is a learned key and value matrix. We note that, for each patch, while keys and values are computed from all variants, queries are obtained only from the non-shifted variant.

We now wish to construct an attention matrix . We consider each patch separately. That is, given we extract the patch-specific query vector . Similarly, we can view and as tensors in . For each patch, we consider the patch-specific key and value matrices . We now apply multi-head attention separately for each patch:


where is the dimension of each attention head. For each patch, this results in a pseudo-probability vector indicating the weight of each patch variant. The pooled value for patch is given by:


is then constructed by pooling all aggregated patch embeddings .
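The four steps referenced as Eq. 1 to Eq. 4 can be written in the following form. The notation is an assumption introduced for this sketch and may differ from the authors' symbols: $n$ patches, $m$ shift variants per patch, embedding dimension $d$, head dimension $d_h$, $X_1 \in \mathbb{R}^{n \times d}$ the non-shifted embeddings, $X \in \mathbb{R}^{nm \times d}$ the embeddings of all variants, and $K_i, V_i \in \mathbb{R}^{m \times d}$ the rows of $K$ and $V$ belonging to patch $i$. The text describes a single learned key and value matrix; two symbols $W_K$ and $W_V$ are used here for readability.

```latex
\begin{align}
Q &= X_{1} W_{Q}, & Q &\in \mathbb{R}^{n \times d} \tag{1}\\
K &= X W_{K}, \quad V = X W_{V}, & K, V &\in \mathbb{R}^{nm \times d} \tag{2}\\
a_{i} &= \operatorname{softmax}\!\left(\frac{q_{i} K_{i}^{\top}}{\sqrt{d_{h}}}\right), & a_{i} &\in \mathbb{R}^{m} \tag{3}\\
\hat{x}_{i} &= a_{i} V_{i}, & \hat{x}_{i} &\in \mathbb{R}^{d} \tag{4}
\end{align}
```

Here $q_i$ is the $i$-th row of $Q$, so each patch contributes one query and $m$ keys, and $a_i$ is the per-patch weight vector over shift variants.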

Following the local attention layer, we apply a feed-forward fully-connected network (FFN) with skip-connection [he2016deep] and LayerNorm (LN)  [ba2016layer]. The application of all of these components together is referred to as a local attention block.
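A minimal single-head NumPy sketch of the local attention step described above is given below. It assumes the variant embeddings are arranged as an array of shape (patches, variants, channels) with the non-shifted variant at index 0; the function names and the use of separate key and value projections are illustrative choices, and the FFN, skip-connection, and LayerNorm of the full block are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_attention(x, w_q, w_k, w_v):
    """Single-head local attention over shift variants.

    x: (n, m, d) embeddings; x[:, 0] is the non-shifted variant.
    Returns (pooled, weights) with shapes (n, d_h) and (n, m).
    """
    q = x[:, 0] @ w_q   # (n, d_h): queries come from the non-shifted patches only
    k = x @ w_k         # (n, m, d_h): keys from every variant
    v = x @ w_v         # (n, m, d_h): values from every variant
    scores = np.einsum("nd,nmd->nm", q, k) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # one distribution over variants per patch
    pooled = np.einsum("nm,nmd->nd", weights, v)
    return pooled, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 18, 32))  # 64 patches, 18 variants, 32 channels (illustrative)
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(32, 32)) for _ in range(3))
pooled, weights = local_attention(x, w_q, w_k, w_v)
print(pooled.shape, weights.shape)
```

Because each patch attends only to its own variants, the score array has shape (n, m) rather than (n·m, n·m).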

Global Attention Following the local attention block, we are given tokens (for each patch) with an embedding of size . As in the standard setting of transformers [touvron2021training], a multi-head self-attention [vaswani2017attention] is applied to the tokens. The global attention block consists of a multi-head attention layer followed by a feed-forward fully-connected network (FFN) with skip-connection [he2016deep] and LayerNorm (LN)  [ba2016layer].
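The global attention block can be sketched as a standard transformer encoder layer. The version below is a hedged illustration: the pre-norm residual layout, the ReLU activation, the 4× hidden width, and the parameter names are assumptions, since the text only states that multi-head attention is followed by an FFN with skip-connection and LayerNorm. LayerNorm is shown without learned scale and shift for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, w_qkv, w_out, heads):
    n, d = x.shape
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)  # each (n, d)
    # Split channels into heads: (heads, n, d // heads).
    q, k, v = (t.reshape(n, heads, d // heads).transpose(1, 0, 2) for t in (q, k, v))
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d // heads))  # (heads, n, n)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ w_out


def global_attention_block(x, params, heads):
    # Pre-norm residual layout (assumption; the text does not fix the order).
    x = x + multi_head_self_attention(layer_norm(x), params["qkv"], params["out"], heads)
    hidden = np.maximum(layer_norm(x) @ params["fc1"], 0.0)  # ReLU stands in for the FFN activation
    return x + hidden @ params["fc2"]


rng = np.random.default_rng(0)
d, heads, n = 32, 4, 64
params = {
    "qkv": rng.normal(scale=0.1, size=(d, 3 * d)),
    "out": rng.normal(scale=0.1, size=(d, d)),
    "fc1": rng.normal(scale=0.1, size=(d, 4 * d)),
    "fc2": rng.normal(scale=0.1, size=(4 * d, d)),
}
y = global_attention_block(rng.normal(size=(n, d)), params, heads)
print(y.shape)
```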

Pyramidal structure Recall that , so one can view the input as having a height of , a width of and channels. We consider patches at coarser scales, so we apply a sequence (pyramid) of global attention blocks, but with the output of each block downsampled before applying the next attention block.

The downsampling operation consists of the application of a convolutional layer with a kernel, stride and padding of followed by a max-pooling operation with a kernel, stride and padding of . Assuming the output channel of the convolutional layer is , downsampling results in tokens of dimension , on which the next global attention block is applied.
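The effect of the downsampling step on the token grid can be checked with the usual output-size formula for convolution and pooling. The kernel, stride, and padding values below are hypothetical placeholders chosen so that each step halves the grid; they are not taken from the paper.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_side(size: int) -> int:
    # Hypothetical settings: 3x3 conv (stride 1, padding 1),
    # then 3x3 max-pool (stride 2, padding 1).
    after_conv = conv_out(size, kernel=3, stride=1, padding=1)
    return conv_out(after_conv, kernel=3, stride=2, padding=1)


side = 56  # illustrative starting grid
for _ in range(3):
    nxt = downsample_side(side)
    print(f"{side}x{side} -> {nxt}x{nxt} ({nxt * nxt} tokens)")
    side = nxt
```

Each halving of the grid side cuts the token count by four, so the cost of the following global attention block drops by roughly a factor of sixteen.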

We continue this way for times (chosen as a hyperparameter), obtaining an output with progressively lower resolution. Lastly, we apply global average pooling over the spatial dimension, resulting in a final vector of dimension . This is followed by a linear layer that outputs (number of classes) logits, on which standard softmax classification is applied.
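The final classification step (global average pooling, linear layer, softmax) can be sketched as below. The token count, channel width, and number of classes are illustrative values, and the function name is hypothetical.

```python
import numpy as np


def classification_head(tokens, weight, bias):
    """tokens: (n, d) final-stage tokens; weight: (d, classes); bias: (classes,)."""
    pooled = tokens.mean(axis=0)     # global average pooling over spatial positions
    logits = pooled @ weight + bias
    shifted = logits - logits.max()  # stabilise the softmax
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return logits, probs


rng = np.random.default_rng(0)
tokens = rng.normal(size=(49, 64))  # e.g. a 7x7 map of 64-dim tokens (illustrative)
logits, probs = classification_head(tokens, rng.normal(size=(64, 10)), np.zeros(10))
print(logits.shape, int(probs.argmax()))
```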

3.3 Implementation Details

In Tab. 2 we provide the exact architectures for two different image resolutions: 224×224 (used for ImageNet [deng2009imagenet]) and 32×32 (used for the CIFAR10 and CIFAR100 [krizhevsky2009learning] datasets). For 32×32 (CIFAR10/CIFAR100) images, we consider three model variants: tiny, small, and base, with an increasing number of parameters. For the 224×224 (ImageNet) resolution, tiny and small variants are considered (see Footnote 1).

Footnote 1: Note that while some previous works also report results for larger models, we were unable to allocate the resources needed for such experiments. Specifically, running our larger model on ImageNet would require more than 20 GPU months, using 32GB GPUs. This is not more demanding than previous work, e.g., [wang2021pyramid, ze2021swin, zhang2021aggregating]. However, these resources are not at our disposal at this time.

We use epochs for all experiments and use the same set of data augmentation and regularization strategies used by [touvron2021training] but exclude repeated augmentations [hoffer2020augment] and exponential moving average [polyak1992acceleration]. The initial learning rate is set to . We apply a linear warm-up of epochs for ImageNet and epochs for CIFAR10/CIFAR100. We scale the learning rate () according to the batch size () as: . We use the AdamW [kingma2014adam] optimizer with a cosine learning rate scheduler. The weight decay is set to and the maximal gradient norm is clipped to .
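The schedule described above (batch-size scaling, linear warm-up, cosine decay) can be sketched as follows. All numeric values and the exact scaling formula are placeholders: the linear scaling rule with a reference batch size is a commonly used form and is an assumption here, not the paper's stated formula.

```python
import math


def scaled_lr(base_lr: float, batch_size: int, reference_batch: int = 512) -> float:
    """Linear scaling rule (assumed form): lr grows in proportion to batch size."""
    return base_lr * batch_size / reference_batch


def lr_at_epoch(epoch: int, total_epochs: int, warmup_epochs: int, peak_lr: float) -> float:
    """Linear warm-up to peak_lr, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return peak_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


peak = scaled_lr(base_lr=5e-4, batch_size=1024)  # placeholder values
for epoch in (0, 4, 5, 50, 99):
    print(epoch, f"{lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5, peak_lr=peak):.6f}")
```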

[Table 2. Two panels: Input Resolution 224×224 (columns: Output size, Layer, Tiny, Small) and Input Resolution 32×32 (columns: Output size, Layer, Tiny, Small, Base). Rows, in order: Conv. emb. and Local Att.; repeated Global Att. stages; a final Global Att. with Avg Pool and a linear layer over the number of classes.]

Table 2: Architectures used for images with input resolution 224×224 (ImageNet) and for input resolution 32×32 (CIFAR10 and CIFAR100) for three model types: tiny, small and base. In step 1, we begin by applying a convolutional embedding (Conv. emb.) to the input image. A kernel size , stride and padding are used and the output has channels. A separate convolutional embedding is performed for each of the image shifting variants (indicated beside the curly bracket) and positional encoding is added for each patch and variant embedding. Subsequently, we project the image (Proj.) into a three-dimensional tensor, which is then used as input for our local attention block (Local Att.). From step 2 onward, we begin by applying a global attention block (Global Att.). We subsequently un-project the input and apply a downsampling operation (Down), as described in Sec. 3.2, where the resulting dimension is . denotes the number of heads used. A projection layer is then applied, converting the input into a two-dimensional input that can be used for subsequent global attention blocks. Each such step (Global Att. - Down. - Proj.) may be applied a number of times, as indicated by the number beside the curly bracket. In the last step, downsampling is replaced by average pooling (Avg. Pool) and projection is replaced by a linear layer (Linear), resulting in a logit vector. The final step (not shown) is a standard application of softmax, followed by a cross-entropy loss.

4 Experiments

We present multiple image classification experiments. Results are reported on three datasets: CIFAR10 [krizhevsky2009learning], CIFAR100 [krizhevsky2009learning], and ImageNet [deng2009imagenet]. Evaluation on the CIFAR10 and CIFAR100 datasets demonstrates the effectiveness of our method on low-resolution images, while evaluation on ImageNet demonstrates the effectiveness of our method on a higher resolution of 224×224.

We consider state-of-the-art convolution-based baselines as well as transformer-based baselines. Beyond DeiT [touvron2020deit], we also consider baselines that use a pyramidal architecture: PVT [wang2021pyramid], Swin [ze2021swin] and NesT [zhang2021aggregating]. CvT [wu2021cvt] is also considered for ImageNet (CIFAR10/CIFAR100 values not reported). Convolutional baselines include Pyramid-164-48 [han2017deep] and WRN28-10 [zagoruyko2016wide] for CIFAR10/CIFAR100 and ResNet50, ResNet101 [he2016deep], RegNetY-4GF and RegNetY-8GF [radosavovic2020designing] for ImageNet. We also perform an extensive ablation study for evaluating the contribution of each of our components.

CIFAR10/CIFAR100 As noted previously by [zhang2021aggregating], existing transformer-based methods usually perform poorly on such datasets. This is because self-attention methods are typically data-intensive, while in these datasets both the resolution and the number of samples are relatively small. In contrast, our method reintroduces the locality bias of CNNs by first applying local attention over neighbouring patches. By subsequently applying global attention in a pyramidal fashion, our method benefits from the introduction of correlations between distant patches gradually, at different scales. For CIFAR10 and CIFAR100 datasets, we use 18 possible local variants: {(0,0), (1,0), (0,1), (-1,0), (0,-1), (2,0), (0,2), (-2,0), (0,-2), (1,1), (1,2), (1,-1), (-1,1), (-1,2), (-1,-1), (2,1), (2,2), (2,-1)}.
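The shift set listed above can be checked programmatically. The snippet below copies the pairs from the text and verifies that they are distinct, that they all fall within the 5×5 grid of shifts with magnitude at most 2 (an inferred range, consistent with the "18 out of the possible 25" statement in the ablation study), and that exactly half of them are purely horizontal or vertical.

```python
from itertools import product

# The shift pairs listed above for CIFAR10/CIFAR100.
CIFAR_SHIFTS = [
    (0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (2, 0), (0, 2), (-2, 0), (0, -2),
    (1, 1), (1, 2), (1, -1), (-1, 1), (-1, 2), (-1, -1), (2, 1), (2, 2), (2, -1),
]

grid = set(product(range(-2, 3), repeat=2))          # all 25 shifts with |s| <= 2
axis_aligned = [s for s in CIFAR_SHIFTS if 0 in s]   # horizontal or vertical only

print(len(CIFAR_SHIFTS), len(set(CIFAR_SHIFTS)))     # 18 18
print(set(CIFAR_SHIFTS) <= grid)                     # True
print(len(axis_aligned))                             # 9
```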

The results for CIFAR10 and CIFAR100 experiments are shown in Table 3. As can be seen, our method is superior to both CNN-based and Transformer-based baselines for each model size (tiny, small and base). Already at a small model size of 10.2M parameters, our model achieves superior performance on both CIFAR10 and CIFAR100 over most baselines, many of which have a significantly larger number of parameters.

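| Type | Method | Params (M) | Throughput (images/s) | CIFAR10 (%) | CIFAR100 (%) |
|---|---|---|---|---|---|
| CNNs | Pyramid-164-48 | 1.7 | 3715.9 | 95.97 | 80.70 |
| CNNs | WRN28-10 | 36.5 | 1510.8 | 95.83 | 80.75 |
| Transformers | Deit-T | 5.3 | 1905.3 | 88.39 | 67.52 |
| Transformers | Deit-S | 21.3 | 734.7 | 92.44 | 69.78 |
| Transformers | Deit-B | 85.1 | 233.7 | 92.41 | 70.49 |
| Transformers | PVT-T | 12.8 | 1478.1 | 90.51 | 69.62 |
| Transformers | PVT-S | 24.1 | 707.2 | 92.34 | 69.79 |
| Transformers | PVT-B | 60.9 | 315.1 | 85.05 | 43.78 |
| Transformers | Swin-T | 27.5 | 1372.5 | 94.46 | 78.07 |
| Transformers | Swin-S | 48.8 | 2399.2 | 94.17 | 77.01 |
| Transformers | Swin-B | 86.7 | 868.3 | 94.55 | 78.45 |
| Transformers | Nest-T | 6.2 | 627.9 | 96.04 | 78.69 |
| Transformers | Nest-S | 23.4 | 1616.9 | 96.97 | 81.70 |
| Transformers | Nest-B | 90.1 | 189.8 | 97.20 | 82.56 |
| Transformers | Our-T | 10.2 | 1700.2 | 97.00 | 81.80 |
| Transformers | Our-S | 36.2 | 807.6 | 97.64 | 83.66 |
| Transformers | Our-B | 115.8 | 276.8 | 97.75 | 84.70 |

Table 3: Classification accuracy on CIFAR10 and CIFAR100. The number of parameters (in millions) and inference throughput (images per second) on a single GPU are shown.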

ImageNet In Tab. 4, we consider a comparison of our method with baselines on the ImageNet dataset, which consists of a much larger number of higher-resolution (224×224) images. Since the image resolution is 224×224, a fixed partition of the image into patches of size results in patches.

As the attention mechanism is quadratic in the number of tokens , is either chosen to be large (ViT, DeiT), or a subset of patches is used as tokens (Nest, Swin).

In contrast, our method considers, for each patch, 10 possible local variants: (0,0), (1,0), (2,0), (3,0), (0,1), (0,2), (0,3), (1,1), (2,2), (3,3). As a result, it can correctly capture details from its local neighborhood. Our local attention layer results in quadratic computation only in the number of variants. That is, the cost is . The cost of subsequent global attention layers is . Therefore, as long as , our method results in lower or equal computation cost, and it also considers local variants of each patch. As can be seen in Tab. 4, this results in a superior performance of our method in comparison to baselines for two different model sizes.
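The cost argument can be illustrated by counting attention score entries. The expressions below are assumed forms, written to match the statements that the local layer is quadratic only in the number of variants and that global attention is quadratic in the number of tokens; the patch count n = 56 × 56 is an illustrative choice, not a value from the paper.

```python
def attention_costs(n: int, m: int) -> dict[str, int]:
    """Count attention score entries, ignoring the embedding dimension.

    n: number of patches, m: number of shift variants per patch.
    """
    return {
        "local": n * m * m,     # assumed form: quadratic in variants, linear in patches
        "global": n * n,        # standard self-attention over n tokens
        "joint": (n * m) ** 2,  # attending over all n * m shifted tokens at once
    }


costs = attention_costs(n=56 * 56, m=10)
for name, value in costs.items():
    print(f"{name:>6}: {value:,}")
```

Under these assumed forms, the local term stays below the global term whenever m² ≤ n, while joint attention over all shifted tokens is larger than the global term by a factor of m².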

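| Group | Method | Params (M) | GFLOPs | Throughput (images/s) | Acc (%) |
|---|---|---|---|---|---|
| CNN | ResNet-50 | 25 | 3.9 | 1226.1 | 76.2 |
| CNN | ResNet-101 | 45 | 7.9 | 753.6 | 77.4 |
| CNN | RegNetY-4GF | 21 | 4.0 | 1156.7 | 80.0 |
| CNN | RegNetY-8GF | 39 | 8.0 | 591.6 | 81.7 |
| Comparable to Our-T | Deit-T | 5.7 | 1.3 | 2536.5 | 72.2 |
| Comparable to Our-T | PVT-T | 13.2 | 1.9 | - | 75.1 |
| Comparable to Our-T | Our-T | 9.9 | 3.1 | 797.2 | 76.0 |
| Comparable to Our-S | PVT-S | 24.5 | 3.8 | - | 79.8 |
| Comparable to Our-S | Deit-S | 22.0 | 4.6 | 940.4 | 79.8 |
| Comparable to Our-S | Swin-T | 29.0 | 4.5 | 755.2 | 81.3 |
| Comparable to Our-S | Nest-T | 17.0 | 5.8 | 633.9 | 81.5 |
| Comparable to Our-S | CVT-T | 20.0 | 4.5 | - | 81.6 |
| Comparable to Our-S | Our-S | 22.0 | 7.6 | 298.9 | 82.2 |

Table 4: Classification accuracy on the ImageNet validation set. The number of parameters (in millions), GFLOPs, and inference throughput in images per second on a single GPU are shown. Throughput is given for baselines for which this value has been reported. The first group lists CNN-based methods; the second and third groups list methods comparable in number of parameters to Our-T and Our-S, respectively.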

4.1 Ablation study

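| Variant | #Shifting | Conv. Variations | Pos. emb. per variant | Params (M) | Acc (%) |
|---|---|---|---|---|---|
| Baseline [zhang2021aggregating] | 0 | - | - | 6.2 | 96.04 |
| A (ours w/o shifting) | 0 | - | - | 6.8 | 96.50 |
| B | 9 | ? | ? | 8.4 | 96.70 |
| C | 18 | no | no | 6.8 | 96.69 |
| D | 18 | no | yes | 10.1 | 96.77 |
| E | 18 | yes | no | 7.6 | 96.85 |
| Full | 18 | yes | yes | 10.2 | 97.00 |

Table 5: Ablation analysis for examining the contribution of: (1) shifting variants, (2) number of shifting variants, (3) shifting variants generated using learned convolutional filters, and (4) different positional embeddings for each shifting variant.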
| Variant | Local block | CIFAR10 Acc (%) |
|---|---|---|
| I | Swin block [ze2021swin] | 96.3 |
| II | CvT [wu2021cvt] | 96.0 |
| Full | Ours | 97.00 |

Table 6: Analysing the contribution of each component.

An ablation study is performed for our method. The results are summarized in Tab. 5. To demonstrate the superiority of our method, we consider the current state of the art, NesT [zhang2021aggregating], as the baseline. First, we examine the importance of shifting, i.e., introducing shifting variants as input to the model and attending to them with a local attention layer. By removing the use of shifting variants and local attention layers, our method is simplified to only applying hierarchical global attention, as described in Sec. 3.2. As can be seen (variant A), in the tiny setting our method surpasses the baseline even without the use of shifting variants. Comparing the shiftless variant A to our full method (variant Full) demonstrates the performance gain achieved by adding shifting variants.

Next, we analyzed the effect of shifting the image with a different number of variants (#Shifting). As can be seen (variants A, B, and Full), increasing the number of shifting variants improves the model's performance. To reduce the number of shifts, we focus on shifts that are either on the horizontal or vertical lines, i.e., the 9 shifts {(0,0), (1,0), (0,1), (-1,0), (0,-1), (2,0), (0,2), (-2,0), (0,-2)}, or on a subset of the shifts in which we include diagonal lines as well, i.e., the 18 shifts out of the possible 25 listed in Sec. 4.

In Sec. 3.1 we describe the initial pre-processing step of embedding the patches and shifting variants, which is applied before the local attention block. We now compare two approaches to shifting: (i) constructing shifting variants as a pre-processing step, using image translation with reflection padding and (ii) passing the original image as input to the network, and applying convolutional layers as in Sec. 3.1 (as is done in our method). In Tab. 5, "Conv. Variations" indicates applying (ii) as opposed to (i). As can be seen, comparing variant C with variant E, and comparing variant D with our full method (variant Full), generating shifting variants using learned convolutional layers improves the model performance. Next, we checked whether it is necessary to add different positional embeddings to each shifting variant rather than simply learning one set of positional embeddings for all variants. As can be seen, comparing variant C with variant D and variant E with our full method (variant Full), there is a trade-off between parameters and accuracy. Adding positional embeddings for each variant improves the performance, but the number of parameters increases. In our experiments we apply different positional embeddings to each shifting variant. As can be seen, our complete method, variant Full, outperforms the baseline. Furthermore, as can be seen in Tab. 3, our tiny model (Our-T) outperforms the small model of the baseline methods (Deit-S, PVT-S, Swin-S and Nest-S).

In Tab. 6, we perform an additional ablation on the necessity of our local attention block. We consider variant I, in which we replace our local attention block with Swin's transformer block [ze2021swin]. While this variant achieves 96.3% accuracy on CIFAR10, Swin-T achieves 94.46% (see Tab. 3), demonstrating the importance of local attention for early global integration. In variant II, we replace our local attention block with the CvT local attention procedure [wu2021cvt], and make two modifications in our local attention block: (1) replace our patch embedding with Convolutional Token Embedding with overlapping patches and (2) replace the linear projection of the keys, queries and values in the self-attention blocks with Convolutional Projection. Our method outperforms both variants.

5 Limitations and Environmental Impact

We note that in order to train our models, as for other vision transformers, many GPU hours are required. This may result in a high environmental impact, especially in terms of carbon footprint.

6 Conclusions

While convolutional layers typically employ small strides, resulting in heavily overlapping patches, the token-based approach of recent transformer-based techniques has led to a grid view of the input image. This leads to the loss of the translation invariance that played a major role in the development of neural networks in computer vision, as well as in the study of biological vision models.


In this work, we reintroduce the locality bias of CNNs into a transformer-based architecture, already at the very first layer. This has the benefit of being able to model fine-detailed local correlations in addition to the coarse-detail global correlations for low-level features, which transformers model well. We employ two types of attention layers. The local attention layer models the correlation of a patch with its local shifting variants, thus modeling fine-grained correlations. The global attention layer, applied in a pyramidal manner, with decreasing input resolution, models long-range correlations. The use of local-global attention layers as opposed to a single attention layer is crucial for introducing the desired locality bias and capturing the correlation between neighbouring local shifts of each patch. This is especially useful for smaller datasets, with low-resolution images, such as CIFAR10/CIFAR100. Nevertheless, our method also scales well to large images, such as those of ImageNet.

We demonstrate the superiority of our method on both small-resolution inputs of 32×32 (CIFAR10/CIFAR100) and larger-resolution inputs of 224×224 (ImageNet). Our method achieves superior accuracy to other convolutional and transformer-based state-of-the-art methods, with a comparable number of parameters.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC CoG 725974).