LESA
Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms
Self-attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose it into local and context terms. They correspond to the unary and binary terms in CRFs and are implemented by attention mechanisms with projection matrices. We observe that the unary terms make only small contributions to the outputs, while standard CNNs, which rely solely on such local operations, achieve strong performance on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions and utilizes a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show the superiority of LESA over the convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation. The code is made publicly available.
Self-attention has had a great influence on the computer vision community recently. It led to the emergence of fully attentional models ramachandran2019stand ; wang2020axial and transformers vaswani2017attention ; dosovitskiy2020image ; carion2020end . Importantly, these models show superior performance over traditional convolutional neural networks on a variety of tasks including classification, object detection, segmentation, and image completion wang2020max ; srinivas2021bottleneck ; liu2021swin ; wan2021high .

Despite its remarkable achievements, the understanding of self-attention remains limited. One of its advantages is overcoming the limitation of spatial distances on dependency modelling. Originating from natural language processing, attention models the dependencies among the words in a sequence without regard to their distances, in contrast to LSTM hochreiter1997long and gated RNNs chung2014empirical . Applied to vision models, attention aggregates information globally among the pixels or patches dosovitskiy2020image ; wang2020axial . Similarly, compared to traditional convolutions, the features extracted by attention are no longer constrained by a local neighborhood.
We argue that global aggregation in self-attention also brings problems, because the aggregated features cannot clearly distinguish local and contextual cues. We study this from the perspective of Conditional Random Fields (CRFs) and decompose self-attention into local and context terms. The unary (local) and binary (context) terms are based on the same building blocks of queries, keys, and values, and are calculated using the same projection matrices. We hypothesize that using the same building blocks for the local and context terms causes problems, which relates to the weaknesses of the projections in self-attention pointed out by Dong et al. dong2021attention . They theoretically prove that the output of consecutive self-attention layers converges doubly exponentially to a rank-1 matrix and verify this degeneration in transformers empirically. They also show that skip connections can partially resolve the rank-collapse problem. In our CRF analysis, the skip connection creates the simplest local term, which amounts to an identity mapping. Skip connections alleviate the problem, but we argue that a local term with a stronger representation capacity needs to be designed.
In this paper, we enhance the unary term by integrating it with convolutions and propose Locally Enhanced Self-Attention (LESA), which is visualized in Fig. 1. To analyze self-attention from the perspective of CRFs, let x be the input and y the output of one layer of self-attention. Both of them are two-dimensional grids of nodes. At spatial location o, the output node y_o is connected to all the nodes of the input. The binary term involves the computation on the edges (x_p, y_o) with p ≠ o, while the unary term involves the computation on the edge (x_o, y_o). Intuitively, these two terms indicate the activation obtained by looking at the pixel itself (local) and at the others (context). Through the ablation study in Tab. 1, we find that the unary term is important for the performance even though it contributes only a small fraction of the output weight computed by the softmax operation in attention. Without the unary term, the feature extraction at o depends entirely on pairwise interactions and loses the precise information of that pixel. The structure of self-attention does not facilitate this unary operation. To address this issue, we enhance the unary term from the single edge (x_o, y_o) to a local neighborhood of o, and implement it as a grouped convolution followed by a projection layer.
To couple the unary and binary terms, we propose a dynamic fusion mechanism. The simplest static ways would be to assign equal weights to the two terms or to set their weights as hyperparameters. By contrast, we enable the model to allocate the weights on demand. Specifically, for each layer with the binary term, we multiply the binary term elementwise by a gate ω. ω depends on the input and dynamically controls the weight of the binary term relative to the unary term for different layers, spatial locations, and feature channels.
We study the performance of LESA for image classification, object detection, and instance segmentation. We replace the spatial convolutions with LESA in the last two stages of ResNet he2016deep and its larger variant WRN zagoruyko2016wide . Then, we use them, equipped with FPN lin2017feature , as the backbones in Mask R-CNN he2017mask to evaluate their performance for object detection and instance segmentation. The challenging large-scale datasets ILSVRC2012 russakovsky2015imagenet and COCO lin2014microsoft are used to train and evaluate the models. The experiments demonstrate the superiority of LESA over the convolution and self-attention baselines.
To summarize, the main contributions of this work are:
Analyzing self-attention from the perspective of fully connected CRFs, we decompose it into a pair of local (unary) and context (binary) terms. We observe that the unary terms make only small contributions to the outputs. Inspired by standard CNNs' focus on local cues, we propose to enhance the unary term by incorporating it with convolutions.
We propose a dynamic fusion module to couple the unary and binary terms adaptively. Their relative weights are adjusted as needed, depending on specific inputs, spatial locations, and feature channels.
We implement Locally Enhanced Self-Attention (LESA) for vision tasks. Our experiments on the challenging datasets ImageNet and COCO demonstrate that LESA is superior to the convolution and self-attention baselines. Especially for object detection and instance segmentation, where local features are particularly important, LESA achieves significant improvements.
We decompose self-attention into local and context terms. Let x ∈ R^{C×H×W} be the input, where C is the number of feature channels and H and W are the height and width of the spatial dimensions. In this case, each pixel is connected with all the other pixels in the computation. We consider the all-to-all self-attention since it has been adopted as a building layer and shows superior performance wang2020axial ; srinivas2021bottleneck . Specifically, we can write the formula of self-attention as:
y_o^l = \sum_{p} \operatorname{softmax}_p\left( q_o^{\top} k_p + q_o^{\top} r_{p-o} \right) v_p    (1)

where o = (i, j) and p = (a, b) represent the spatial locations of the pixels and l specifies the layer index. q_o = W_Q x_o, k_p = W_K x_p, and v_p = W_V x_p are the query, key, and value, which are obtained by applying three different 1×1 convolutions on x. W_Q, W_K ∈ R^{d_k × C} and W_V ∈ R^{d_out × C} are learnable parameters, where d_k and d_out are the intermediate and output channels. r_{p-o} is the relative position embedding between p and o. This formula shows that the activation y_o^l integrates the information conveyed by all the pixels x_p. To comprehend this operation, we decompose the information flow and reformulate the equation as the combination of a local term and a context term:
y_o^l = \underbrace{\operatorname{softmax}_{o}\left( q_o^{\top} k_o + q_o^{\top} r_{o-o} \right) v_o}_{\text{local (unary)}}    (2)

      + \underbrace{\sum_{p \neq o} \operatorname{softmax}_{p}\left( q_o^{\top} k_p + q_o^{\top} r_{p-o} \right) v_p}_{\text{context (binary)}}    (3)

where softmax_p denotes the p-th component of the softmax taken over all positions.
For the spatial location o, the first (local) term computes the activation by looking at the pixel itself, while the second (context) term computes the activation by looking at the others. The softmax generates the weights of contribution. Through this decomposition, we can interpret self-attention as a double-source feature extractor consisting of a pair of unary and binary terms. The unary and binary terms are computed from the queries, keys, and values at different spatial locations with shared projection matrices W_Q, W_K, and W_V. Consequently, the outputs entangle the local and context features.
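The decomposition above can be illustrated numerically. The following sketch (our own NumPy illustration, not the paper's code) computes the attention output at one position and splits it into the unary (p = o) and binary (p ≠ o) contributions of Equ. (2)-(3):

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy setting: attention at one query location o over N flattened pixels.
rng = np.random.default_rng(0)
N, d = 5, 4                       # number of pixels, channel dimension
q = rng.standard_normal(d)        # query at location o
K = rng.standard_normal((N, d))   # keys for all positions p
V = rng.standard_normal((N, d))   # values for all positions p
o = 2                             # index of the query's own position

w = softmax(K @ q)                          # attention weights over all p
unary = w[o] * V[o]                         # p == o (local) contribution
full = (w[:, None] * V).sum(0)              # complete attention output
binary = full - unary                       # sum over p != o (context)

assert np.allclose(unary + binary, full)    # decomposition is exact
print(f"unary weight w_oo = {w[o]:.3f}, binary weight = {1 - w[o]:.3f}")
```

The two contributions sum exactly to the full attention output, and w[o] is the per-pixel unary weight of the kind tracked in the ablation of Tab. 1.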
We perform an ablation study to investigate the contributions of these two terms. Specifically, we take a ResNet50 he2016deep and replace the convolution layers of its last two stages with self-attention. The model is trained from scratch on ImageNet russakovsky2015imagenet . During inference, we track the softmax operations of all self-attention layers and obtain the weights for the unary and binary terms, whose summation equals 1. By averaging them across all layers, we obtain the weighted contributions of the two terms. Then, we ablate the unary term in the evaluation phase. The results are shown in Tab. 1. We observe that self-attention is mainly contributed by the binary operations, but the unary term is also important: although the unary terms receive only a small weight percentage, removing them causes a clear drop in accuracy. In this decomposition, the unary term plays a significant role, yet most of the computation and focus is given to the binary operations.
| Method                       | Top-1 Error (%) | Weight Percentage (%) |
|------------------------------|-----------------|-----------------------|
| self-attention               |                 |                       |
| self-attention − unary term  |                 |                       |
Local and context terms have long been used in formulating graphical models for vision tasks such as image denoising, segmentation, and surface reconstruction prince2012computer . Fully connected Conditional Random Fields (CRFs) have been introduced on top of deep networks for semantic segmentation chen2017deeplab . They aim at coupling recognition capacity with localization accuracy and achieve excellent performance. For a grid of pixels in the form of a graph G = (V, E), the energy to be minimized by the CRF is defined by:
E(x) = \sum_{i \in V} \theta_i(x_i) + \sum_{i \neq j} \theta_{ij}(x_i, x_j)    (4)

where i and j indicate different vertices in V. The unary term is θ_i(x_i) = −log P(x_i), where P(x_i) is the probability of the model assigning the ground-truth label to pixel i. The binary term is θ_{ij}(x_i, x_j), which compares the contents and spatial positions of the two pixels through a probability density function that measures the similarity of two values and can be chosen as a Gaussian.
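As a concrete toy instance of Equ. (4), the sketch below evaluates the energy of a candidate labeling on three pixels. The unary probabilities, the one-dimensional features standing in for content and position, and the Gaussian bandwidth are invented for illustration only:

```python
import numpy as np

# Tiny fully connected CRF on 3 pixels with 2 labels (illustrative values).
# Unary cost = -log P(label); binary cost uses a Gaussian kernel on feature
# distance and fires only when the two labels disagree (Potts model).
P = np.array([[0.9, 0.1],       # P(label | pixel) for 3 pixels, 2 labels
              [0.6, 0.4],
              [0.2, 0.8]])
feats = np.array([0.0, 0.1, 1.0])   # 1-D stand-in for content + position
labels = np.array([0, 0, 1])        # a candidate labeling x

def energy(labels, P, feats, sigma=0.5):
    unary = -np.log(P[np.arange(len(labels)), labels]).sum()
    binary = 0.0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if labels[i] != labels[j]:  # penalize disagreeing neighbors
                binary += np.exp(-(feats[i] - feats[j]) ** 2
                                 / (2 * sigma ** 2))
    return unary + binary

print(f"E(x) = {energy(labels, P, feats):.3f}")
```

Minimizing this energy trades off per-pixel recognition (unary) against pairwise consistency between similar pixels (binary), which is exactly the split we carry over to self-attention.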
The unary term is utilized for recognition while the binary term models spatial and content interactions. Inspired by this formulation and by our decomposition analysis, we propose Locally Enhanced Self-Attention (LESA). It contains a unary term incorporated with convolutions and a binary term for feature interactions. Locally Enhanced Self-Attention is defined by
y_o^l = u_o^l + \omega \odot c_o^l    (5)

u_o^l = W_p \left( W_g \circledast x \right)_o    (6)
where ω is the weight that will be discussed in Sec. 2.3 and c_o^l is the context (binary) term of Equ. (3). u_o^l is the local term obtained by applying two consecutive convolutions. W_g is a learnable grouped-convolution kernel whose spatial extent and group number define the convolution, and W_p is a learnable projection matrix representing a 1×1 convolution. By this design, the multi-head mechanism is integrated. u_o^l is the unary activation at spatial location o. This formulation of LESA also enables us to change W_g to a deformable convolution dai2017deformable for the tasks of object detection and instance segmentation, as presented in Sec. 3.2. As discussed in Sec. 2.1, self-attention focuses on the binary operations. We use it as the context term to model the feature interactions with relative spatial relationships among all possible pairs of pixels.
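A minimal NumPy sketch of the unary term of Equ. (6), assuming a 3×3 grouped convolution followed by a 1×1 projection; the shapes, group count, and variable names below are our own choices for illustration:

```python
import numpy as np

# Unary term sketch: grouped 3x3 convolution W_g, then 1x1 projection W_p.
rng = np.random.default_rng(1)
C, G, H, W = 8, 2, 5, 5                  # channels, groups, height, width
x = rng.standard_normal((C, H, W))
Wg = rng.standard_normal((G, C // G, C // G, 3, 3))  # grouped 3x3 kernels
Wp = rng.standard_normal((C, C))                     # 1x1 projection

xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))             # 'same' padding
u = np.zeros_like(x)
for g in range(G):                                   # grouped convolution
    cs = slice(g * C // G, (g + 1) * C // G)         # this group's channels
    for i in range(H):
        for j in range(W):
            patch = xp[cs, i:i + 3, j:j + 3]         # (C/G, 3, 3) window
            u[cs, i, j] = np.einsum('ocij,cij->o', Wg[g], patch)
u = np.einsum('oc,chw->ohw', Wp, u)                  # 1x1 projection
print(u.shape)  # → (8, 5, 5): one unary activation vector per pixel
```

Because each group only mixes its own channel slice before the shared projection, the groups play the role of heads, which is how the multi-head mechanism is integrated into the unary term.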
Adding the unary and binary terms is a static way of merging them with equal weights. A more flexible strategy is to allocate the weights on demand under different circumstances. For example, in object detection, the locality of pixel dependencies matters more than the context when detecting multiple small objects in an image. We achieve dynamic control by multiplying the binary term by ω and adaptively adjusting the relative weights of the two terms, as shown in Equ. (5). Specifically, we can write the formula of ω as:
\omega = \operatorname{sigmoid}\left( f(u^l, c^l) \right)    (7)

where ω has the same spatial and channel dimensions as the binary term and each of its elements corresponds to one spatial location and feature channel. f is a learnable function; the sigmoid operation is performed elementwise on the logits given by f, making ω range from 0 to 1. Regarding f, we design it as a three-layer perceptron and adopt the pre-activation design he2016identity , where BN is a batch normalization layer ioffe2015batch and FC1, FC2 are two fully connected layers, followed by the sigmoid. In our design, ω depends on the contents of the unary and binary terms and controls their relative weights at different spatial locations and in different feature channels. It is our principal way to fuse the unary and binary terms.
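The fusion gate can be sketched as follows. The MLP input (a concatenation of the unary and binary terms), the layer sizes, and the exact BN/ReLU ordering are our assumptions for illustration, not the paper's exact specification:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bn(z, eps=1e-5):
    # batch-norm stand-in: whiten each feature over the batch dimension
    return (z - z.mean(0)) / (z.std(0) + eps)

# Gate omega = sigmoid(MLP([u; c])), applied elementwise to the binary
# term before adding the unary term (Equ. 5).
rng = np.random.default_rng(2)
n, d = 6, 4                           # pixels, channels
u = rng.standard_normal((n, d))       # unary (local) term
c = rng.standard_normal((n, d))       # binary (context) term
W1 = rng.standard_normal((2 * d, d))  # FC1: concat(u, c) -> hidden
W2 = rng.standard_normal((d, d))      # FC2: hidden -> per-channel logits

h = np.maximum(bn(np.concatenate([u, c], axis=1) @ W1), 0)  # FC1 -> BN -> ReLU
omega = sigmoid(h @ W2)               # gate in (0, 1) per pixel and channel
y = u + omega * c                     # dynamically fused output

assert omega.shape == (n, d) and ((0 < omega) & (omega < 1)).all()
```

Because omega is recomputed from the current unary and binary activations, the trade-off between local and context evidence varies per input, per location, and per channel rather than being a fixed hyperparameter.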
| Models   | Operations     | Params (M) | Top-1 (%) | Top-5 (%) | Unary Weight (%) | Binary Weight (%) |
|----------|----------------|------------|-----------|-----------|------------------|-------------------|
| ResNet50 | Convolution    |            |           |           |                  |                   |
| ResNet50 | Self-Attention |            |           |           |                  |                   |
| ResNet50 | LESA           |            |           |           |                  |                   |
| WRN50    | Convolution    |            |           |           |                  |                   |
| WRN50    | Self-Attention |            |           |           |                  |                   |
| WRN50    | LESA           |            |           |           |                  |                   |
| Backbone | Operations       | Epochs |
|----------|------------------|--------|
| ResNet50 | Convolution      |        |
| ResNet50 | Self-Attention   |        |
| ResNet50 | LESA             |        |
| WRN50    | Convolution      |        |
| WRN50    | Self-Attention   |        |
| WRN50    | LESA             |        |
| ResNet50 | Deformable Conv. |        |
| ResNet50 | LESA             |        |
| WRN50    | Deformable Conv. |        |
| WRN50    | LESA             |        |
| WRN50    | HTC Conv.        |        |
| WRN50    | LESA             |        |
| WRN50    | HTC Conv_H       |        |
| WRN50    | LESA_H           |        |
| Backbone | Operations       | Epochs |
|----------|------------------|--------|
| ResNet50 | Convolution      |        |
| ResNet50 | Self-Attention   |        |
| ResNet50 | LESA             |        |
| WRN50    | Convolution      |        |
| WRN50    | Self-Attention   |        |
| WRN50    | LESA             |        |
| ResNet50 | Deformable Conv. |        |
| ResNet50 | LESA             |        |
| WRN50    | Deformable Conv. |        |
| WRN50    | LESA             |        |
| WRN50    | HTC Conv.        |        |
| WRN50    | LESA             |        |
| WRN50    | HTC Conv_H       |        |
| WRN50    | LESA_H           |        |
| Backbone | Operations   | Epochs |
|----------|--------------|--------|
| ResNet50 | LESA         |        |
| WRN50    | LESA         |        |
| ResNet50 | LESA         |        |
| WRN50    | LESA         |        |
| WRN50    | HTC + LESA   |        |
| WRN50    | HTC + LESA_H |        |
| Backbone | Operations   | Epochs |
|----------|--------------|--------|
| ResNet50 | LESA         |        |
| WRN50    | LESA         |        |
| ResNet50 | LESA         |        |
| WRN50    | LESA         |        |
| WRN50    | HTC + LESA   |        |
| WRN50    | HTC + LESA_H |        |
Settings We perform image classification experiments on ILSVRC2012 russakovsky2015imagenet , a popular subset of the ImageNet database deng2009imagenet . There are 1.28 million images in the training set and 50,000 images in the validation set. In total, it includes 1,000 object classes, each of which has approximately the same number of training images and exactly the same number of validation images.
ResNet he2016deep , a family of canonical models and backbones for vision tasks, and its larger variant WRN zagoruyko2016wide are used to study LESA. There are 4 stages in ResNet, and each one is formed by a series of bottleneck blocks. ResNet50 can be represented by the bottleneck numbers (3, 4, 6, 3). We replace the 3×3 convolution in the bottleneck with self-attention and LESA. The kernel channels of these 3×3 convolutions in WRN are twice as large as those in ResNet.
We perform the replacement in the last two stages, which is enough to show the advantages of LESA. For the convolution baselines, we use the official Torchvision models paszke2019pytorch . For the self-attention baselines and our LESA, we use the same head number for both and train the models from scratch. We set the stride of the last stage to 1 following srinivas2021bottleneck , and keep the first bottleneck in stage 3, which contains the strided convolution, unchanged. We employ a canonical training scheme with linear warmup followed by the main training epochs. Following ramachandran2019stand ; wang2020axial , we employ SGD with Nesterov momentum nesterov1983method ; sutskever2013importance and cosine annealing of the learning rate. The experiments are performed on NVIDIA TITAN XP graphics cards.

Results The results are summarized in Tab. 2. For both the top-1 and top-5 accuracy, LESA surpasses the convolution and self-attention baselines. Our dynamic fusion module controls the binary term using ω in Equ. (7), so the relative weights of the unary and binary terms are 1 and ω, respectively. As ω depends on the inputs, spatial locations, and feature channels, we average the weights across them in our records. In self-attention, the weights are calculated by the softmax operations, as in the ablation study of Sec. 2.1. We observe that the weight distribution in self-attention is imbalanced: the unary term receives only a small weight percentage, many times smaller than the binary term's, whereas in LESA the two terms receive much more balanced weights. In the tasks of object detection, where local cues are particularly important, LESA shows a larger improvement, as presented in Sec. 3.2.
Settings We perform object detection experiments on the COCO dataset lin2014microsoft and use the 2017 dataset splits. There are 118K images in the training set and 5K images in the validation set. In total there are 80 object categories, and on average each image contains 3.5 categories and 7.7 instances.
The widely used Mask R-CNN he2017mask and HTC chen2019hybrid , with the backbones equipped with FPN lin2017feature , are used to study LESA for object detection and instance segmentation. We use mmdetection chen2019mmdetection as the codebase. The ImageNet pretrained checkpoints are utilized to initialize the backbones. There are 5 stages in the ResNet-FPN, and the output strides are (4, 8, 16, 32, 64). We replace the spatial convolutions in the 3rd and 4th stages. Since the image size in classification is 224×224 while detection uses larger inputs, we initialize new position embedding layers as used in shaw2018self ; ramachandran2019stand . For training, we employ the 1× and 2× schedules with their standard epoch counts and learning-rate decay points. For the HTC framework, we employ multi-scale training for both the baseline and our method: for each image, one of two scale ranges is selected with a fixed probability, and the training scale is then sampled uniformly from the chosen range. Mask R-CNN does not use multi-scale training.
We also study adopting deformable unary terms in LESA. Specifically, we replace W_g in Equ. (6) with deformable convolutions dai2017deformable . Following the standard setting qiao2020detectors , the convolutions in the 2nd stage in both the baselines and our models are also replaced with deformable convolutions. Our experiments with the Mask R-CNN framework are performed on NVIDIA TITAN XP graphics cards and those with the HTC framework on TITAN RTX graphics cards.
Results The results are summarized in Tab. 3, 4, 5, and 6. We use the same testing pipeline for val2017 and test-dev2017. LESA provides the best bounding-box mAP and mask mAP for small, medium, and large objects compared with the convolution, self-attention, and deformable convolution baselines in all scenarios.
Settings In this section, we perform ablation studies to investigate the unary term and the dynamic fusion module. Static LESA adds the unary and binary terms with fixed equal weights, without regard to the inputs, spatial locations, and feature channels. Besides, we use the grouped convolution, which has more parameters and representational power, as the unary term for static LESA. Specifically, we take the pretrained ResNet50 he2016deep and replace the spatial convolutions in the last stage. During training for image classification, we freeze the first three stages and adjust the training length to 2 warmup and 35 training epochs. The other settings follow Sec. 3.2.
Results The results are summarized in Tab. 7. Both static LESA and LESA benefit from the presence of the unary term and outperform the other baselines in object detection and instance segmentation. For detecting small and large objects, LESA behaves better than the static variant. In classification, the advantage of dynamic fusion is even clearer. Both the unary term and the dynamic fusion mechanism are important parts of LESA.
Convolutional Neural Networks (CNNs) have become the dominant models in computer vision over the last decade. AlexNet krizhevsky2012imagenet showed considerable improvement over models based on handcrafted features perronnin2010improving ; sanchez2011high and opened the door to the age of deep neural networks. Many efforts have been made to increase the width and depth and to improve the architecture and efficiency of CNNs in the pursuit of performance. They include the designs of VGG simonyan2014very , GoogLeNet szegedy2015going , ResNet he2016deep , WRN zagoruyko2016wide , ResNeXt xie2017aggregated , DenseNet huang2017densely , SENet hu2018squeeze , MobileNet howard2017mobilenets , EfficientNet tan2019efficientnet , etc. Through this process, the convolution layers themselves have also been developed, leading to grouped convolutions xie2017aggregated , depthwise separable convolutions chollet2017xception , deformable convolutions dai2017deformable ; zhu2019deformable , atrous convolutions chen2014semantic ; papandreou2015modeling , and switchable atrous convolutions qiao2020detectors ; chen2020scaling .

The impact of self-attention on the vision community keeps growing. Self-attention was originally proposed in approaches to neural machine translation bahdanau2014neural . It enables the encoder-decoder model to adaptively find the useful information, according to the contents, from a variable-length sentence. In computer vision, non-local neural networks wang2018non show that self-attention is an instantiation of non-local means buades2005non and use it to capture long-range dependencies to augment CNNs for tasks including video classification and object detection. A2-Net chen20182 employs a variant of non-local means, and Attention Augmentation bello2019attention augments the convolution features with attention features, both of which show performance improvements on image classification.
Recently, fully attentional methods ramachandran2019stand ; hu2019local ; zhao2020exploring , which replace all the spatial convolutions in deep networks with self-attention, have been proposed with stronger performance than CNNs. Axial attention wang2020axial factorizes 2D self-attention into two consecutive 1D self-attentions, which reduces the computational complexity and enables the self-attention layer to have a global kernel. Self-attention also promoted the emergence of transformers vaswani2017attention ; dosovitskiy2020image ; touvron2020training ; carion2020end ; wu2020lite ; liu2021swin . BotNet srinivas2021bottleneck relates the transformer block to the fully attentional version of the bottleneck block in ResNet he2016deep .
From the perspective of fully connected Conditional Random Fields (CRFs), we decouple self-attention into local and context terms. They are the unary and binary terms calculated by the queries, keys, and values in the attention mechanism. However, there is no clear distinction between the local and context cues, as they are obtained using the same set of projection matrices. In addition, we observe that the contribution of the local terms, which is controlled by the softmax operation, is very small. By contrast, standard Convolutional Neural Networks (CNNs) rely solely on local terms and show excellent performance in various vision tasks.
In this work, we propose Locally Enhanced Self-Attention (LESA). First, we enhance the unary term by incorporating it with convolutions; the multi-head mechanism is realized by a grouped convolution followed by a projection layer. Second, we propose a dynamic fusion module to combine the unary and binary terms, whose relative weights change adaptively with specific inputs, spatial locations, and feature channels. We replace self-attention with LESA and perform experiments on the challenging large-scale datasets ImageNet and COCO. All the results demonstrate the superiority of LESA over the convolution and self-attention baselines in the tasks of image classification, object detection, and instance segmentation.
LESA shares a common limitation with self-attention: large memory consumption. This is due to the large dimensions of the similarity matrix, which is computed from the queries and keys and on which the softmax operation is applied. Our future work includes designing a LESA that consumes less memory while retaining the power of capturing context information, which would also address the common memory issue in other self-attention models. Like convolution and self-attention, LESA is a technical tool that does not introduce any additional foreseeable societal problems. It helps improve vision models, and there is no specific new risk.
References

A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 60–65. IEEE, 2005.

F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.