Locally Enhanced Self-Attention: Rethinking Self-Attention as Local and Context Terms

07/12/2021 ∙ by Chenglin Yang, et al. ∙ Johns Hopkins University

Self-Attention has become prevalent in computer vision models. Inspired by fully connected Conditional Random Fields (CRFs), we decompose it into local and context terms. They correspond to the unary and binary terms in CRFs and are implemented by attention mechanisms with projection matrices. We observe that the unary terms only make small contributions to the outputs, while standard CNNs, which rely solely on the unary terms, achieve strong performance on a variety of tasks. Therefore, we propose Locally Enhanced Self-Attention (LESA), which enhances the unary term by incorporating it with convolutions and utilizes a fusion module to dynamically couple the unary and binary operations. In our experiments, we replace the self-attention modules with LESA. The results on ImageNet and COCO show the superiority of LESA over convolution and self-attention baselines for the tasks of image recognition, object detection, and instance segmentation. The code is made publicly available.

1 Introduction

Self-Attention has greatly influenced the computer vision community in recent years. It has led to the emergence of fully attentional models ramachandran2019stand ; wang2020axial and transformers vaswani2017attention ; dosovitskiy2020image ; carion2020end . Importantly, these models show superior performance over traditional convolutional neural networks on a variety of tasks including classification, object detection, segmentation, and image completion wang2020max ; srinivas2021bottleneck ; liu2021swin ; wan2021high .

Despite its remarkable achievements, the understanding of self-attention remains limited. One of its advantages is overcoming the limitation that spatial distance places on dependency modelling. Originating in natural language processing, attention models dependencies regardless of the distances among the words in a sequence, in contrast to LSTMs hochreiter1997long and gated RNNs chung2014empirical . When applied to vision models, attention aggregates information globally among the pixels or patches dosovitskiy2020image ; wang2020axial . Similarly, compared to traditional convolutions, the features extracted by attention are no longer constrained to a local neighborhood.

We argue that the global aggregation in self-attention also brings problems, because the aggregated features cannot clearly distinguish local and contextual cues. We study this from the perspective of Conditional Random Fields (CRFs) and decompose self-attention into local and context terms. The unary (local) and binary (context) terms are based on the same building blocks of queries, keys, and values, and are calculated using the same projection matrices. We hypothesize that using the same building blocks for the local and context terms causes problems, which relates to the weaknesses of the projections in self-attention pointed out by Dong et al. dong2021attention . They theoretically prove that the output of consecutive self-attention layers converges doubly exponentially to a rank-1 matrix and verify this degeneration in transformers empirically. They also claim that skip connections can partially resolve the rank-collapse problem. In our CRF analysis, the skip connection creates the simplest local term, which amounts to an identity mapping. Skip connections alleviate the problem, but we argue that a local term with a stronger representational capacity needs to be designed.

In this paper, we enhance the unary term by integrating it with convolutions and propose Locally Enhanced Self-Attention (LESA), which is visualized in Fig. 1. To analyze self-attention from the perspective of CRFs, let x be the input and y the output of one layer of self-attention. Both are two-dimensional grids of nodes. At spatial location (i,j), the output node y_{i,j} is connected to all the nodes of the input. The binary term involves the computation on the edges (y_{i,j}, x_{a,b}) with (a,b) ≠ (i,j), while the unary term involves the computation on the edge (y_{i,j}, x_{i,j}). Intuitively, these two terms indicate the activation obtained by looking at the pixel itself (local) and at the others (context). Through the ablation study in Tab. 1, we find that the unary term is important for the performance but receives only a small fraction of the output weight computed by the softmax operation in attention. Without the unary term, the feature extraction at (i,j) depends entirely on interactions with other locations and loses the precise information of that pixel. The structure of self-attention does not facilitate this unary operation. To address this issue, we enhance the unary term to involve the edges (y_{i,j}, x_{a,b}) where (a,b) indicate the pixels in a local neighborhood of (i,j), and implement it as a grouped convolution followed by a projection layer.

To couple the unary and binary terms, we propose a dynamic fusion mechanism. The simplest static ways would be to assign equal weights to them or to set their weights as hyper-parameters. By contrast, we enable the model to allocate the weights on demand. Specifically, for each layer with a binary term, we multiply the binary term element-wise by a weight ω. ω depends on the input and dynamically controls the weight of the binary term relative to the unary term for different layers, spatial locations, and feature channels.

We study the performance of LESA for image classification, object detection, and instance segmentation. We replace the spatial convolutions with LESAs in the last two stages of ResNet he2016deep and its larger variant WRN zagoruyko2016wide . Then, we use them equipped with FPN lin2017feature as the backbones in Mask-RCNN he2017mask to evaluate their performance for object detection and instance segmentation. The challenging large-scale datasets ILSVRC2012 russakovsky2015imagenet and COCO lin2014microsoft are used to train and evaluate the models. The experiments demonstrate the superiority of LESA over the convolution and self-attention baselines.

To summarize, the main contributions of this work are:

  • Analyzing self-attention from the perspective of fully connected CRFs, we decompose it into a pair of local (unary) and context (binary) terms. We observe the unary terms make small contributions to the outputs. Inspired by the standard CNNs’ focus on the local cues, we propose to enhance the unary term by incorporating it with convolutions.

  • We propose a dynamic fusion module to couple the unary and binary terms adaptively. Their relative weights are adjusted as needed, depending on specific inputs, spatial locations, and feature channels.

  • We implement Locally Enhanced Self-Attention (LESA) for vision tasks. Our experiments on the challenging ImageNet and COCO datasets demonstrate that LESA is superior to the convolution and self-attention baselines. Especially for object detection and instance segmentation, where local features are particularly important, LESA achieves significant improvements.

Figure 1: Visualizing the proposed Locally Enhanced Self-Attention (LESA) at one spatial location. The left part is the visualization in the spatial dimensions, while the right part shows the operation pipeline. In both figures, the blue connectors with double arrowheads represent the binary operations, while the red connectors with single arrowheads represent the unary operations. The nodes with black edges represent the input pixels, some of which are omitted for simplicity. The node filled with gray represents the pixel at the current spatial location. Analyzing self-attention from the perspective of fully connected CRFs, we find that the unary term is important for the performance but only makes small contributions to the outputs. Therefore, we propose to enhance it by integrating it with convolutions, as shown by the dashed red lines. To couple the unary and binary terms, we design a fusion module that dynamically adjusts their relative weights depending on the inputs, spatial locations, and feature channels.

2 LESA: Locally Enhanced Self-Attention

2.1 Decomposition of Self-Attention

We decompose self-attention into local and context terms. Let x ∈ R^{C_in × H × W} be the input, where C_in is the number of feature channels and H and W are the height and width in the spatial dimensions. In this case, each pixel is connected with all the other pixels in the computation. We consider all-to-all self-attention since it has been adopted as a building layer and shows superior performance wang2020axial ; srinivas2021bottleneck . Specifically, we can write the formula of self-attention as:

y^l_{i,j} = \sum_{a,b} \mathrm{softmax}_{a,b}\left( q_{i,j}^\top k_{a,b} + q_{i,j}^\top r_{a,b} \right) v_{a,b}    (1)

where (i,j) and (a,b) represent the spatial locations of the pixels and l specifies the layer index. q = W_Q x, k = W_K x, and v = W_V x are the query, key, and value, which are obtained by applying three different convolutions on x. W_Q, W_K ∈ R^{C_m × C_in} and W_V ∈ R^{C_out × C_in} are learnable parameters, where C_m and C_out are the intermediate and output channels. r^{i,j}_{a,b} is the relative position embedding, and for simplicity we will use the notation r_{a,b}. This formula shows that the activation y^l_{i,j} integrates the information conveyed by all the pixels x_{a,b}. To comprehend this operation, we decompose the information flow and reformulate the equation as the combination of a local term and a context term:

y^l_{i,j} = \mathrm{softmax}_{i,j}\left( q_{i,j}^\top k_{i,j} + q_{i,j}^\top r_{i,j} \right) v_{i,j}    (2)
         + \sum_{(a,b) \neq (i,j)} \mathrm{softmax}_{a,b}\left( q_{i,j}^\top k_{a,b} + q_{i,j}^\top r_{a,b} \right) v_{a,b}    (3)

For the spatial location (i,j), the first (local) term computes the activation by looking at the pixel itself, while the second (context) term computes the activation by looking at the others. The softmax generates the weights of their contributions. Through this decomposition, we can interpret self-attention as a double-source feature extractor, which consists of a pair of unary and binary terms. The unary and binary terms are computed from the queries, keys, and values at different spatial locations with the shared projection matrices W_Q, W_K, and W_V. Consequently, the outputs entangle the local and context features.
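To make the decomposition concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a single-head all-to-all self-attention layer whose softmax weights are explicitly split into the unary (diagonal) and binary (off-diagonal) contributions of Equ. (2) and (3); the relative position term, the multi-head mechanism, and the exact scaling convention are simplified assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedSelfAttention2d(nn.Module):
    """Single-head all-to-all self-attention whose output is returned as the
    sum of a unary (local) term and a binary (context) term.
    Relative position embeddings are omitted for brevity."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.q = nn.Conv2d(in_ch, mid_ch, 1)   # W_Q as a 1x1 convolution
        self.k = nn.Conv2d(in_ch, mid_ch, 1)   # W_K
        self.v = nn.Conv2d(in_ch, out_ch, 1)   # W_V

    def forward(self, x):
        B, _, H, W = x.shape
        N = H * W
        q = self.q(x).flatten(2).transpose(1, 2)               # (B, N, mid_ch)
        k = self.k(x).flatten(2)                                # (B, mid_ch, N)
        v = self.v(x).flatten(2).transpose(1, 2)                # (B, N, out_ch)
        attn = F.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)    # (B, N, N)

        eye = torch.eye(N, device=x.device).unsqueeze(0)        # (1, N, N)
        unary = (attn * eye) @ v                                # Equ. (2): attend to self
        binary = (attn * (1.0 - eye)) @ v                       # Equ. (3): attend to others

        y = unary + binary                                      # Equ. (1)
        return y.transpose(1, 2).reshape(B, -1, H, W), attn
```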

We perform an ablation study to investigate the contribution of these two terms. Specifically, we take a ResNet50 he2016deep and replace the convolution layers of its last two stages with self-attention. The model is trained from scratch on ImageNet russakovsky2015imagenet . During inference, we track the softmax operations of all self-attention layers and obtain the weights of the unary and binary terms, w_u and w_b, whose summation equals 1. Averaging them across all layers gives the weighted contributions of the two terms. Then, we ablate the unary term in the evaluation phase. The results are shown in Tab. 1. We observe that self-attention is dominated by the binary operations, but the unary term is also important: although the unary terms receive only a small fraction of the softmax weight, removing them causes a clear drop in accuracy, i.e., a notable relative increase in the error rate. When analyzing self-attention through this decomposition, the unary term plays a significant role, yet most of the computation and focus is given to the binary operations.
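The weight statistic reported in Tab. 1 can be illustrated with a small helper (the exact instrumentation used in the paper is not shown here): the unary share is simply the diagonal mass of the tracked attention map, averaged over locations and batches.

```python
def unary_weight_percentage(attn):
    """attn: (B, N, N) softmax attention map of one self-attention layer,
    e.g. the second output of DecomposedSelfAttention2d above.
    Returns the average share of weight a location assigns to itself."""
    diag = attn.diagonal(dim1=-2, dim2=-1)   # (B, N): softmax weight of the unary term
    return 100.0 * diag.mean().item()        # the binary share is 100 minus this value
```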

Method Top-1 Error (%) Weight Percentage (%)
self-attention
self-attention - unary term
Table 1: Contributions of the unary term in self-attention. We replace the spatial convolutions in the 3rd and 4th stages of ResNet50 he2016deep with self-attention. By tracking the softmax operations, we record the weights of the unary and binary terms, w_u and w_b, in Equ. (2) and (3). They add up to 1 at each layer. The weight percentage is the average across all the layers. We observe that the unary term is important: although its weight percentage is small, removing it clearly increases the error rate.

2.2 Locally Enhanced Self-Attention

Local and context terms have long been used in formulating graphical models for vision tasks, such as image denoising, segmentation, and surface reconstruction prince2012computer . Fully connected Conditional Random Fields (CRFs) have been introduced on top of deep networks for semantic segmentation chen2017deeplab , aiming to couple recognition capacity with localization accuracy, and achieve excellent performance. For a grid of pixels in the form of a graph G = (V, E), the energy to be minimized by the CRF is defined by:

E(x) = \sum_{i} \theta_i(x_i) + \sum_{i,j} \theta_{i,j}(x_i, x_j)    (4)

where i and j indicate the different vertices in G and x is the label assignment. The unary term is \theta_i(x_i) = -\log P(x_i), where P(x_i) is the probability of the model assigning the ground-truth label to vertex i. The binary term is \theta_{i,j}(x_i, x_j) = \mu(x_i, x_j) k(f_i, f_j), where the features f contain the contents and spatial positions of the pixels. k is the probability density function that measures the similarity of two values, which can be chosen to be Gaussian.

The unary term is utilized for recognition, while the binary term is used for spatial and content interactions. Inspired by these observations and our decomposition analysis, we propose Locally Enhanced Self-Attention (LESA). It contains a unary term incorporated with convolutions and a binary term for feature interactions. Locally Enhanced Self-Attention is defined by

y^l_{i,j} = u^l_{i,j} + \omega \odot b^l_{i,j}    (5)
u^l_{i,j} = \left( W_p (W_g * x) \right)_{i,j}    (6)

where ω is the weight that will be discussed in Sec. 2.3 and b^l_{i,j} is the self-attention term of Equ. (1), which serves as the context term. u^l_{i,j} is the local term obtained by applying two consecutive convolutions: W_g is the learnable matrix of a grouped convolution, whose spatial extent and group number are hyper-parameters, and W_p is a learnable projection matrix representing a 1x1 convolution. By this design, the multi-head mechanism is integrated. u^l_{i,j} is the unary activation at spatial location (i,j). This formulation of LESA also enables us to change W_g to a deformable convolution dai2017deformable for the tasks of object detection and instance segmentation, as presented in Sec. 3.2. As discussed in Sec. 2.1, self-attention focuses on the binary operations. We use it as the context term to model the feature interactions, with relative spatial relationships, among all possible pairs of pixels.
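The unary branch of Equ. (6) can be sketched as follows, assuming a 3x3 spatial extent and 8 groups for W_g and a batch-normalized 1x1 projection for W_p; these hyper-parameters are illustrative choices, not the released configuration.

```python
import torch.nn as nn

class LocalTerm(nn.Module):
    """Unary branch of LESA: a grouped k x k convolution (W_g) followed by a
    1x1 projection (W_p), mirroring Equ. (6). Kernel size, group number, and
    the use of batch normalization are assumptions made for illustration."""
    def __init__(self, in_ch, out_ch, kernel_size=3, groups=8):
        super().__init__()
        self.grouped = nn.Conv2d(in_ch, in_ch, kernel_size,
                                 padding=kernel_size // 2, groups=groups, bias=False)
        self.project = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return self.bn(self.project(self.grouped(x)))   # u in Equ. (6)
```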

2.3 Dynamic Fusion of the Unary and Binary Terms

Adding the unary and binary terms directly is a static way of merging them with equal weights. A more flexible strategy is to allocate the weights on demand under different circumstances. For example, in object detection, the locality of pixel dependencies is more important than the context when detecting multiple small objects in an image. We achieve dynamic control by multiplying the binary term by ω and adaptively adjusting the relative weights of the two terms, as shown in Equ. (5). Specifically, we can write the formula of ω as:

\omega_{i,j} = \mathrm{sigmoid}\left( f(u^l_{i,j}, b^l_{i,j}) \right)    (7)

where u^l_{i,j} and b^l_{i,j} correspond to one spatial location and f is a learnable function. The sigmoid operation is performed element-wise on the logits given by f, making ω range from 0 to 1. Regarding f, we design it as a three-layer perceptron and adopt the pre-activation design he2016identity . Concretely, together with the sigmoid we can represent the pipeline as BN → ReLU → W_{f1} → BN → ReLU → W_{f2} → sigmoid, where BN is the batch normalization layer ioffe2015batch and W_{f1}, W_{f2} are two fully connected layers.

In our design, ω is dependent on the contents of the unary and binary terms and controls their relative weights at different spatial locations and in different feature channels. It is our principal way to fuse the unary and binary terms.
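A minimal sketch of such a fusion module is given below, assuming that f operates on the channel-wise concatenation of the unary and binary terms and that the hidden width is set by a reduction ratio; both are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Predicts the per-location, per-channel weight w of Equ. (7) from the
    unary term u and binary term b, and returns u + w * b as in Equ. (5)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.f = nn.Sequential(                      # pre-activation design
            nn.BatchNorm2d(2 * channels), nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels, hidden, 1),      # first fully connected layer (1x1 conv)
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),          # second fully connected layer (1x1 conv)
        )

    def forward(self, u, b):
        w = torch.sigmoid(self.f(torch.cat([u, b], dim=1)))  # w in (0, 1)
        return u + w * b
```

The 1x1 convolutions act as per-location fully connected layers, so the predicted weight varies with the input content, the spatial location, and the feature channel, as described above.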

3 Experiments

3.1 ImageNet Classification

Models Operations Params (M) Accuracy (%) Weights (%)
Top-1 Top-5 Unary Binary
ResNet50 Convolution
ResNet50 Self-Attention
ResNet50 LESA
WRN50 Convolution
WRN50 Self-Attention
WRN50 LESA
Table 2: ImageNet classification results. We replace the spatial convolutions in the 3rd and 4th stages of ResNet he2016deep and WRN zagoruyko2016wide with self-attention and LESA. We employ the position embedding method used in wang2020axial . For self-attention, the weight percentages are obtained by tracking the softmax operations, in the same way as in Tab. 1. For LESA, the weight ω in Equ. (7), the output of the dynamic fusion module, is tracked; 1 and ω are the weights of the unary and binary terms in LESA, which are averaged across the spatial locations, feature channels, and layers to obtain the final weight percentages. We can observe that, compared with self-attention, LESA achieves a more balanced utilization of the unary and binary terms. In both top-1 and top-5 accuracy, LESA outperforms the other baselines. LESA shows an even more significant improvement in object detection, where local cues are particularly important.
Backbone Operations Epochs
ResNet50 Convolution
ResNet50 Self-Attention
ResNet50 LESA
WRN50 Convolution
WRN50 Self-Attention
WRN50 LESA
ResNet50 Deformable Conv.
ResNet50 LESA
WRN50 Deformable Conv.
WRN50 LESA
WRN50 HTC Conv.
WRN50 LESA
WRN50 HTC Conv_H
WRN50 LESA_H
Table 3: COCO object detection on val2017. We use the Mask-RCNN he2017mask and HTC chen2019hybrid frameworks and employ FPN lin2017feature on the ResNet he2016deep and WRN zagoruyko2016wide backbones. We adopt two standard training schedules of different lengths; in each, the learning rate is reduced by a fixed factor at two pre-defined epochs. The images are resized to the default resolution, and the postfix _H (higher resolution) indicates that they are resized to a larger resolution. We can observe that LESA outperforms the convolution, self-attention, and deformable convolution baselines in all the experiments.
Backbone Operations Epochs
ResNet50 Convolution
ResNet50 Self-Attention
ResNet50 LESA
WRN50 Convolution
WRN50 Self-Attention
WRN50 LESA
ResNet50 Deformable Conv.
ResNet50 LESA
WRN50 Deformable Conv.
WRN50 LESA
WRN50 HTC Conv.
WRN50 LESA
WRN50 HTC Conv_H
WRN50 LESA_H
Table 4: COCO instance segmentation on val2017.
Backbone Operations Epochs
ResNet50 LESA
WRN50 LESA
ResNet50 LESA
WRN50 LESA
WRN50 HTC + LESA
WRN50 HTC + LESA_H
Table 5: COCO object detection on test-dev2017 for the models in Tab. 3.
Backbone Operations Epochs
ResNet50 LESA
WRN50 LESA
ResNet50 LESA
WRN50 LESA
WRN50 HTC + LESA
WRN50 HTC + LESA_H
Table 6: COCO instance segmentation on test-dev2017 for the models in Tab. 4.

Settings We perform image classification experiments on ILSVRC2012 russakovsky2015imagenet , a popular subset of the ImageNet database deng2009imagenet . There are about 1.28 million images in the training set and 50,000 images in the validation set. In total, it includes 1,000 object classes, each of which has approximately the same number of training images and exactly the same number of validation images.

ResNet he2016deep , a family of canonical models and backbones for vision tasks, and its larger variant WRN zagoruyko2016wide are used to study LESA. There are 4 stages in ResNet and each one is formed by a series of bottleneck blocks; ResNet50 can be represented by the bottleneck numbers (3, 4, 6, 3). We replace the 3x3 convolution in each bottleneck with self-attention and LESA. The kernel channels of these 3x3 convolutions in WRN are twice as large as those in ResNet.

We perform the replacement in the last two stages, which is enough to show the advantages of LESA. For the convolution baselines, we use the official Torchvision models paszke2019pytorch . For the self-attention baselines and our LESA, we use the same head number for both and train the models from scratch. We set the stride of the last stage following srinivas2021bottleneck , and keep the first bottleneck in stage 3 unchanged, which contains the strided convolution. We employ a canonical training scheme with a linear warm-up, a fixed epoch budget, and a fixed batch size. Following ramachandran2019stand ; wang2020axial , we employ SGD with Nesterov momentum nesterov1983method ; sutskever2013importance and a cosine annealing learning rate schedule. The experiments are performed on NVIDIA TITAN XP graphics cards.
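As a concrete illustration of this replacement (a sketch under assumptions, not the released code: the module name LESALayer is hypothetical and the handling of strides is simplified), torchvision's ResNet-50 exposes the 3x3 convolution of each bottleneck as conv2, so the last two stages can be rewired as follows.

```python
import torchvision

def replace_spatial_convs(model, make_layer):
    """Swap the 3x3 convolution (conv2) of the bottlenecks in the last two
    stages (layer3, layer4) of a torchvision ResNet for a LESA-style module.
    As described above, the first (strided) bottleneck of stage 3 is kept."""
    for stage, skip_first in ((model.layer3, True), (model.layer4, False)):
        for idx, block in enumerate(stage):
            if skip_first and idx == 0:
                continue  # keep the bottleneck that performs the strided convolution
            conv = block.conv2
            block.conv2 = make_layer(conv.in_channels, conv.out_channels, conv.stride[0])
    return model

# Usage sketch with a hypothetical LESALayer(in_ch, out_ch, stride):
# resnet = torchvision.models.resnet50()
# resnet = replace_spatial_convs(resnet, lambda i, o, s: LESALayer(i, o, stride=s))
```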

Results The results are summarized in Tab. 2. For both the top-1 and top-5 accuracy, LESA surpasses the convolution and self-attention baselines. Our dynamic fusion module controls the binary term using ω in Equ. (7), so the weights for the unary and binary terms are 1 and ω, respectively. As ω depends on the inputs, spatial locations, and feature channels, we average the weights across them in our records. In self-attention, the weights are calculated from the softmax operations as in the ablation study of Sec. 2.1. We observe that the weight distribution in self-attention is imbalanced: the unary term receives only a small weight percentage, many times smaller than the binary term's, whereas for LESA the two weight percentages are much more balanced. In the task of object detection, where local cues are particularly important, LESA shows a larger improvement, as presented in Sec. 3.2.

3.2 COCO Object Detection and Instance Segmentation

Settings We perform object detection experiments on the COCO dataset lin2014microsoft and use the 2017 dataset splits. There are about 118k images in the training set and 5k images in the validation set. In total there are 80 object categories, and on average each image contains 3.5 categories and 7.7 instances.

The widely used Mask-RCNN he2017mask and HTC chen2019hybrid with FPN-equipped backbones lin2017feature are used to study LESA for object detection and instance segmentation. We use mmdetection chen2019mmdetection as the codebase. The ImageNet pre-trained checkpoints are utilized to initialize the backbones. The ResNet-FPN backbone produces feature maps at a series of output strides, and we replace the spatial convolutions in the 3rd and 4th stages. The images are resized to two resolutions in the experiments, the larger one indicated by the postfix _H. Since the image size in classification is different, we initialize new position embedding layers as used in shaw2018self ; ramachandran2019stand . For training, we employ two standard schedules of different lengths; in each, the learning rate is reduced by a fixed factor at two pre-defined epochs. For the HTC framework, we employ multi-scale training for both the baseline and our method, resizing both sides of each image to a scale drawn uniformly from one of two pre-defined ranges. Mask-RCNN does not use multi-scale training.

We also study adopting deformable unary terms in LESA. Specifically, we replace W_g in Equ. (6) with deformable convolutions dai2017deformable and fix the group number of the offsets. Following the standard setting qiao2020detectors , the convolutions in the 2nd stage of both the baselines and our models are also replaced with deformable convolutions. Our experiments with the Mask-RCNN framework are performed on NVIDIA TITAN XP graphics cards and those with the HTC framework on TITAN RTX graphics cards.
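One possible way to realize this deformable unary branch (a sketch, not the authors' implementation; the offset-predictor design, its zero initialization, and the single offset group are assumptions) is to pair torchvision's DeformConv2d with a small convolution that predicts the sampling offsets.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLocalTerm(nn.Module):
    """Deformable version of the unary branch: W_g in Equ. (6) becomes a
    deformable 3x3 convolution whose offsets are predicted from the input."""
    def __init__(self, in_ch, out_ch, kernel_size=3, offset_groups=1):
        super().__init__()
        pad = kernel_size // 2
        # 2 offsets (dx, dy) per kernel position and per offset group.
        self.offset = nn.Conv2d(in_ch, 2 * offset_groups * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        nn.init.zeros_(self.offset.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset.bias)
        self.deform = DeformConv2d(in_ch, in_ch, kernel_size, padding=pad)
        self.project = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.project(self.deform(x, self.offset(x)))
```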

Results The results are summarized in Tab. 3, 4, 5, and 6. We use the same testing pipeline for val2017 and test-dev2017. LESA provides the best bounding-box mAP and mask mAP for small, medium, and large objects compared with the convolution, self-attention, and deformable convolution baselines in all scenarios.

Operations Params (M) Accuracy (%) Weights (%)
Top-1 Top-5 Unary Binary
Convolution
Self-Attention
Static LESA
LESA
(a) Ablation study on ImageNet classification.
Operations
Convolution
Self-Attention
Static LESA
LESA
(b) Ablation study on COCO val2017 object detection.
Operations
Convolution
Self-Attention
Static LESA
LESA
(c) Ablation study on COCO val2017 instance segmentation.
Table 7: Ablation study to investigate the effectiveness of the unary term and the dynamic fusion module. The spatial convolutions in the final stage of a pre-trained ResNet50 he2016deep are replaced. For image classification, the first three stages are frozen and the models are fine-tuned for 2 warm-up and 35 training epochs. For object detection and instance segmentation, we employ the shorter schedule. It is observed that the dynamic fusion module contributes the most to the performance improvement for classification, while the unary term contributes the most for object detection and instance segmentation. Both of them are important for LESA.

4 Ablation studies

Settings In this section, we perform ablation studies to investigate the unary term and the dynamic fusion module. Static LESA adds the unary and binary terms with fixed equal weights, without regard to the inputs, spatial locations, and feature channels. In addition, for static LESA we use a grouped convolution with more parameters and representational power as the unary term. Specifically, we take a pre-trained ResNet50 he2016deep and replace the spatial convolutions in the last stage. During training for image classification, we freeze the first three stages and adjust the training length to 2 warm-up and 35 training epochs. The other settings follow Sec. 3.2.

Results The results are summarized in Tab. 7. Both static LESA and LESA benefit from the presence of the unary term and outperform the other baselines in object detection and instance segmentation. For detecting small and large objects, LESA behaves better than the static version. In classification, the advantage of dynamic fusion is more clear. Both the unary term and the dynamic fusion mechanism are important parts of LESA.

5 Related work

5.1 Convolution

Convolutional Neural Networks (CNNs) have become the dominant models in computer vision in the last decade. AlexNet krizhevsky2012imagenet showed considerable improvement over models based on hand-crafted features perronnin2010improving ; sanchez2011high and opened the door to the age of deep neural networks. Many efforts have been made to increase the width and depth and to improve the architecture and efficiency of CNNs in the pursuit of performance, including the designs of VGG simonyan2014very , GoogLeNet szegedy2015going , ResNet he2016deep , WRN zagoruyko2016wide , ResNeXt xie2017aggregated , DenseNet huang2017densely , SENet hu2018squeeze , MobileNet howard2017mobilenets , EfficientNet tan2019efficientnet , etc. Through this process, the convolution layers themselves have also been developed, leading to grouped convolutions xie2017aggregated , depth-wise separable convolutions chollet2017xception , deformable convolutions dai2017deformable ; zhu2019deformable , atrous convolutions chen2014semantic ; papandreou2015modeling , and switchable atrous convolutions qiao2020detectors ; chen2020scaling .

5.2 Self-Attention

The impact of self-attention on the vision community continues to grow. Self-attention was originally proposed for neural machine translation bahdanau2014neural . It enables the encoder-decoder model to adaptively extract useful information from a variable-length sentence according to its content. In computer vision, non-local neural networks wang2018non show that self-attention is an instantiation of non-local means buades2005non and use it to capture long-range dependencies that augment CNNs for tasks including video classification and object detection. A2-Net chen20182 employs a variant of non-local means, and Attention Augmentation bello2019attention augments convolution features with attention features; both show performance improvements on image classification. Recently, fully attentional methods ramachandran2019stand ; hu2019local ; zhao2020exploring , which replace all the spatial convolutions in deep networks with self-attention, have been proposed and achieve stronger performance than CNNs. Axial attention wang2020axial factorizes 2D self-attention into two consecutive 1D self-attentions, which reduces the computational complexity and enables the self-attention layer to have a global kernel. Self-attention also underpins the development of transformers vaswani2017attention ; dosovitskiy2020image ; touvron2020training ; carion2020end ; wu2020lite ; liu2021swin . BoTNet srinivas2021bottleneck relates the transformer block to the fully attentional version of the bottleneck block in ResNet he2016deep .

6 Conclusion

From the perspective of fully connected Conditional Random Fields (CRFs), we decouple self-attention into local and context terms. These are the unary and binary terms that are calculated from the queries, keys, and values in the attention mechanism. However, the local and context cues are not clearly distinguished, as they are obtained using the same set of projection matrices. In addition, we observe that the contribution of the local terms, which is controlled by the softmax operation, is very small. By contrast, standard Convolutional Neural Networks (CNNs), which rely solely on the local terms, show excellent performance in various vision tasks.

In this work, we propose Locally Enhanced Self-Attention (LESA). First, we enhance the unary term by incorporating it with convolutions. The multi-head mechanism is realized by using grouped convolution followed by the projection layer. Second, we propose a dynamic fusion module to combine the unary and binary terms. Their relative weights are adaptively changed with specific inputs, spatial locations, and feature channels. We replace the self-attention with LESA and perform the experiments on the challenging large-scale datasets, ImageNet and COCO. All the results demonstrate the superiority of LESA over the convolution and self-attention baselines in the tasks of image classification, object detection, and instance segmentation.

7 Discussion

LESA shares a common limitation with self-attention, namely its large memory consumption. This is due to the large dimensions of the similarity matrix, which is computed from the queries and keys and to which the softmax operation is applied. Our future work includes designing a LESA variant that consumes less memory while retaining the power to capture context information; this would also address the common memory issue in other self-attention models. Like convolution and self-attention, LESA is a technical tool that does not introduce additional foreseeable societal problems. It helps improve vision models and poses no specific new risk.

References