Beyond machine translation [bahdanau2014neural] and speech recognition [chorowski2015attention], the attention mechanism has been demonstrated to be effective in many vision tasks such as image captioning [xu2015show, chen2017sca], classification [wang2017residual, woo2018cbam], object detection [ba2014multiple], and person re-identification [zhao2017deeply, liu2017end, kalayeh2018human, wang2018mancs, li2018harmonious, liu2018spatial]. Attention tells “where” (i.e., attentive spatial locations) and “what” (i.e., attentive channels) to focus on, in order to enhance the task-oriented representation power of features.
When facing a visual scene, humans can efficiently pay attention to the parts of interest [corbetta2002control]. Attention [wang2017residual, woo2018cbam, chen2017sca, wang2018mancs] intends to simulate the human visual system by paying different amounts of attention to different features. It acts as a modulator that suppresses unnecessary features and reinforces important ones by multiplying the features with a mask, thereby balancing the relative importance among different components. For CNNs, convolutional operations with limited receptive fields are usually used to compute the attention [chen2017sca, woo2018cbam, wang2017residual, wang2018mancs]. However, attention learned locally can hardly reach optimal representativeness. One solution is to use a large filter size in the convolution layer [woo2018cbam]. Another is to stack deep layers [wang2017residual], which greatly increases the network size. Besides, the study in [luo2016understanding] shows that the effective receptive field of a CNN only takes up a fraction of the full theoretical receptive field. These solutions therefore cannot ensure the effective capture of global information, which limits the exploration of correlations among features for attention learning.
Inspired by the common sense that people often know the importance of something through comparison, in this paper we propose a simple yet effective Relation-Aware Global Attention (RGA) module to learn the attention globally. For each node (e.g., a feature vector at a spatial position), we take the pairwise relations of all the nodes with respect to the current node to represent the global structure information for learning attention. As illustrated in Fig. 1 (c), for each node (feature vector), we grasp global information from all nodes by calculating their pairwise relations, i.e., correlations/similarities, and then stack these pairwise relation values together with the original feature to form the feature of this node for deriving its attention. With global structure information embedded in each local position, convolution operations are used to infer the attention intensity.
We showcase the effectiveness of RGA in the task of person re-identification, where many previous works strive to address the challenges by leveraging attention designs [zhao2017deeply, liu2017end, kalayeh2018human, wang2018mancs, li2018harmonious, liu2018spatial]. Extensive ablation studies demonstrate the effectiveness of our design in comparison with other attention mechanisms. The models with our proposed attention appended achieve state-of-the-art performance on the benchmark datasets CUHK03 [li2014deepreid], Market1501 [zheng2015scalable], and MSMT17 [wei2018person]. Moreover, when applied on top of the strong non-local neural networks, our RGA module further increases the accuracy. This also indicates that our relation-aware global attention is complementary to the non-local filter idea [wang2018non].
To demonstrate the generality of the proposed relation-aware global attention modules, we further conduct experiments on scene segmentation using the popular Cityscapes dataset [cordts2016cityscapes], and on image classification using the CIFAR dataset [krizhevsky2009learning]. Experimental results show that our global attention models consistently improve the performance over the baselines.
In summary, we have made three major contributions:
We propose to globally determine the attention by taking a global view of the mutual relations among the features. Specifically, for a feature node, we propose a compact representation by stacking the pairwise relations with respect to this node and the feature itself.
We design a relation-aware global attention (RGA) module based on the relation-based compact representation and shallow convolutional layers. We apply this design to the spatial and channel dimensions respectively, and demonstrate the effectiveness of the global attention module in both cases as well as in their combination.
Extensive ablation studies on person re-identification demonstrate the effectiveness of our proposed RGA, which provides the state-of-the-art performance. We also demonstrate the general applicability of RGA to other vision tasks such as scene segmentation and image classification.
2 Related Work
Attention. Attention aims to focus on important features and suppress irrelevant features. Intuitively, to have a good sense of whether a local region is important or not, one should know the global scene. For CNNs, most of the current selective attention modules learn attention from limited local contexts. Wang et al. [wang2017residual] propose an encoder-decoder style attention module by stacking many convolutional layers. In [zhang2019nonlocal], a non-local block [wang2018non] is inserted before the encoder-decoder style attention module to enable attention learning based on globally refined features. In the Convolutional Block Attention Module [woo2018cbam], a convolution layer with a large filter size of $7 \times 7$ is applied over the cross-channel pooled spatial features to produce a spatial attention map. Limited by their practical receptive fields, none of these approaches is efficient in capturing large-scope information to globally determine the spatial attention. In [liu2018spatial], for the spatial attention model, two FC layers are applied to the cross-channel average-pooled feature map to generate a spatial attention map. Each spatial position corresponds to an FC node with unshared parameters. This may require a large number of parameters and make the training difficult. Hu et al. [hu2018squeeze] use two FC layers over spatially average-pooled features to compute channel-wise attention. None of these methods explicitly exploits the inter-channel (or spatial) relations to enable the assessment of relative importance.
We address these problems by proposing a Relation-Aware Global Attention module. For each feature position, we embed the pairwise relation between this position and all the positions to capture the global information. Then, two convolutional layers with translation invariant property are used to efficiently exploit the global structural information.
Non-local/Global Information Exploration. Exploration of non-local/global information has been demonstrated to be very useful for image denoising [buades2005non, BM3DTIP07, Lefki2017non], texture synthesis [efros1999texture], super-resolution [glasner2009super], inpainting [barnes2009patchmatch], and even high-level tasks such as image recognition and object segmentation [wang2018non]. The works in [buades2005non, wang2018non] adopt the non-local mean idea, which computes a weighted summation of the non-local and local pixels/features as the refined representation of the target pixels/features. The weight value connecting every two positions represents their relationship and is calculated from the similarity/correlation of the pair. All the positions (nodes) with the mutually connected edges construct a global graph. In [guo2018neural], the graph structure (adjacency) is used as part of the features for graph matching for few-shot 3D action recognition.
Note that for one node, all its edges and the corresponding location information of the connected nodes deliver the global relationships with all the nodes. This information can tell the clustering states and spatial clustering patterns. In this work, we leverage this global structural information associated with a node to globally learn the attention (which is used to modulate the feature of this node) using convolutional layers within general CNN frameworks. This is in contrast to the weighted summation operation in non-local mean. We will experimentally demonstrate their complementary roles in Section 4.
3 Relation-Aware Global Attention
For attention, to have a good sense of the importance of a feature, one should know the other features for an objective assessment. Global information is therefore essential. We thus propose a relation-aware global attention module which makes use of the structural relation information. In the following, we first give the problem formulation and present our main idea in Subsection 3.1. For CNNs, we decouple the attention into a spatial relation-aware global attention and a channel relation-aware global attention, and elaborate on them in Subsections 3.2 and 3.3, respectively. Finally, we briefly introduce their joint use in Subsection 3.4.
3.1 Problem Formulation and Main Idea
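Generally, for a feature set $\mathcal{V} = \{\mathbf{x}_i \in \mathbb{R}^{d},\ i = 1, \dots, N\}$ of $N$ correlated features, each of $d$ dimensions, our goal is to learn a mask denoted by $\mathbf{a} = (a_1, \dots, a_N) \in \mathbb{R}^{N}$ for the $N$ features to weight them according to their relative importance. The updated feature through attention is $\mathbf{z}_i = a_i \mathbf{x}_i$.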
To learn the attention value $a_i$ of the $i$-th feature vector, there are two common strategies, as illustrated in Fig. 1 (a) and (b). (a) Local attention: each feature determines its attention locally, e.g., using a shared transformation function $\mathcal{F}$ on itself, i.e., $a_i = \mathcal{F}(\mathbf{x}_i)$. This local strategy does not fully exploit the correlations with other features [chen2017sca]. For vision tasks, deep layers [wang2017residual] or large-sized kernels [woo2018cbam] are used to make the attention learning more global. (b) Global attention: one straightforward solution is to use all the features together to jointly learn attention, e.g., $\mathbf{a} = \mathcal{G}(\mathbf{x}_1, \dots, \mathbf{x}_N)$, using fully connected operations. However, this is usually computationally expensive and requires a large number of parameters, especially when the number of features $N$ is large [liu2018spatial].
In contrast to these strategies, we propose a relation-aware global attention that enables i) the exploitation of global structural information, and ii) the use of a shared transformation function for different individual features to derive the attention. For visual tasks, the latter makes it possible to globally compute the attention by using local convolutional operations. We illustrate our basic idea in Fig. 1 (c).
The main idea is to exploit the pairwise relations (affinities/similarities) related to a feature to represent this feature node’s global structural information. Specifically, we use $r_{i,j}$ to represent the affinity between the $i$-th feature and the $j$-th feature. For the feature $\mathbf{x}_i$, its affinity vector is $\mathbf{r}_i = [r_{i,1}, \dots, r_{i,N}, r_{1,i}, \dots, r_{N,i}]$. Then, we use the feature itself and the pairwise relations, i.e., $\mathbf{y}_i = [\mathbf{x}_i, \mathbf{r}_i]$, as the feature used to infer its attention using a shared transformation function. Note that $\mathbf{y}_i$ contains global information.
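Mathematically, we denote the set of features and their relations by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, which comprises the node set $\mathcal{V}$ of $N$ features, together with an edge set $\mathcal{E} = \{r_{i,j} \in \mathbb{R},\ i = 1, \dots, N \text{ and } j = 1, \dots, N\}$. The edge $r_{i,j}$ represents the relation between the $i$-th node and the $j$-th node. The pairwise relations for all the nodes can be represented by an affinity matrix $R \in \mathbb{R}^{N \times N}$, where the relation between node $i$ and node $j$ is $r_{i,j} = R(i, j)$.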
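The formulation above can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the function name `relation_aware_features` and the random projection matrices `A` and `B` (standing in for learned embeddings) are hypothetical and not taken from the paper.

```python
import numpy as np

def relation_aware_features(X, A, B):
    """Stack each feature with its pairwise relations to all features.

    X: (N, d) array, one row per feature node.
    A, B: (d, k) projections (assumed stand-ins for learned embeddings).
    Returns Y of shape (N, d + 2N) and the affinity matrix R of shape (N, N).
    """
    theta = X @ A                 # (N, k) embedding used for the "from" side
    phi = X @ B                   # (N, k) embedding used for the "to" side
    R = theta @ phi.T             # R[i, j] = <theta_i, phi_j>, generally asymmetric
    # Row i of R holds relations from node i to all nodes; row i of R.T is
    # column i of R, i.e. relations from all nodes to node i.
    r = np.concatenate([R, R.T], axis=1)      # (N, 2N), row i is the relation vector of node i
    Y = np.concatenate([X, r], axis=1)        # (N, d + 2N), feature plus relations
    return Y, R

rng = np.random.default_rng(0)
N, d, k = 5, 3, 2
X = rng.normal(size=(N, d))
A = rng.normal(size=(d, k))
B = rng.normal(size=(d, k))
Y, R = relation_aware_features(X, A, B)
print(Y.shape)  # (5, 13)
```

Each row of `Y` depends on every node through its relation entries, so a transformation shared across rows can still see global structure.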
3.2 Spatial Relation-Aware Attention
Given an intermediate feature map (tensor) $X \in \mathbb{R}^{C \times H \times W}$ of width $W$, height $H$, and $C$ channels from a CNN layer, we design a spatial relation-aware attention block, namely RGA-S, for learning a spatial attention map of size $H \times W$. We take the $C$-dimensional feature vector at each spatial position as a feature node. All the spatial positions form a graph $\mathcal{G}_s$ of $N = W \times H$ nodes. As illustrated in Fig. 2 (a), we raster scan the spatial positions and assign their identification numbers as $1, \dots, N$. We represent the $N$ feature nodes as $\mathbf{x}_i \in \mathbb{R}^{C}$, where $i = 1, \dots, N$.
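The pairwise relation (affinity) $r_{i,j}$ from node $i$ to node $j$ can be defined as a dot-product affinity in the embedding spaces as:

$$r_{i,j} = f_s(\mathbf{x}_i, \mathbf{x}_j) = \theta_s(\mathbf{x}_i)^{T} \phi_s(\mathbf{x}_j), \tag{1}$$

where $\theta_s$ and $\phi_s$ are two embedding functions implemented by a $1 \times 1$ convolution, i.e., $\theta_s(\mathbf{x}_i) = \mathrm{ReLU}(W_\theta \mathbf{x}_i)$, $\phi_s(\mathbf{x}_i) = \mathrm{ReLU}(W_\phi \mathbf{x}_i)$, where $W_\theta \in \mathbb{R}^{\frac{C}{s_1} \times C}$ and $W_\phi \in \mathbb{R}^{\frac{C}{s_1} \times C}$. $s_1$ is a pre-defined positive integer which controls the dimension reduction ratio. Note that BN operations are all omitted to simplify the notation. Similarly, we can get the affinity from node $j$ to node $i$ as $r_{j,i} = f_s(\mathbf{x}_j, \mathbf{x}_i)$. We use the pair $(r_{i,j}, r_{j,i})$ to describe the bi-directional relations between $\mathbf{x}_i$ and $\mathbf{x}_j$. Then, we represent the pairwise relations for all the nodes by an affinity matrix $R_s \in \mathbb{R}^{N \times N}$.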
For the $i$-th feature/node, we collect its pairwise relations with all the nodes, i.e., a relation vector $\mathbf{r}_i = [R_s(i, :), R_s(:, i)] \in \mathbb{R}^{2N}$, to represent the global structural information. As illustrated in Fig. 2 (a), the sixth row and the sixth column of the affinity matrix $R_s$, i.e., $\mathbf{r}_6 = [R_s(6, :), R_s(:, 6)]$, are taken as the relation features for deriving the attention of the sixth spatial position.
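To infer the attention of the $i$-th feature/node, besides the pairwise relation items $\mathbf{r}_i$, we also include the feature itself $\mathbf{x}_i$ to make use of both the global mutual relations and the local original information. Considering that these two kinds of information are not in the same feature domain, we embed them separately and concatenate them to obtain the spatial relation-aware feature $\tilde{\mathbf{y}}_i$:

$$\tilde{\mathbf{y}}_i = [\mathrm{pool}_c(\psi_s(\mathbf{x}_i)),\ \varphi_s(\mathbf{r}_i)], \tag{2}$$

where $\psi_s$ and $\varphi_s$ denote the embedding functions for the feature itself and the global relations, respectively. They are both implemented by a spatial $1 \times 1$ convolutional layer followed by Batch Normalization and ReLU activation, i.e., $\psi_s(\mathbf{x}_i) = \mathrm{ReLU}(W_\psi \mathbf{x}_i)$, $\varphi_s(\mathbf{r}_i) = \mathrm{ReLU}(W_\varphi \mathbf{r}_i)$, where $W_\psi \in \mathbb{R}^{\frac{C}{s_1} \times C}$, $W_\varphi \in \mathbb{R}^{\frac{N}{s_1} \times 2N}$. $\mathrm{pool}_c(\cdot)$ denotes the global average pooling operation along the channel dimension, which further reduces the dimension to 1. Then $\tilde{\mathbf{y}}_i \in \mathbb{R}^{1 + N/s_1}$. Note that other convolution kernel sizes, e.g., $3 \times 3$, can also be used. We found that they achieve very similar performance, and we use the $1 \times 1$ convolutional layer for lower complexity.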
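The spatial attention value $a_i$ for the $i$-th feature/node is then obtained by:

$$a_i = \mathrm{Sigmoid}\big(W_2\, \mathrm{ReLU}(W_1 \tilde{\mathbf{y}}_i)\big), \tag{3}$$

where $W_1$ and $W_2$ are implemented by $1 \times 1$ convolution followed by batch normalization. $W_1$ shrinks the channel dimension with a ratio of $s_2$ and $W_2$ transforms the channel dimension to 1.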
Note that all these operations are achieved by convolution operations and the global relations are also exploited.
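The data flow of this subsection can be traced with a minimal NumPy sketch. It is a shape-level illustration, not the authors' implementation: weights are random rather than learned, batch normalization is omitted, each $1 \times 1$ convolution is written as a matrix product over positions, and the function name `rga_spatial` and the default ratios are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rga_spatial(X, s1=2, s2=2, seed=0):
    """Spatial relation-aware attention sketch for one feature map X of shape (C, H, W)."""
    C, H, W = X.shape
    N = H * W
    rng = np.random.default_rng(seed)

    def init(out_dim, in_dim):
        # Random stand-in for a learned 1x1 convolution weight.
        return rng.normal(scale=in_dim ** -0.5, size=(out_dim, in_dim))

    W_theta, W_phi, W_psi = init(C // s1, C), init(C // s1, C), init(C // s1, C)
    W_rel = init(N // s1, 2 * N)
    D = 1 + N // s1                         # size of the relation-aware feature
    W1, W2 = init(max(D // s2, 1), D), init(1, max(D // s2, 1))

    x = X.reshape(C, N)                     # column i is node i (raster-scan order)
    theta, phi = relu(W_theta @ x), relu(W_phi @ x)
    R = theta.T @ phi                       # (N, N), R[i, j] = theta(x_i)^T phi(x_j)
    r = np.concatenate([R, R.T], axis=1).T  # (2N, N), column i is [R(i, :), R(:, i)]
    feat = relu(W_psi @ x).mean(axis=0, keepdims=True)  # (1, N), pooled over channels
    rel = relu(W_rel @ r)                   # (N // s1, N), embedded relations
    y = np.concatenate([feat, rel], axis=0) # (1 + N // s1, N)
    a = sigmoid(W2 @ relu(W1 @ y))          # (1, N), one attention value per position
    mask = a.reshape(1, H, W)
    return mask * X, mask                   # mask broadcasts over channels

X = np.random.default_rng(1).normal(size=(8, 4, 3))
out, mask = rga_spatial(X)
print(out.shape, mask.shape)  # (8, 4, 3) (1, 4, 3)
```

Every matrix product acts on each position independently with shared weights; the global context enters only through the relation vectors stacked into `r`.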
3.3 Channel Relation-Aware Attention
Given an intermediate feature map (tensor) $X \in \mathbb{R}^{C \times H \times W}$, we design a relation-aware channel attention block, namely RGA-C, for learning a channel attention vector of $C$ dimensions. We take the $d$-dimensional feature vector at each channel, with $d = H \times W$, as a feature node. All the channels form a graph $\mathcal{G}_c$ of $C$ nodes. We represent the $C$ feature nodes as $\mathbf{x}_i \in \mathbb{R}^{d}$, where $i = 1, \dots, C$.
Similar to the spatial relation, the pairwise relation (affinity) $r_{i,j}$ from node $i$ to node $j$ can be defined as a dot-product affinity in the embedding spaces as:

$$r_{i,j} = f_c(\mathbf{x}_i, \mathbf{x}_j) = \theta_c(\mathbf{x}_i)^{T} \phi_c(\mathbf{x}_j), \tag{4}$$

where $\theta_c$ and $\phi_c$ are two embedding functions that are shared among feature nodes. We achieve the embedding by first spatially flattening the input tensor $X$ into $X' \in \mathbb{R}^{(HW) \times C \times 1}$ and then using a $1 \times 1$ convolution layer with batch normalization followed by ReLU activation to perform a transformation on $X'$. As illustrated in Fig. 2 (b), we obtain and then represent the pairwise relations for all the nodes by an affinity matrix $R_c \in \mathbb{R}^{C \times C}$.
For the $i$-th feature/node, we collect its pairwise relations with all the nodes, i.e., a relation vector $\mathbf{r}_i = [R_c(i, :), R_c(:, i)] \in \mathbb{R}^{2C}$, to represent the global structural information. As illustrated in Fig. 2 (b), the third row and the third column of the affinity matrix $R_c$, i.e., $\mathbf{r}_3 = [R_c(3, :), R_c(:, 3)]$, are taken as the relation features for deriving the attention of the third channel node.
To infer the attention of the $i$-th feature/node, similar to the derivation of the spatial attention, besides the pairwise relation items $\mathbf{r}_i$, we also include the feature itself $\mathbf{x}_i$. Similar to Eq. (2) and (3), we obtain the channel relation-aware feature $\tilde{\mathbf{y}}_i$ and then the channel attention value $a_i$ for the $i$-th channel. Note that all the transformation functions are shared by the nodes/channels. There are no fully connected operations across channels.
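The channel variant follows the same steps with the roles of channels and positions swapped. The sketch below is again only an illustration under assumptions: random weights replace learned ones, batch normalization is omitted, and the name `rga_channel` and the default ratios are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rga_channel(X, s1=2, s2=2, seed=0):
    """Channel relation-aware attention sketch for one feature map X of shape (C, H, W)."""
    C, H, W = X.shape
    d = H * W
    rng = np.random.default_rng(seed)

    def init(out_dim, in_dim):
        # Random stand-in for a learned, node-shared linear map.
        return rng.normal(scale=in_dim ** -0.5, size=(out_dim, in_dim))

    W_theta, W_phi, W_psi = init(d // s1, d), init(d // s1, d), init(d // s1, d)
    W_rel = init(C // s1, 2 * C)
    D = 1 + C // s1
    W1, W2 = init(max(D // s2, 1), D), init(1, max(D // s2, 1))

    x = X.reshape(C, d).T                   # (d, C), column i is channel node i
    theta, phi = relu(W_theta @ x), relu(W_phi @ x)
    R = theta.T @ phi                       # (C, C) channel affinity matrix
    r = np.concatenate([R, R.T], axis=1).T  # (2C, C), column i is [R(i, :), R(:, i)]
    feat = relu(W_psi @ x).mean(axis=0, keepdims=True)  # (1, C)
    rel = relu(W_rel @ r)                   # (C // s1, C)
    y = np.concatenate([feat, rel], axis=0) # (1 + C // s1, C)
    a = sigmoid(W2 @ relu(W1 @ y))          # (1, C), one attention value per channel
    mask = a.reshape(C, 1, 1)
    return mask * X, mask                   # mask broadcasts over spatial positions

X = np.random.default_rng(1).normal(size=(6, 4, 2))
out, mask = rga_channel(X)
print(out.shape, mask.shape)  # (6, 4, 2) (6, 1, 1)
```

No weight matrix here mixes different channels' attention values directly; each channel's value is computed from its own column of `y` with shared weights.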
3.4 Joint Relation-Aware Global Attention
The spatial attention RGA-S and the channel attention RGA-C play complementary roles. Both can be applied at any stage of a convolutional network and trained in an end-to-end manner without any additional auxiliary supervision. They can be used alone or in combination. We suggest using them jointly in a sequential manner, which is easier to train than the parallel manner. Take the sequential spatial-channel combination as an example. Given an intermediate feature map (tensor) $X$, after the modulation by the spatial RGA, we get the feature map $X'$. Afterwards, the channel RGA is derived from and applied on $X'$. We will discuss the results of using each alone versus in combination, and of sequential versus parallel aggregation, in the next section.
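The sequential orders can be sketched as follows. The two mask functions are deliberately simple, hypothetical stand-ins (pooled sigmoid masks) for RGA-S and RGA-C; only the order of application and the broadcasting are the point of the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_mask(X):
    # Hypothetical stand-in for RGA-S: (C, H, W) -> (1, H, W)
    return sigmoid(X.mean(axis=0, keepdims=True))

def channel_mask(X):
    # Hypothetical stand-in for RGA-C: (C, H, W) -> (C, 1, 1)
    return sigmoid(X.mean(axis=(1, 2), keepdims=True))

def sequential_sc(X):
    """Spatial first, then channel attention derived from the modulated map."""
    X1 = spatial_mask(X) * X
    return channel_mask(X1) * X1

def sequential_cs(X):
    """Channel first, then spatial attention derived from the modulated map."""
    X1 = channel_mask(X) * X
    return spatial_mask(X1) * X1

X = np.random.default_rng(0).normal(size=(4, 3, 2))
print(sequential_sc(X).shape, sequential_cs(X).shape)  # (4, 3, 2) (4, 3, 2)
```

In the sequential form the second mask sees features already modulated by the first one, whereas a parallel form would compute both masks from the same input and then fuse them.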
4 Experiments
We showcase the effectiveness of our Relation-Aware Global Attention on the person re-identification task in Subsection 4.1. Extensive ablation studies demonstrate the effectiveness of our designs. Our models achieve state-of-the-art performance on the CUHK03 [li2014deepreid], Market1501 [zheng2015scalable] and MSMT17 [wei2018person] datasets. As extension experiments, we also investigate our models on scene segmentation and image classification.
4.1 Experiments on Person Re-identification
Person re-identification (re-ID) aims to match a specific person on different occasions from the same camera or across multiple cameras. It has become increasingly popular in both the research and industry communities due to its application and research significance [zheng2016person, lavi2018survey]. Generally, given an input image, we employ a convolutional neural network to obtain a feature vector. Re-identification then finds the images with the same identity by matching the feature vectors of images (based on feature distance).
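Matching by feature distance amounts to ranking gallery images by their distance to the query feature. The sketch below assumes Euclidean distance, which the text does not specify; the function name `rank_gallery` and the toy vectors are illustrative only.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted from most to least similar, plus the distances."""
    # Assumption: Euclidean distance between feature vectors.
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists), dists

gallery = np.array([[0.0, 1.0], [1.0, 0.0], [0.95, 0.1]])
query = np.array([1.0, 0.1])
order, dists = rank_gallery(query, gallery)
print(order)  # [2 1 0]
```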
4.1.1 Implementation Details and Datasets
Network Settings. Following the common practices in re-ID systems [bai2017deep, sun2017beyond, zhang2017alignedreid, almazan2018re, zhang2019densely], we take ResNet-50 [he2016deep] to build our baseline network and apply our RGA modules to the ResNet-50 backbone for effectiveness validation. Similar to [sun2017beyond, zhang2019densely], the last spatial down-sampling operation in the conv5_x block of ResNet-50 is removed. Unless otherwise specified, in our re-ID experiments we add the proposed RGA modules after all four residual blocks (conv2_x, conv3_x, conv4_x and conv5_x). Within the RGA modules, we set the ratio parameters $s_1$ and $s_2$ to 8. In all our re-ID experiments, we use both the identification (classification) loss with label smoothing [szegedy2016rethinking] and the triplet loss with hard mining [hermans2017defense] as supervision signals. Note that we do not apply re-ranking [zhong2017re] in any of our experiments, for clear comparisons. Please see our supplementary material for more details.
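The two supervision signals can be sketched in NumPy as below. This is a generic illustration of label-smoothed cross entropy and batch-hard triplet loss, not the authors' training code; the smoothing factor `eps=0.1` and the margin `0.3` are assumed values that the text does not give.

```python
import numpy as np

def label_smoothing_ce(logits, labels, eps=0.1):
    """Cross entropy against targets (1 - eps) * one_hot + eps / K."""
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    K = logits.shape[1]
    target = np.full(log_probs.shape, eps / K)
    target[np.arange(len(labels)), labels] += 1.0 - eps
    return float(-(target * log_probs).sum(axis=1).mean())

def batch_hard_triplet(feats, labels, margin=0.3):
    """For each anchor, use its farthest positive and closest negative in the batch."""
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))                     # (B, B) pairwise distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return float(np.maximum(hardest_pos - hardest_neg + margin, 0.0).mean())

feats = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
labels = np.array([0, 0, 1, 1])
logits = np.zeros((4, 2))
total = label_smoothing_ce(logits, labels) + batch_hard_triplet(feats, labels)
print(round(total, 4))  # 0.6931
```

In the toy batch the identities are well separated, so the triplet term is zero and the total equals the cross entropy of uniform predictions over two classes, i.e. ln 2.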
Training. We use the commonly used data augmentation strategies of random cropping [wang2018resource], horizontal flipping, and random erasing [zhong2017random, wang2018resource, wang2018mancs]. The input images are resized to the same size for all the datasets. The backbone network is pre-trained on ImageNet [deng2009imagenet]. We adopt the Adam optimizer with weight decay. Please refer to the supplementary material for more details.
Datasets and Evaluation Metrics. We conduct experiments on four public person re-ID datasets, i.e., CUHK03 [li2014deepreid], Market1501 [zheng2015scalable], DukeMTMC-reID [zheng2017unlabeled] and the large-scale MSMT17 [wei2018person]. To save space, the introduction of the datasets is placed in our supplementary material. For CUHK03, we use the new protocol as used by [zhong2017re, zheng2018pedestrian, he2018recognizing]. We follow the common practices and use the cumulative matching characteristics (CMC) at Rank-1 (R1) and the mean average precision (mAP) to evaluate the performance.
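For reference, Rank-1 and mAP can be computed from ranked match lists as in the plain-Python sketch below. It is a simplified illustration: it assumes each query already has a ranked list of booleans marking gallery items of the same identity, and it ignores protocol details such as discarding same-camera gallery images. The function names are hypothetical.

```python
def average_precision(ranked_matches):
    """AP for one query; ranked_matches[k] is True if the item at rank k+1 is a correct match."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)   # precision at each correct match
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(all_ranked_matches):
    """Return (Rank-1 accuracy, mAP) over all queries."""
    n = len(all_ranked_matches)
    rank1 = sum(1 for m in all_ranked_matches if m[0]) / n
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / n
    return rank1, mean_ap

queries = [
    [True, False, True],    # AP = (1/1 + 2/3) / 2
    [False, True, False],   # AP = 1/2
]
rank1, mean_ap = evaluate(queries)
print(rank1, round(mean_ap, 4))  # 0.5 0.6667
```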
4.1.2 Ablation Study
We perform the ablation studies on the CUHK03 (with the Labeled bounding box setting) and Market1501 datasets.
RGA related Models versus Baseline. Table 1 shows the comparisons of our spatial RGA (RGA-S), channel RGA (RGA-C), mixed RGA with both channel and spatial RGA, and the baseline. We have the following observations.
1) Either our spatial RGA model (RGA-S) or our channel RGA model (RGA-C) significantly improves the performance over our powerful baseline. On CUHK03, RGA-S, RGA-C, and the combined spatial and channel RGA model RGA-SC outperform the baseline by 5.7%, 5.9%, and 7.5% respectively in mAP accuracy, and by 5.5%, 5.2%, and 6.6% respectively in Rank-1 accuracy. On Market1501, even though the performance of the baseline is already very high (83.7% in mAP), our RGA-S and RGA-C improve the mAP accuracy by 3.1% and 3.5% respectively.
2) For learning attention, even without taking the visual features (Vis.), i.e., the feature itself, as part of the input, the proposed global relation representation alone (RGA-S w/o Vis. or RGA-C w/o Vis.) significantly outperforms the baseline, e.g., by 4.4% or 4.3% in Rank-1 accuracy on CUHK03.
3) For learning attention, without taking the proposed global relation (Rel.) as part of the input, the scheme RGA-S w/o Rel. or RGA-C w/o Rel. outperforms the baseline, but is inferior to our scheme RGA-S or RGA-C by 2.5% or 1.2% in Rank-1 accuracy on CUHK03.
4) The combination of the spatial RGA and the channel RGA achieves the best performance. We study three ways of combination: parallel with a fusion (RGA-S//C), sequential spatial-channel (RGA-SC), and sequential channel-spatial (RGA-CS). The sequential spatial-channel RGA-SC achieves the best performance, 1.8% and 1.6% higher than RGA-S and RGA-C, respectively, in mAP accuracy on CUHK03. Parallel optimization is more difficult than the sequential one.
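Table 1. RGA-related models versus the baseline (R1 and mAP in %).

| Attention | Model | CUHK03 (L) R1 | CUHK03 (L) mAP | Market1501 R1 | Market1501 mAP |
|---|---|---|---|---|---|
| Spatial | RGA-S w/o Rel. | 76.8 | 72.3 | 94.3 | 83.8 |
| Spatial | RGA-S w/o Vis. | 78.2 | 74.0 | 95.4 | 86.7 |
| Channel | RGA-C w/o Rel. | 77.8 | 73.7 | 94.7 | 84.8 |
| Channel | RGA-C w/o Vis. | 78.1 | 74.9 | 95.4 | 87.1 |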
RGA versus Other Attention Mechanisms. Table 2 compares the performance of our attention modules with other state-of-the-art attention designs. For fairness of comparison, we re-implement their attention designs on top of our baseline. 1) Channel attention. There are several channel attention designs. The Squeeze-and-Excitation module (SE) [hu2018squeeze] uses spatially global average-pooled features to compute channel-wise attention, using two fully connected (FC) layers with a non-linearity. In comparison with SE, our RGA-C achieves 2.7% and 3.0% gains in Rank-1 and mAP accuracy on CUHK03. CBAM-C [woo2018cbam] is similar to SE [hu2018squeeze] but additionally uses global max-pooled features. Similarly, FC-C [liu2018spatial] uses an FC layer over spatially average-pooled features. Before the pooling, the features are further embedded through convolutions. Thanks to the exploration of pairwise relations, our scheme RGA-C (Ours) outperforms FC-C [liu2018spatial] and SE [hu2018squeeze], which also use global information, by 1.6% and 2.7% in Rank-1 accuracy on CUHK03. On Market1501, even though the accuracy is already very high, our scheme still outperforms the others. 2) Spatial attention. For spatial attention designs, CBAM-S [woo2018cbam] uses a large filter size of $7 \times 7$ to learn attention, while FC-S [liu2018spatial] uses full connections over the spatial feature maps. Our scheme achieves the best performance, which is 2% better than the others in Rank-1 accuracy on CUHK03. 3) Spatial and channel attention. When both spatial and channel attention are utilized, our models consistently outperform both the channel attention and the spatial attention alone.
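Table 2. RGA versus other attention mechanisms (R1 and mAP in %).

| Attention | Model | CUHK03 (L) R1 | CUHK03 (L) mAP | Market1501 R1 | Market1501 mAP |
|---|---|---|---|---|---|
| Spatial | CBAM-S [woo2018cbam] | 77.3 | 72.8 | 94.8 | 85.6 |
| Spatial | RGA-S (Ours) | 79.3 | 74.7 | 95.4 | 86.8 |
| Channel | SE [hu2018squeeze] | 76.3 | 71.9 | 95.2 | 86.0 |
| Channel | RGA-C (Ours) | 79.0 | 74.9 | 95.4 | 87.2 |
| Spatial and channel | RGA-SC (Ours) | 80.4 | 76.5 | 95.8 | 88.1 |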
RGA versus Non-local Blocks. Both the non-local neural network [wang2018non] and our RGA utilize local and non-local pairwise relations, but with rather different purposes. The non-local block functions in a way similar to non-local means [wang2018non]: it computes the response at a position as a weighted sum of the features at all positions, with the pairwise relation, in terms of similarity/affinity, used as the weight. In contrast, we leverage the collection of pairwise relations to represent the overall relation between the current position’s feature and all the positions’ features for inferring the attention. The two modules are complementary in concept and function. We experimentally demonstrate this and show the results in Table 3. Non-local denotes the scheme obtained by integrating non-local blocks [wang2018non] into our baseline, and Non-local + RGA-S denotes that our spatial RGA modules are also integrated after the non-local blocks. On top of the non-local networks, the introduction of spatial-wise relation-aware global attention further improves the performance significantly, e.g., by 4.0% and 4.2% in Rank-1 and mAP accuracy, respectively, on CUHK03.
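Table 3. RGA on top of non-local blocks (R1 and mAP in %).

| Model | CUHK03 (L) R1 | CUHK03 (L) mAP | Market1501 R1 | Market1501 mAP |
|---|---|---|---|---|
| Non-local + RGA-S | 80.6 | 76.8 | 95.6 | 88.0 |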
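The different uses of the same affinity matrix can be contrasted in a few lines of NumPy. This is a simplified sketch under assumptions: random projections `A` and `B` stand in for learned embeddings, the non-local branch uses softmax normalization and omits the value embedding and the residual connection, and the RGA branch only builds the input from which attention would be inferred.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, k = 6, 4, 3
X = rng.normal(size=(N, d))                              # one row per position
A, B = rng.normal(size=(d, k)), rng.normal(size=(d, k))  # assumed embeddings
R = (X @ A) @ (X @ B).T                                  # (N, N) pairwise affinities

# Non-local style: affinities are weights of a weighted sum over all features.
weights = softmax(R, axis=1)
nonlocal_out = weights @ X                               # (N, d) refined features

# RGA style: the affinities of node i become part of the feature used to infer
# a scalar attention value for node i (the inference layers are omitted here).
rga_in = np.concatenate([X, R, R.T], axis=1)             # (N, d + 2N)

print(nonlocal_out.shape, rga_in.shape)  # (6, 4) (6, 16)
```

The first branch replaces each feature by a mixture of all features, while the second keeps the feature and only rescales it, which is one way to see why the two can be stacked.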
Which ConvBlock to Add RGA-SC? We compare the cases of adding the RGA-SC module to different residual blocks. RGA-SC brings a gain on each residual block, and adding it to all blocks performs best. Please refer to the supplementary material for more details.