Deep learning based on local convolutions has achieved great success in many fields. Recently, many studies have shown that modeling long-range dependencies yields consistent improvements in computer vision tasks such as semantic segmentation [huang2019ccnet, song2019learnable], object detection [wang2018non], and image retrieval [ng2020solar]. Long-range dependencies, which carry global contextual information, are usually captured either by stacking local convolutional operations [he2016deep, chen2017deeplab] or by non-local blocks [huang2019ccnet, song2019learnable, wang2018non].
Conventional convolutions operate on data in a local window, and many methods enlarge receptive fields by repeating local convolutional operations to capture long-range dependencies [simonyan2014very, szegedy2015going]. These operations are inefficient at modeling long-range dependencies and may encounter optimization difficulties [he2016deep]. To address these problems, several non-local operations have been proposed, such as Non-local Networks [wang2018non], CCNet [huang2019ccnet], and GCNet [cao2019gcnet]. In these methods, each spatial position in a feature map is regarded as a node, and each node receives support from all other nodes according to a weight function. The weight function is usually a dot-product similarity between the central node and the other nodes. It therefore considers only similarity and ignores the proximity between two nodes. Without spatial distances, object shapes cannot be modeled, although shape-awareness has been shown to be useful in many computer vision tasks [dai2017deformable, zhu2019deformable, luo2020aslfeat]. Tree filters [yang2014stereo, song2019learnable] have been adopted to preserve object structures: minimum spanning trees are built on low-level feature maps, and global context aggregation is then performed on high-level feature maps. However, the efficiency of spanning-tree-based methods can still be improved.
In this paper, we propose a Semi-Global Shape-aware Network (SGSNet) to address the above-mentioned problems. To consider both feature similarity and proximity, and thereby preserve object shapes when modeling long-range dependencies, we aggregate global contextual information for each position in a hierarchical way. In the first level, each position in the feature map aggregates contextual information only in the vertical and horizontal directions. The result of each level is then fed into the next level, which performs the same operations. In this hierarchical way, each central position gains support from all other positions, and the combination of similarity and proximity makes each position gain support mostly from the same semantic object. We also propose an efficient algorithm that reduces the computational complexity of the brute force implementation to $O(n)$, where $n$ is the number of positions in the feature map. In contrast, the computational complexities of Non-local Neural Networks [wang2018non] and CCNet [huang2019ccnet] are $O(n^2)$ and $O(n\sqrt{n})$, respectively. The computational complexity of [song2019learnable] is also $O(n)$, but it needs extra time and computation to build minimum spanning trees.
The differences between our method and CCNet are illustrated in Fig. 1. In CCNet, the weight between two nodes depends only on the similarity between them. In our method, the weight between two nodes considers both similarity and proximity, which preserves object shapes when capturing long-range dependencies. The details are described in Sec. 3.
In summary, our main contributions are:
We propose SGSNet, which considers both feature similarity and geometric proximity to preserve object shapes when modeling long-range dependencies.
For practical use, we propose an efficient algorithm for the aggregation of contextual information that reduces the computational complexity to linear in the number of positions in the feature map. Thus SGSNet can be conveniently plugged into existing convolutional neural networks.
Experiments on different computer vision tasks (semantic segmentation and image retrieval) show that adding SGSNet to existing deep neural networks achieves higher accuracy while requiring less computation.
2 Related Work
Self-attention [vaswani2017attention] was initially applied to machine translation. Non-local Networks [wang2018non] bridged self-attention modules in machine translation and non-local filtering operations in computer vision. Many methods have since improved the weight function of the self-attention module to learn discriminative feature representations. Non-local Networks [wang2018non] discussed four choices of weight function: Gaussian, embedded Gaussian, dot product, and concatenation. To lower the computational complexity of Non-local Networks, CCNet [huang2019ccnet] replaced the dense attention with recurrent sparse criss-cross attention modules. With two consecutive criss-cross attention modules, every node can collect contextual information from all nodes in the feature map. This reduces the computational complexity of Non-local Networks to $O(n\sqrt{n})$ for a feature map with $n$ nodes and is therefore more efficient, but object shapes are still not taken into account. Song et al. [song2019learnable] first built minimum spanning trees (MSTs) on low-level feature maps and then computed feature similarity on high-level semantics to preserve object structures. However, building the minimum spanning trees requires extra time and computation. In this work, we consider both spatial distances and feature similarity when capturing long-range dependencies, which yields higher accuracy while maintaining a low computational complexity.
2.2 Semantic Segmentation
Semantic segmentation is an essential and challenging task in the computer vision community. Methods based on Convolutional Neural Networks (CNNs) have made significant progress in the past few years. Long et al. [long2015fully] applied fully convolutional networks (FCNs) to semantic segmentation. Later, researchers found that FCNs were limited by their receptive fields due to fixed geometric structures. U-Net [ronneberger2015u] and DeepLabv3+ [chen2018encoder] used encoder-decoder structures to combine high-level and low-level features for semantic segmentation. Chen et al. [chen2017deeplab] adopted atrous convolutions, which effectively enlarge receptive fields when aggregating contextual information. They also proposed atrous spatial pyramid pooling (ASPP) to perform segmentation at multiple scales. Zhao et al. [zhao2017pyramid] exploited global context information with pyramid pooling modules to achieve better performance. PSANet [zhao2018psanet] relaxed the local neighborhood constraint through a self-adaptively learned attention mask. Peng et al. [peng2017large] found that large convolutional kernels are also important for dense per-pixel prediction and proposed the Global Convolutional Network (GCN). Unlike these methods, we aggregate global context information in a self-attention manner.
2.3 Image Retrieval
Traditionally, bag-of-visual-words [philbin2007object], VLAD [arandjelovic2013all], and Fisher vectors [perronnin2010large] have been used to aggregate a set of handcrafted local features [lowe2004distinctive] into a single global vector that represents an image. Recently, many methods have replaced the handcrafted features with learned counterparts and then aggregated the learned features with techniques similar to the traditional ones [arandjelovic2016netvlad, liu2019stochastic, ge2020self]. Some studies show that directly substituting a pooling operation [radenovic2018fine] for the aggregation process gives comparable performance. We follow [ng2020solar] and call these pooling-based methods global single-pass methods, because they do not explicitly separate the extraction and aggregation steps. Radenović et al. [radenovic2018fine] proposed GeM pooling, which generalizes average and max pooling and obtains excellent results. Based on GeM, SOLAR [ng2020solar] employed second-order similarity and attention for image retrieval and obtained significant performance improvements. In this paper, we also explore the proposed SGSNet components for image retrieval.
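To make the statement that GeM generalizes average and max pooling concrete, the following is a minimal NumPy sketch of generalized-mean pooling over the spatial dimensions. The function name `gem_pool`, the default exponent `p=3.0`, and the clamping constant `eps` are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized-mean pooling of a (C, H, W) feature map into a (C,) vector.

    p = 1 gives average pooling; as p grows, the result approaches max pooling.
    The default p and eps are illustrative assumptions.
    """
    x = np.clip(x, eps, None)  # GeM is defined for positive activations
    return np.mean(x ** p, axis=(-2, -1)) ** (1.0 / p)

# Tiny example: one channel with values 1/12, 2/12, ..., 12/12.
feat = (np.arange(1, 13, dtype=float) / 12.0).reshape(1, 3, 4)
avg_like = gem_pool(feat, p=1.0)    # equals the spatial mean
max_like = gem_pool(feat, p=100.0)  # close to the spatial max
```

With `p` treated as a learnable scalar, the same expression interpolates between the two classical pooling operators.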
3 Semi-Global Shape-aware Network
In this section, we first introduce preliminary formulations that consider feature similarity and geometric proximity to preserve object shapes when modeling long-range dependencies. We then present the proposed Semi-Global Shape-aware Network. Finally, we propose a linear time algorithm for implementing the network.
3.1 Preliminary Formulations
First, a given feature map in CNNs can be seen as a connected, undirected graph . The nodes are all spatial positions in and the edges with weights are all connections between two neighbor nodes. Next, we define a weight function between different nodes. Let us consider a simple case: given a pair of neighbor nodes and , the weight between and is defined as
where is Euclidean distance between feature vectors of and . When nodes and are not neighbors, the weight function is given by
where and are neighbor nodes in the shortest path between and . We denote the feature map after aggregating contextual information as . The aggregation function which takes feature similarity and geometric proximity into account simultaneously is:
where and denotes the feature vector at in and respectively, is all nodes in the graph , the function means feature transformation and the weight is normalized by .
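For concreteness, one form that is consistent with the description above can be written as follows. It is only a sketch under assumed notation: the node indices $i$ and $j$, the feature vectors $\mathbf{x}_i$ and $\mathbf{y}_i$, the path set $P(i,j)$, the node set $V$, the transformation $f$, the normalizer $z_i$, and the exponential kernel (borrowed from tree-filter formulations such as [yang2014stereo]) are all assumptions rather than the exact definitions.

```latex
% Assumed notation: x_i is the input feature at node i, y_i the aggregated
% output, and P(i,j) the set of neighboring pairs on the shortest path
% between i and j. The exponential kernel is an assumption.
\begin{align}
  w_{i,j} &= \exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2\right)
          && \text{if $i$ and $j$ are neighbors} \\
  w_{i,j} &= \prod_{(m,n) \in P(i,j)} w_{m,n}
          && \text{otherwise} \\
  \mathbf{y}_i &= \frac{1}{z_i} \sum_{j \in V} w_{i,j}\, f(\mathbf{x}_j),
  \qquad z_i = \sum_{j \in V} w_{i,j}
\end{align}
```

Under this assumed form, every factor on the path is at most one, so the weight shrinks both when neighboring features on the path are dissimilar and when the path becomes longer, which is how similarity and proximity enter together.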
In the above aggregation process, a problem is that there may be many paths between two arbitrary nodes, and it is time-consuming to find the shortest path for all pairs of nodes. In [yang2014stereo, song2019learnable], the authors adopt a spanning tree to remove "unwanted" edges from the four-connected planar graph, so that all nodes are connected by a minimum spanning tree. The minimum spanning tree can enlarge the geometric distance between two neighboring nodes if these two nodes are dissimilar in appearance. As a result, low support weights are assigned between these two nodes, which reduces robustness to textures [yang2014stereo]. Besides, building and traversing a minimum spanning tree costs extra time and computation. Inspired by [hirschmuller2007stereo, huang2019ccnet], we adopt a semi-global manner to overcome this difficulty: a single block considers the support from nodes in the horizontal and vertical directions, and multiple blocks are then stacked hierarchically to capture full-image dependencies. The details are described in Sec. 3.2.
3.2 A Semi-Global Block
Due to the difficulty of minimizing matching costs in a 2D image, SGM [hirschmuller2007stereo] utilizes a semi-global approach to aggregate the matching costs along multiple 1D directions. CCNet [huang2019ccnet] just collects contextual information in a criss-cross path. In this paper, we establish a semi-global block, as shown in Fig. 2 (a). Given an input feature map with channels , width and height , the block first utilizes a 1 1 convolution on to reduce . The resulting feature map is denoted as with . We compute an attention map by a weight function on . Channels at a position of correspond to the weights between this position and other positions in the same column or row respectively. The block also applies another 1 1 convolution on and the resulting feature map has the same size as . For each position in the spatial dimension of , we can obtain a feature vector . We collect the feature vectors of all positions in horizontal and vertical directions of on , resulting in a matrix . We denote the output feature map of the block as , and the feature vector at position on by (3) is:
where is the index of channels of , denotes the value at the channel in position on . denotes the feature vector in , and denotes the feature vector at position on .
It is worth noting that our weight function is quite different from that of [huang2019ccnet]. The aggregation process is shown in Fig. 2 (b), where the weight between two nodes depends on all nodes on the path connecting them. The path is restricted to the horizontal and vertical directions. Our weight function (2) thus considers feature similarity and geometric proximity simultaneously, as described in Sec. 3.1. Therefore, our method preserves object shapes more clearly than [huang2019ccnet], as will be shown in Sec. 4.1.4. To adjust the feature similarity between two nodes, we further add two learnable parameters to (2):
where , and are neighbor nodes in horizontal or vertical path connecting and . Considering that width and height of feature maps may be different, we use and for horizontal and vertical directions respectively. If () is smaller, feature similarity plays a more important role in (5), and vice versa. Therefore, the learnable parameter () can adjust the relation of feature similarity and geometric proximity adaptively.
3.3 Hierarchical Semi-Global Blocks
A single semi-global block can only aggregate contextual information from the row and column of a central node and ignores all other directions. We address this problem with multiple hierarchical semi-global blocks. In the first hierarchical level, the input feature map is fed into a semi-global block, which produces an intermediate feature map. In the second hierarchical level, another semi-global block takes this intermediate feature map as input and produces the output feature map. Thus, every node in the output can aggregate contextual information from all nodes in the input. Each level computes its own attention map. Fig. 3 illustrates how contextual information spreads from a source node (in purple) to a target node that lies in neither the same row nor the same column. In the first semi-global block, only nodes in the same row or column as the source node receive its contextual information. In the second semi-global block, the target node receives contextual information from either of the two nodes located where its own row and column cross the column and row of the source node, both of which have already aggregated the contextual information of the source node in the first block. Thus, the target node obtains the contextual information of the source node.
More generally, every node can capture full-image contextual information by the hierarchical semi-global blocks, which can enhance the capability of a single semi-global block for modeling long-range dependencies [huang2019ccnet].
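The claim that two stacked blocks reach the full image can be checked with a small reachability sketch. It tracks only which positions have received information from a source position and ignores the actual weights; the function name `criss_cross_reach` and the grid size are illustrative.

```python
import numpy as np

def criss_cross_reach(mask):
    """Given a boolean (H, W) mask of positions that already hold information
    from the source, return the positions that hold it after one more block,
    i.e. every position sharing a row or a column with a marked position."""
    rows = mask.any(axis=1, keepdims=True)  # (H, 1): rows containing a marked position
    cols = mask.any(axis=0, keepdims=True)  # (1, W): columns containing a marked position
    return rows | cols                      # broadcasts to (H, W)

source = np.zeros((4, 6), dtype=bool)
source[1, 2] = True

after_one = criss_cross_reach(source)     # row 1 and column 2 only
after_two = criss_cross_reach(after_one)  # every position
```

After the first block the marked set is a cross of H + W - 1 positions; since that cross touches every row and every column, the second block marks the whole map.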
3.4 A Linear Time Algorithm for Efficient Implementation
For a clearer explanation, we ignore 1 1 convolution and in Fig. 2, which does not influence our conclusion. In a semi-global block, one node is supposed to aggregate contextual information of those in vertical and horizontal directions. According to (4), the total computational complexity of a single semi-global block is by the brute force implementation. Specifically, given an input feature map with nodes, every node needs to aggregate contextual information from other nodes in the same row and column. Thus, we need computation times of similarity for every node. The total computation is , which is unfavorable for practical applications.
Noticing that there are extensive repeated computations in the brute force implementation, we propose an optimized alternative that reduces the computational complexity of a single semi-global block to $O(n)$, i.e., linear in the number of nodes $n$. We regard each row and each column of the given feature map as a binary tree and complete the interactions between any two nodes on the binary tree in time linear in the length of that row or column. Thus, the total computational complexity over all columns and rows is linear in the number of nodes. Specifically, let us consider the operations on one row or column, as shown in Fig. 4. We arbitrarily select one node in the given row or column as the root node to establish a binary tree. Except for the root node and the leaf nodes, every node in the binary tree has one parent node and one child node. Every node is supposed to capture the contextual information of all other nodes in this binary tree. The process of capturing contextual information is split into aggregation and updating steps:
We aggregate contextual information from leaf nodes towards the root node recursively according to
where is the input feature vector at node , means is the parent node of . is the defined weight function between and in (5).
We update features from the root node towards leaf nodes recursively by
where is computed by (6).
The aggregation and updating steps each take time linear in the number of nodes of the tree, i.e., in the height of the feature map for a column and in its width for a row. Thus, the total computational complexity for a single binary tree is also linear.
Operations on the other columns and rows are similar. We do not perform the updating steps until the aggregation steps are completed for all binary trees. The aggregation and updating steps of each binary tree are thus independent of those of the others and can be accelerated in parallel on GPUs. Each node belongs to two binary trees, one for its row and one for its column, as shown in Fig. 4. We add the updating results of the node in these two trees to obtain its feature vector in the output feature map:
where is the feature vector at in , and has contained contextual information in horizontal and vertical directions of in .
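The two-pass procedure above can be sketched for a single row or column as follows. This is a minimal NumPy illustration, not the reference implementation: it assumes that the weight between non-neighboring nodes is the product of the neighbor weights along the path, uses an assumed exponential neighbor weight with a hypothetical scale `sigma`, and applies the standard leaf-to-root and root-to-leaf recursions of tree filtering. A brute force version is included only to check the result.

```python
import numpy as np

def neighbor_weights(x, sigma=1.0):
    """Assumed neighbor weight: exp(-||x_k - x_{k+1}|| / sigma) for a (L, C) sequence."""
    d = np.linalg.norm(x[1:] - x[:-1], axis=1)
    return np.exp(-d / sigma)

def aggregate_chain(x, w, root=0):
    """Linear-time aggregation on one row or column rooted at index `root`.

    x: (L, C) features, w: (L-1,) weights with w[k] linking nodes k and k+1.
    Returns out[i] = sum_j (product of w on the path between i and j) * x[j].
    """
    L = x.shape[0]
    up = x.astype(float)  # astype returns a copy
    # Aggregation step: leaves towards the root.
    for k in range(0, root):            # left branch, parent of k is k + 1
        up[k + 1] += w[k] * up[k]
    for k in range(L - 1, root, -1):    # right branch, parent of k is k - 1
        up[k - 1] += w[k - 1] * up[k]
    # Updating step: root towards the leaves.
    out = up.copy()
    for k in range(root - 1, -1, -1):   # parent is k + 1
        out[k] = up[k] + w[k] * (out[k + 1] - w[k] * up[k])
    for k in range(root + 1, L):        # parent is k - 1
        out[k] = up[k] + w[k - 1] * (out[k - 1] - w[k - 1] * up[k])
    return out

def aggregate_brute(x, w):
    """Quadratic reference that recomputes every path product."""
    L = x.shape[0]
    out = np.zeros(x.shape, dtype=float)
    for i in range(L):
        for j in range(L):
            lo, hi = min(i, j), max(i, j)
            out[i] += np.prod(w[lo:hi]) * x[j]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 3))
w = neighbor_weights(x)
fast = aggregate_chain(x, w, root=3)
slow = aggregate_brute(x, w)
```

Each node is visited a constant number of times, so the cost is linear in the length of the row or column, and the result does not depend on which node is chosen as the root. Normalized weights can be obtained by running the same routine on a vector of ones and dividing.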
4 Experiments
In this section, we evaluate SGSNet on semantic segmentation and image retrieval. We first conduct extensive experiments on semantic segmentation to understand the behavior of SGSNet, and then extend SGSNet to image retrieval to demonstrate the generality of SGSNet.
4.1 Experiments on Semantic Segmentation
We adopt the Cityscapes [cordts2016cityscapes] benchmark for semantic segmentation and report mean IoU (the mean of the class-wise intersection over union). Cityscapes is a semantic segmentation benchmark focusing on urban street scenes. It contains 5,000 finely annotated images, which are divided into 2,975, 500, and 1,525 images for training, validation, and testing, respectively. A larger set of 20,000 coarsely annotated images is also provided to support methods that exploit large volumes of weakly labeled data. The dataset defines 30 visual classes, of which 19 are used in our experiments.
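As an illustration of the metric, mean IoU can be computed from a confusion matrix as in the sketch below. The function name `mean_iou` and the `ignore_index=255` convention for unlabeled pixels are assumptions made for this example, and classes that appear in neither the prediction nor the ground truth are skipped when averaging.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean of class-wise intersection over union for integer label arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index  # drop unlabeled pixels (assumed convention)
    idx = target[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)  # rows: ground truth, cols: prediction
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    return float((inter[present] / union[present]).mean())

# Two classes, one ignored pixel: each class has IoU 1/2.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 255], num_classes=2)
```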
Based on the proposed linear time algorithm, SGSNet can be conveniently plugged into any deep neural network to capture long-range dependencies. We adopt ResNet-101 [he2016deep] (pre-trained on ImageNet) as our backbone with minor changes. The last two down-sampling layers are dropped and dilated convolutions are embedded into the subsequent convolutional layers, similar to [huang2019ccnet].
We use mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. A poly learning rate schedule with an initial learning rate of 0.01 and a power of 0.9 is employed. We augment the training set by random scaling (0.75 to 2.0) and then crop patches of $769 \times 769$ pixels as the network input. The two learnable parameters in (5) are both initialized to 1. The number of channels of the reduced feature map is one eighth of that of the input feature map, and the hierarchical semi-global blocks share their parameters.
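The poly learning rate schedule mentioned above is commonly written as the base rate multiplied by (1 - iteration / total iterations) raised to the power. A minimal sketch follows; the function name `poly_lr` and the total number of iterations are hypothetical, since the training length is not stated here.

```python
def poly_lr(iteration, max_iterations, base_lr=0.01, power=0.9):
    """Poly schedule: base_lr * (1 - iteration / max_iterations) ** power."""
    progress = min(max(iteration / max_iterations, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** power

# max_iterations below is a placeholder value, not a setting from the paper.
schedule = [poly_lr(it, max_iterations=100) for it in (0, 50, 100)]
```

The rate starts at the initial value, decays almost linearly for a power close to 1, and reaches zero at the final iteration.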
4.1.1 Comparisons with Other Methods
We first train SGSNet on the Cityscapes training set and present results on the Cityscapes validation set in Tab. 1. Our method achieves better results than the other methods, even though DeepLabv3+ [chen2018encoder] and DPC [chen2018searching] use a more powerful backbone.
We further compare our method with existing methods on the Cityscapes testing set, as shown in Tab. 2. SGSNet is trained with only finely annotated data, and the test results are submitted to the official evaluation server. Tab. 2 shows that SGSNet outperforms the other methods regardless of whether they employ the same backbone as ours or a stronger one [yuan2018ocnet, yang2018denseaspp]. Although CCNet freezes batch normalization (BN) layers and finetunes the model with a low learning rate after a certain number of training iterations, and we do not use any of those tricks, our method is still better than CCNet. This demonstrates the importance of considering both feature similarity and geometric proximity to preserve object shapes when capturing long-range dependencies. Furthermore, our SGSNet is more efficient, as elaborated in Sec. 4.1.2.
4.1.2 Analysis of Computational Complexity
Based on the linear time algorithm described in Sec. 3.4, the computational complexity of SGSNet is $O(n)$, which is linearly proportional to the number of nodes $n$ in a feature map. We explore the computation cost and the number of parameters of SGSNet on the Cityscapes validation set. We again use ResNet-101 as the backbone with an input size of $769 \times 769$ pixels, so the input feature maps of the hierarchical semi-global blocks have a size of $97 \times 97$. The baseline network is ResNet-101 with minor changes, where dilated convolutional layers are adopted at stages 4 and 5. As shown in Tab. 3, SGSNet improves the performance by 5.8% mIoU over the baseline with an overhead of 0.295M additional parameters and 11.4G FLOPs. We also list the GFLOPs and parameters of Non-local and CCNet in Tab. 3. SGSNet achieves higher performance while using fewer parameters and FLOPs than CCNet. Non-local uses slightly fewer parameters than SGSNet, but its additional FLOPs are far greater (108G vs. 11.4G). SGSNet also increases mIoU by 1.8% compared with Non-local. Therefore, SGSNet is more efficient and effective than the other two.
|Method||GFLOPs ()||Params (M )||mIoU(%)|
Comparisons of SGSNet with CCNet and Non-local. The increments of FLOPs and parameters are estimated for an input of $1 \times 3 \times 769 \times 769$ pixels.
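As rough arithmetic behind these sizes, the sketch below derives the feature map size from the input size and counts pairwise interactions under three assumed cost models: all pairs for dense non-local attention, the row and column of each node for criss-cross attention, and a constant amount of work per node for linear-time aggregation. The output stride of 8, the function name `feature_size`, and the cost models are assumptions, and channel dimensions and constant factors are ignored, so the numbers are not FLOPs.

```python
def feature_size(input_size, output_stride=8):
    """Spatial size after down-sampling by `output_stride` (assumed to be 8 here)."""
    return (input_size - 1) // output_stride + 1

H = W = feature_size(769)  # 97
n = H * W                  # 9409 nodes

interactions = {
    "dense non-local": n * n,        # every node attends to every node
    "criss-cross": n * (H + W - 1),  # every node attends to its row and column
    "linear-time": n,                # constant work per node
}
```

For this feature map size the three counts are 88,529,281, 1,815,937, and 9,409, which shows how quickly the gap between the cost models grows with resolution.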
4.1.3 Ablation Studies
To further understand the behavior of SGSNet, we conduct ablation studies on the Cityscapes validation set. The increments of FLOPs and parameters for different numbers of hierarchical semi-global blocks are listed in Tab. 4. Adding one semi-global block increases mIoU by 4.4% compared with the baseline network, which indicates the importance of the semi-global block. Adding two semi-global blocks hierarchically increases the performance by another 1.4% mIoU. These results demonstrate that the proposed hierarchical semi-global blocks can capture full-image contextual information and significantly improve the performance. We believe that the performance can be further improved by adding more semi-global blocks hierarchically. Tab. 4 also shows that a single semi-global block adds only 5.70G FLOPs and 0.295M parameters. The number of parameters does not increase when the second block is added because the blocks share parameters. Qualitative results are given in Fig. 5. The areas (truck, ground, and fence) in red circles in the first column are easily misclassified. They cannot be classified correctly by adding just one semi-global block, as shown in the second column of Fig. 5. When we add two semi-global blocks hierarchically, those areas are rectified, as shown in the third column, which demonstrates the advantage of hierarchical semi-global blocks.
We also explore the influence of the two learnable parameters in (5). We use two semi-global blocks and remove both parameters from (5). The results are listed in the fourth row of Tab. 4. The performance drops by 0.7% mIoU (compared with the last row of Tab. 4) when these parameters are absent.
|Hierarchical||GFLOPs ()||Params (M )||mIoU(%)|
4.1.4 Visualization of Attention Module
To further show what SGSNet has learned, we visualize the learned attention maps of SGSNet in the last column of Fig. 6. For each image in the left column, we choose a specific position (marked by a red cross) and show its corresponding SGSNet attention map in the right column. The images in the middle column are the corresponding attention maps of CCNet. Both SGSNet and CCNet use two blocks. Compared with CCNet, SGSNet captures clear semantic similarity and learns object shapes simultaneously. For example, in the first image, the marked position on the truck obtains almost all of its high responses from positions on the truck, and the outline of the truck is clearly visible in the corresponding attention map of SGSNet. Similar phenomena can be seen at the marked positions (on the building and the traffic sign) of the second and third images. Areas within the same object are activated and have high responses, which means that the proposed SGSNet is aware of object shapes when capturing long-range dependencies.
To classify a given ambiguous pixel, humans usually look around the pixel, rather than farther pixels, to look for contextual cues that help classify the given pixel correctly. From this perspective, SGSNet is more similar to human behavior than CCNet.
4.2 Experiments on Image Retrieval
In this section, we investigate the performance of SGSNet on the large-scale image retrieval task. We train networks on Google Landmarks 18 (GL18) [teichmann2019detect] and test on the Revisited Oxford and Paris datasets [radenovic2018revisiting]. The GL18 dataset is based on the original Kaggle challenge dataset [noh2016image]. It contains more than 1.2 million photos collected from 15,000 landmarks covering a wide range of scenes, such as historic cities, metropolitan areas, and natural scenery. The Revisited Oxford and Paris datasets are frequently used to evaluate large-scale image retrieval methods. They improve the Oxford and Paris datasets by dropping wrong annotations and adding new images, resulting in 4,993 and 6,322 images for Revisited Oxford and Paris, respectively. According to difficulty level, the evaluation tasks are divided into three groups: easy, medium, and hard. In each task, the metrics are mean average precision (mAP) and mean precision at rank 10.
4.2.2 Comparisons with Other Methods
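The two metrics can be illustrated with the simplified sketch below, which takes a ranked list of binary relevance labels for one query. It computes plain non-interpolated average precision and precision at rank k, and it assumes that every relevant image appears in the ranked list. The official evaluation protocol of the Revisited datasets additionally treats some images as junk to be ignored, which is omitted here; the function names are illustrative.

```python
import numpy as np

def average_precision(ranked_relevance):
    """Non-interpolated AP for one query from a ranked list of 0/1 relevance labels."""
    rel = np.asarray(ranked_relevance, dtype=bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                 # relevant items seen up to each rank
    ranks = np.arange(1, len(rel) + 1)
    return float((hits[rel] / ranks[rel]).mean())

def precision_at_k(ranked_relevance, k=10):
    """Fraction of relevant items among the top k results."""
    rel = np.asarray(ranked_relevance, dtype=float)[:k]
    return float(rel.mean()) if rel.size else 0.0

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2.
ap = average_precision([1, 0, 1, 0])
p2 = precision_at_k([1, 0, 1, 0], k=2)
```

mAP and mean precision at rank 10 are then the averages of these per-query values over all queries.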
In this section, we compare SGSNet with Non-local and CCNet on image retrieval. We adopt ResNet101-GeM [radenovic2018fine] trained with the triplet loss and second-order similarity (SOS) loss [tian2019sosnet] as our baseline. ResNet101 [he2016deep] contains five fully-convolutional blocks conv1 to conv5_x. For fair comparison, we just add Non-local, CCNet and SGSNet after conv5_x respectively. Inside of Non-local, CCNet and SGSNet, the number of channels of an input feature map is reduced to
for efficient computations. We also do not use batch normalization and ReLU layers. Following [ng2020solar], we report the mAP of these methods on the medium and hard tasks of the Revisited Oxford and Paris datasets, as shown in Tab. 5.
Tab. 5 shows that adding SGSNet to the baseline significantly improves the accuracy of image retrieval. Compared with Non-local and CCNet, SGSNet achieves comparable results on the medium task but superior results on the hard task. The hard task contains many large viewpoint changes and significant occlusions, which influence the aggregation of contextual information. SGSNet captures most of its contextual information around the central nodes, so large viewpoint changes and occlusions have relatively less impact on it. Overall, SGSNet achieves better results. Moreover, SGSNet has a lower computation cost, as shown in Sec. 4.1.2.
5 Conclusion
In this work, we propose a semi-global shape-aware network (SGSNet) that considers both feature similarity and geometric proximity to preserve object shapes when capturing long-range dependencies. In a single block, each position in the feature map captures contextual information in the horizontal and vertical directions according to both similarity and proximity; full-image contextual information is then harvested by adding more semi-global blocks hierarchically. In addition, each row and each column of the feature map is regarded as a binary tree, and based on the structure of these binary trees we present a linear time algorithm that further improves the efficiency of SGSNet. Extensive ablation studies have been conducted to better understand the proposed method. We also show the superiority of SGSNet on semantic segmentation and image retrieval. In the future, we will explore SGSNet in more vision tasks.