1 Introduction
Deep learning, based on local convolutions, has achieved great success in many fields. Recently, many studies show that modeling long-range dependencies attains consistent improvements in many computer vision tasks, such as semantic segmentation [huang2019ccnet, song2019learnable], object detection [wang2018non], and image retrieval [ng2020solar]. Long-range dependencies, which carry global contextual information, are usually captured by stacking local convolutional operations [he2016deep, chen2017deeplab] or by non-local blocks [huang2019ccnet, song2019learnable, wang2018non].
Conventional convolutions operate on data within a local window, and many methods enlarge receptive fields by repeating local convolutional operations to capture long-range dependencies [simonyan2014very, szegedy2015going]. These operations are inefficient for modeling long-range dependencies and may encounter optimization difficulties [he2016deep]. Considering these problems, several non-local operations have been proposed to model long-range dependencies, such as Non-local Networks [wang2018non], CCNet [huang2019ccnet], and GCNet [cao2019gcnet]. In these methods, each spatial position in a feature map is seen as a node, and each node gains support from all other nodes according to a weight function, usually a dot-product similarity between the central node and the other nodes. This scheme considers only similarity and ignores the proximity between two nodes when capturing long-range dependencies. Because spatial distances are absent when capturing long-range dependencies, object shapes cannot be modeled. Yet shape-awareness has been confirmed to be useful in many computer vision tasks [dai2017deformable, zhu2019deformable, luo2020aslfeat]. Tree filters [yang2014stereo, song2019learnable] are adopted to preserve object structures: minimum spanning trees are established on low-level feature maps, and global context aggregation is then performed on high-level feature maps. The efficiency of spanning-tree-based methods, however, still leaves room for improvement.
In this paper, we propose a Semi-Global Shape-aware Network (SGSNet) to address the above-mentioned problems. In order to consider both feature similarity and proximity for preserving object shapes when modeling long-range dependencies, we aggregate global contextual information for each position in a hierarchical way. In the first level, each position in the whole feature map aggregates contextual information only in the vertical and horizontal directions. The result of the previous level is then fed into the next level for the same operations. In this hierarchical way, each central position gains support from all other positions, and the combination of similarity and proximity makes each position gain support mostly from the same semantic object. We also propose an efficient algorithm that reduces the computational complexity of the brute-force implementation to $\mathcal{O}(N)$, where $N = H \times W$ is the number of spatial positions. By contrast, the computational complexities of Non-local Neural Networks [wang2018non] and CCNet [huang2019ccnet] are $\mathcal{O}(N^2)$ and $\mathcal{O}(N\sqrt{N})$ respectively. The computational complexity of [song2019learnable] is also $\mathcal{O}(N)$, but it needs extra time and computation to establish minimum spanning trees. Differences between our method and CCNet are illustrated in Fig. 1. For CCNet, the weight between two nodes depends only on the similarity between them. In our method, the weight between two nodes considers both similarity and proximity, preserving object shapes when capturing long-range dependencies. The details are described in Sec. 3.
In summary, our main contributions are:

We propose SGSNet, which considers both feature similarity and geometric proximity to preserve object shapes when modeling long-range dependencies.

For practical usage, we propose an efficient algorithm for aggregating contextual information which reduces the computational complexity to $\mathcal{O}(N)$. Thus SGSNet can be conveniently plugged into existing convolutional neural networks.

Experiments on different computer vision tasks (semantic segmentation and image retrieval) show that adding SGSNet to existing deep neural networks achieves higher accuracy while taking less computation.
2 Related Work
2.1 Self-attention
Self-attention [vaswani2017attention] was initially applied in machine translation. Non-local Networks [wang2018non] bridged self-attention modules in machine translation to non-local filtering operations in computer vision. Many methods improve the weight function in the self-attention module to learn discriminative feature representations. Non-local Networks [wang2018non] discussed four choices of weight function: Gaussian, embedded Gaussian, dot product and concatenation. Considering the computational complexity of Non-local Networks, CCNet [huang2019ccnet] adopted recurrent sparse criss-cross attention modules to substitute for the dense attention in Non-local Networks. With two consecutive criss-cross attention modules, every node can collect contextual information from all nodes in the feature map. This reduces the computational complexity of Non-local Networks from $\mathcal{O}(N^2)$ to $\mathcal{O}(N\sqrt{N})$, making it more efficient, though object shapes are still not considered. Song et al. [song2019learnable] first built minimum spanning trees (MST) on low-level feature maps and then computed feature similarity on high-level semantics to preserve object structures. However, extra time and computation were needed to build the minimum spanning trees. In this work, we consider both spatial distances and feature similarity when capturing long-range dependencies, achieving higher accuracy while maintaining a low computational complexity.
2.2 Semantic Segmentation
Semantic segmentation is an essential and challenging task in the computer vision community. Methods based on Convolutional Neural Networks (CNNs) have made significant achievements in the past few years. Long et al. [long2015fully] applied fully convolutional networks (FCNs) to semantic segmentation. Later, researchers found that FCNs were limited by receptive fields due to their fixed geometric structures. U-Net [ronneberger2015u] and DeepLabv3+ [chen2018encoder] used encoder-decoder structures to combine high-level and low-level features for semantic segmentation. Chen et al. [chen2017deeplab] adopted atrous convolutions, which effectively enlarge receptive fields when aggregating contextual information; they further proposed atrous spatial pyramid pooling (ASPP) to perform segmentation at multiple scales. Zhao et al. [zhao2017pyramid] exploited global context information through pyramid pooling modules to achieve better performance. PSANet [zhao2018psanet] relaxed the local neighborhood constraint through a self-adaptively learned attention mask. Peng et al. [peng2017large] found that large convolutional kernels are also important when performing a dense per-pixel prediction task and proposed the Global Convolutional Network (GCN). Unlike the previous methods, we aggregate global context information in a self-attention manner.
2.3 Image Retrieval
Traditionally, bag-of-visual-words [philbin2007object], VLAD [arandjelovic2013all] and Fisher vectors [perronnin2010large] are used to aggregate a set of hand-crafted local features [lowe2004distinctive] into a single global vector representing an image. Recently, many methods attempt to replace hand-crafted features with learned counterparts and then aggregate the learned features with techniques similar to the traditional ones [arandjelovic2016netvlad, liu2019stochastic, ge2020self]. Some studies show that directly substituting a pooling operation [radenovic2018fine] for the aggregation process achieves comparable performance. Following [ng2020solar], we call these pooling-based methods global single-pass methods, since they do not explicitly separate the extraction and aggregation steps. Radenović et al. [radenovic2018fine] proposed GeM pooling, which generalizes average and max pooling and achieves excellent results. Based on GeM, SOLAR [ng2020solar] employed second-order similarity and attention for image retrieval and obtained significant performance improvements. In this paper, we also explore the proposed SGSNet for image retrieval.
3 Semi-Global Shape-aware Network
In this section, we first introduce preliminary formulations which consider feature similarity and geometric proximity for preserving object shapes when modeling long-range dependencies. Then we present the proposed Semi-Global Shape-aware Network. Finally, we propose a linear-time algorithm for implementing the network.
3.1 Preliminary Formulations
First, a given feature map $X$ in a CNN can be seen as a connected, undirected graph $G = (V, E)$. The nodes $V$ are all spatial positions in $X$, and the edges $E$ with weights are the connections between neighboring nodes. Next, we define a weight function between different nodes. Let us consider a simple case first: given a pair of neighboring nodes $u$ and $v$, the weight between $u$ and $v$ is defined as
$$w(u, v) = \exp\big(-d(u, v)\big), \qquad (1)$$
where $d(u, v)$ is the Euclidean distance between the feature vectors of $u$ and $v$. When nodes $u$ and $v$ are not neighbors, the weight function is given by
$$w(u, v) = \prod_{(p, q) \in P_{u,v}} w(p, q), \qquad (2)$$
where $p$ and $q$ are neighboring nodes on the shortest path $P_{u,v}$ between $u$ and $v$. We denote the feature map after aggregating contextual information as $Y$. The aggregation function, which takes feature similarity and geometric proximity into account simultaneously, is
$$y_i = \frac{1}{Z_i} \sum_{j \in V} w(i, j)\, g(x_j), \qquad (3)$$
where $y_i$ and $x_j$ denote the feature vectors at position $i$ in $Y$ and position $j$ in $X$ respectively, $V$ is the set of all nodes in the graph $G$, the function $g$ is a feature transformation, and the weight is normalized by $Z_i = \sum_{j \in V} w(i, j)$.
In the above aggregation process, a problem is that there may be many paths between two arbitrary nodes $u$ and $v$, and it is time-consuming to find the shortest path for all pairs of nodes. In [yang2014stereo, song2019learnable], the authors adopt a spanning tree to remove "unwanted" edges from the four-connected planar graph $G$, so that all nodes are connected by a minimum spanning tree. However, the minimum spanning tree can enlarge the geometric distance between two neighboring nodes if they are dissimilar in appearance. As a result, low support weights are assigned between such nodes, which causes less robustness to textures [yang2014stereo]. Besides, establishing and traversing a minimum spanning tree requires extra time and computation. Inspired by [hirschmuller2007stereo, huang2019ccnet], we adopt a semi-global manner to overcome this difficulty: we consider the support from nodes in the horizontal and vertical directions within a single block, and then add multiple blocks hierarchically to capture full-image dependencies. The details are described in Sec. 3.2.
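On a one-dimensional toy graph, the path-product weight of (2) and the normalized aggregation of (3) can be sketched in a few lines. This is an illustrative pure-Python sketch under our reading of the formulation; the feature values and the identity transformation $g$ are placeholders, not the paper's learned layers:

```python
import math

def edge_weight(f_u, f_v):
    # Eq. (1): the weight between neighbors decays with the Euclidean
    # distance between their feature vectors.
    return math.exp(-math.dist(f_u, f_v))

def path_weight(feats, u, v):
    # Eq. (2): on a chain graph the shortest path between u and v is
    # unique, and the weight is the product of neighbor weights.
    lo, hi = min(u, v), max(u, v)
    w = 1.0
    for p in range(lo, hi):
        w *= edge_weight(feats[p], feats[p + 1])
    return w

def aggregate(feats, g=lambda x: x):
    # Eq. (3): each node collects every other node's (transformed)
    # feature, weighted by similarity-and-proximity, then normalizes.
    out = []
    for i in range(len(feats)):
        weights = [path_weight(feats, i, j) for j in range(len(feats))]
        z = sum(weights)
        vec = [0.0] * len(feats[0])
        for j, w in enumerate(weights):
            gj = g(feats[j])
            for c in range(len(vec)):
                vec[c] += w * gj[c] / z
        out.append(vec)
    return out

feats = [[0.0], [0.1], [5.0], [5.1]]  # two "objects" on a 4-node chain
y = aggregate(feats)
# Nodes 0 and 1 support each other strongly; the large feature jump
# between nodes 1 and 2 suppresses support across the object boundary.
```

Note how the product form makes the boundary act as a barrier: any path crossing the dissimilar edge inherits its small factor, so support stays mostly within the same "object".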
3.2 A Semi-Global Block
Due to the difficulty of minimizing matching costs over a 2D image, SGM [hirschmuller2007stereo] utilizes a semi-global approach that aggregates matching costs along multiple 1D directions. CCNet [huang2019ccnet] collects contextual information only along a criss-cross path. In this paper, we establish a semi-global block, as shown in Fig. 2 (a). Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$ with $C$ channels, width $W$ and height $H$, the block first applies a $1 \times 1$ convolution on $X$ to reduce the channel dimension, producing a feature map $F \in \mathbb{R}^{C' \times H \times W}$ with $C' < C$. We compute an attention map $A \in \mathbb{R}^{(H + W - 1) \times H \times W}$ by a weight function on $F$. The channels at a position $i$ of $A$ correspond to the weights between this position and the other positions in the same column or row. The block also applies another $1 \times 1$ convolution on $X$, and the resulting feature map $V$ has the same size as $X$. For each position $i$ in the spatial dimension of $V$, we obtain a feature vector $V_i$. We collect the feature vectors of all positions in the horizontal and vertical directions of $i$ on $V$, resulting in a matrix $\Phi_i \in \mathbb{R}^{(H + W - 1) \times C}$. We denote the output feature map of the block as $H'$; the feature vector at position $i$ on $H'$ obtained by (3) is
$$H'_i = \sum_{k=0}^{H + W - 2} A_{i,k}\, \Phi_{i,k}, \qquad (4)$$
where $k$ is the channel index of $A$, $A_{i,k}$ denotes the value at the $k$-th channel of position $i$ on $A$, $\Phi_{i,k}$ denotes the $k$-th feature vector in $\Phi_i$, and $H'_i$ denotes the feature vector at position $i$ on $H'$.
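For a single spatial position, the gather in (4) amounts to a weighted sum over the $H + W - 1$ positions sharing its row or column. A minimal sketch of that gather (the nested-list grid representation and the pre-supplied weights are our assumptions for illustration; the real block operates on tensors with weights produced by the attention map):

```python
def semi_global_gather(V, A_row, A_col, i, j):
    """Weighted sum over row i and column j of a grid of feature vectors.

    V     : H x W grid of C-dimensional feature vectors (nested lists)
    A_row : attention weights over the W positions of row i
    A_col : attention weights over the H positions of column j
    """
    H, W, C = len(V), len(V[0]), len(V[0][0])
    out = [0.0] * C
    for jj in range(W):              # positions in the same row
        for c in range(C):
            out[c] += A_row[jj] * V[i][jj][c]
    for ii in range(H):              # positions in the same column
        if ii == i:
            continue                 # the center was already counted once
        for c in range(C):
            out[c] += A_col[ii] * V[ii][j][c]
    return out
```

In the actual block the $H + W - 1$ weights come from the attention map computed on the reduced feature map; here they are passed in directly to keep the sketch self-contained.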
It is worth noting that the weight function is quite different from that of [huang2019ccnet]. The aggregation process is shown in Fig. 2 (b), where the weight between nodes $u$ and $v$ depends on all nodes on the path connecting $u$ and $v$; the path is limited to the horizontal and vertical directions. Our weight function (2) thus considers feature similarity and geometric proximity simultaneously in the aggregation process, as described in Sec. 3.1. Therefore, our method preserves object shapes more clearly than [huang2019ccnet], as will be shown in Sec. 4.1.4. In order to adjust the balance of feature similarity between nodes $u$ and $v$, we further add learnable parameters $\lambda_h$ and $\lambda_v$ to (2):
$$w(u, v) = \prod_{(p, q) \in P_{u,v}} \exp\big(-(d(p, q) + \lambda)\big), \qquad (5)$$
where $p$ and $q$ are neighboring nodes on the horizontal or vertical path connecting $u$ and $v$, and $\lambda \in \{\lambda_h, \lambda_v\}$. Considering that the width and height of feature maps may differ, we use $\lambda_h$ and $\lambda_v$ for the horizontal and vertical directions respectively. If $\lambda_h$ ($\lambda_v$) is smaller, feature similarity plays a more important role in (5), and vice versa. Therefore, the learnable parameters $\lambda_h$ ($\lambda_v$) adjust the relation between feature similarity and geometric proximity adaptively.
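The role of the learnable offset in (5) can be seen on a toy horizontal path: each hop contributes a factor $\exp(-(d + \lambda_h))$, so a larger $\lambda_h$ penalizes long paths and shifts the balance toward proximity, while a small $\lambda_h$ leaves feature similarity dominant. A hedged sketch (this per-hop parameterization is our reading of (5)):

```python
import math

def path_weight(dists, lam):
    # Eq. (5) on a horizontal path: product over consecutive edges of
    # exp(-(d + lambda)); lambda adds a constant per-hop penalty, so
    # longer paths are damped regardless of feature similarity.
    w = 1.0
    for d in dists:
        w *= math.exp(-(d + lam))
    return w

# Two candidate supports for a central node: a similar node 4 hops away
# versus a less similar node 1 hop away.
far_similar   = [0.0, 0.0, 0.0, 0.0]   # 4 edges, identical features
near_distinct = [1.0]                  # 1 edge, feature distance 1

for lam in (0.0, 0.5):
    wf = path_weight(far_similar, lam)
    wn = path_weight(near_distinct, lam)
    # With lam = 0 the far similar node wins (pure similarity);
    # with lam = 0.5 the near node wins (proximity now matters).
```

This is exactly the adaptive trade-off described above: learning $\lambda$ lets the network decide, per direction, how quickly support should fall off with spatial distance.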
3.3 Hierarchical Semi-Global Blocks
A single semi-global block can only aggregate contextual information from the same row and column as a central node, ignoring other directions. We address this problem with multiple hierarchical semi-global blocks. In the first hierarchical level, a feature map $X$ is fed into a semi-global block, producing a feature map $H'$. In the second hierarchical level, another semi-global block takes $H'$ as input and outputs a feature map $H''$. Thus, every node in $H''$ aggregates contextual information from all nodes in $X$. Specifically, we denote the attention maps in the first and second hierarchical levels as $A^{(1)}$ and $A^{(2)}$. Fig. 3 illustrates how the contextual information at a node $u$ (in purple) spreads to a node $v$ that is not in the same row or column as $u$. In the first semi-global block, only nodes in the same row and column as $u$ receive contextual information from $u$. In the second semi-global block, $v$ receives contextual information from the nodes that aggregated $u$'s information in the first block, i.e. the intersections of $u$'s row and column with $v$'s row and column. Thus, $v$ obtains the contextual information of $u$ through this process.
More generally, every node can capture full-image contextual information through the hierarchical semi-global blocks, which enhances the capability of a single semi-global block for modeling long-range dependencies [huang2019ccnet].
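That two hierarchical blocks suffice for full-image coverage can be checked with a small reachability simulation: one criss-cross-style pass spreads each node's information along its row and column only, and a second pass reaches the whole grid. An illustrative sketch:

```python
def crisscross_step(reach):
    # One semi-global pass: each cell absorbs the reach sets of all
    # cells in its row and column (computed synchronously).
    H, W = len(reach), len(reach[0])
    new = [[set(reach[i][j]) for j in range(W)] for i in range(H)]
    for i in range(H):
        for j in range(W):
            for jj in range(W):
                new[i][j] |= reach[i][jj]   # same row
            for ii in range(H):
                new[i][j] |= reach[ii][j]   # same column
    return new

H, W = 4, 5
# Initially each cell only "knows" itself.
reach = [[{(i, j)} for j in range(W)] for i in range(H)]
one = crisscross_step(reach)
two = crisscross_step(one)
# After one pass a cell sees only its own row and column (H + W - 1
# positions); after two passes every cell sees all H * W positions.
```

The same argument underlies the two-step propagation in Fig. 3: the intersection cells relay information between any pair of rows and columns.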
3.4 A Linear-Time Algorithm for Efficient Implementation
For a clearer explanation, we ignore the $1 \times 1$ convolutions in Fig. 2, which does not influence our conclusion. In a semi-global block, each node aggregates contextual information from the nodes in its vertical and horizontal directions. According to (4), the total computational complexity of a single semi-global block is $\mathcal{O}(N\sqrt{N})$ with the brute-force implementation. Specifically, given an input feature map with $N = H \times W$ nodes, every node needs to aggregate contextual information from the $H + W - 2$ other nodes in the same row and column. Thus, we need $H + W - 2$ similarity computations for every node, and the total computation is $N(H + W - 2)$, i.e. $\mathcal{O}(N\sqrt{N})$, which is unfavorable for practical applications.
Noticing that there are extensive repeated computations in the brute-force implementation, we propose an optimized alternative which reduces the computational complexity of a single semi-global block to $\mathcal{O}(N)$. We regard each of the rows and columns of the given feature map as a binary tree and complete the interactions between any two nodes on the binary tree in linear time, i.e. $\mathcal{O}(H)$ or $\mathcal{O}(W)$. The total computational complexity over all columns and rows is thus $\mathcal{O}(N)$. Specifically, let us consider the operations on one row or column, as shown in Fig. 4. We arbitrarily select one node in the given row or column as the root node of a binary tree. Except for the root and leaf nodes, every node in the binary tree has a parent node and a child node. Every node is supposed to capture the contextual information of the others in this binary tree. The process of capturing contextual information is split into an aggregation step and an updating step:

We aggregate contextual information from the leaf nodes toward the root node recursively according to
$$A(v) = x_v + \sum_{P(c) = v} w(c, v)\, A(c), \qquad (6)$$
where $x_v$ is the input feature vector at node $v$, $P(c) = v$ means that $v$ is the parent node of $c$, and $w$ is the weight function between $c$ and $v$ defined in (5).

We update features from the root node toward the leaf nodes recursively by
$$\tilde{A}(v) = w(P(v), v)\, \tilde{A}(P(v)) + \big(1 - w(P(v), v)^2\big)\, A(v), \qquad (7)$$
where $A(v)$ is computed by (6) and $\tilde{A}(v) = A(v)$ at the root.
The computational complexities of the aggregation and updating steps are both $\mathcal{O}(H)$ for a column or $\mathcal{O}(W)$ for a row. Thus the total computational complexity is $\mathcal{O}(H)$ or $\mathcal{O}(W)$ for a single binary tree.
Operations on the other columns and rows are similar. We do not perform the updating steps until the aggregation steps are completed for all binary trees. Thus, the aggregation and updating steps of each binary tree are independent of the others, and they can be accelerated in parallel on GPUs. The output feature map is denoted as $Y$. A node $i$ is included in two binary trees ($T_{col}$ and $T_{row}$), as shown in Fig. 4. We add the updating results of $i$ in $T_{col}$ and $T_{row}$ to obtain
$$y_i = \tilde{A}_{T_{col}}(i) + \tilde{A}_{T_{row}}(i), \qquad (8)$$
where $y_i$ is the feature vector at $i$ in $Y$; it contains the contextual information in the horizontal and vertical directions of $i$ in $X$.
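The two-pass scheme of (6) and (7) can be checked against the brute-force definition on a single chain (a row or column rooted at its first node). The sketch below uses scalar features and omits the normalization by $Z_i$ for brevity; the downward update follows the classic two-pass tree filter, which is our reading of (7):

```python
import math

def edge_w(feats, i, j, lam=0.0):
    # Eq. (5) for one edge: feature distance plus the per-hop penalty.
    return math.exp(-(abs(feats[i] - feats[j]) + lam))

def brute_force(feats):
    # Direct evaluation of the path-product aggregation: O(n^2) pairs,
    # each requiring a product over the connecting path.
    n = len(feats)
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(n):
            w = 1.0
            for p in range(min(i, j), max(i, j)):
                w *= edge_w(feats, p, p + 1)
            acc += w * feats[j]
        out.append(acc)
    return out

def two_pass(feats):
    # Eqs. (6)-(7): O(n) aggregation on a chain rooted at node 0,
    # where the parent of node v is node v - 1.
    n = len(feats)
    up = list(feats)
    for v in range(n - 1, 0, -1):        # leaf -> root, Eq. (6)
        up[v - 1] += edge_w(feats, v - 1, v) * up[v]
    out = [0.0] * n
    out[0] = up[0]                       # root value is final
    for v in range(1, n):                # root -> leaf, Eq. (7)
        w = edge_w(feats, v - 1, v)
        out[v] = w * out[v - 1] + (1.0 - w * w) * up[v]
    return out

feats = [0.2, 0.9, 0.1, 0.5, 0.7]
a, b = brute_force(feats), two_pass(feats)
# Both produce identical aggregates; only the cost differs.
```

The $(1 - w^2)$ term in the downward pass removes the contribution that the subtree below $v$ would otherwise receive twice: once in its own upward sum and once reflected back through the parent.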
4 Experiments
In this section, we evaluate SGSNet on semantic segmentation and image retrieval. We first conduct extensive experiments on semantic segmentation to understand the behavior of SGSNet, and then extend it to image retrieval to demonstrate its generality.
4.1 Experiments on Semantic Segmentation
We adopt the Cityscapes [cordts2016cityscapes] benchmark for semantic segmentation and report mean IoU (mean of class-wise intersection over union). Cityscapes is a semantic segmentation benchmark focusing on urban street scenes. It contains 5,000 finely annotated images, divided into 2,975 training, 500 validation and 1,525 testing images. A larger set of 20,000 coarsely annotated images is also provided to support methods that exploit large volumes of weakly-labeled data. The dataset defines 30 visual classes, of which 19 are used in our experiments.
Based on the proposed linear-time algorithm, SGSNet can be conveniently plugged into any deep neural network for capturing long-range dependencies. We adopt ResNet101 [he2016deep] (pretrained on ImageNet) as our backbone with minor changes: the last two downsampling layers are dropped and dilated convolutions are embedded into the subsequent convolutional layers, similar to [huang2019ccnet]. We use mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. A poly learning rate schedule with an initial learning rate of 0.01 and a power of 0.9 is employed. We also augment the training set by random scaling (0.75 to 2.0) and then crop out patches of 769 × 769 pixels as the network input resolution. The learnable parameters $\lambda_h$ and $\lambda_v$ are both initialized to 1. The number of channels in $F$ is one eighth of the number of channels in $X$, and we share the parameters of the hierarchical semi-global blocks.
4.1.1 Comparisons with Other Methods
We first train SGSNet on the Cityscapes training set and present results on the Cityscapes validation set in Tab. 1. Our method achieves better results than the other methods, even though DeepLabv3+ [chen2018encoder] and DPC [chen2018searching] use a more powerful backbone.
We further compare our method with existing methods on the Cityscapes testing set, as shown in Tab. 2. SGSNet is trained with only the finely annotated data, and the test results are submitted to the official evaluation server. It can be seen from Tab. 2 that SGSNet outperforms the other methods, whether they employ the same backbone as ours or a stronger one [yuan2018ocnet, yang2018denseaspp]. CCNet freezes batch normalization (BN) layers and fine-tunes the model with a low learning rate after a certain number of training iterations; although we use none of those tricks, our method is still better than CCNet. This confirms the importance of considering both feature similarity and geometric proximity for preserving object shapes when capturing long-range dependencies. Furthermore, SGSNet is more efficient, which will be elaborated in Sec. 4.1.2.
Method  Backbone  Multi-scale  mIoU (%)
DeepLabv3 [chen2017rethinking]  ResNet101  Yes  79.3 
DeepLabv3+ [chen2018encoder]  Xception65  No  79.1 
DPC [chen2018searching]  Xception71  No  80.8 
CCNet [huang2019ccnet]  ResNet101  Yes  81.3 
SGSNet  ResNet101  No  80.9 
SGSNet  ResNet101  Yes  81.9 
Method  Backbone  mIoU(%) 

DeepLabv2 [chen2017deeplab]  ResNet101  70.4 
RefineNet [lin2017refinenet]  ResNet101  73.6 
SAC [zhang2017scale]  ResNet101  78.1 
GCN [peng2017large]  ResNet101  76.9 
DUC [wang2018understanding]  ResNet101  77.6 
ResNet38 [yuan2018ocnet]  WiderResnet38  78.4 
PSPNet [zhao2017pyramid]  ResNet101  78.4 
BiSeNet [yu2018bisenet]  ResNet101  78.9 
AAF [ke2018adaptive]  ResNet101  79.1 
PSANet [zhao2018psanet]  ResNet101  80.1 
DFN [yu2018learning]  ResNet101  79.3 
DenseASPP [yang2018denseaspp]  DenseNet161  80.6 
TF [song2019learnable]  ResNet101  80.8 
CCNet [huang2019ccnet]  ResNet101  81.4 
SGSNet  ResNet101  82.1 
4.1.2 Analysis of Computational Complexity
Based on the linear-time algorithm described in Sec. 3.4, the computational complexity of SGSNet is $\mathcal{O}(N)$, linearly proportional to the number of nodes in a feature map. We explore the computation cost and the number of parameters of SGSNet on the Cityscapes validation set. We again use ResNet101 as the backbone with an input size of 769 × 769 pixels, so the input feature maps of the hierarchical semi-global blocks are 97 × 97 pixels. The baseline network is ResNet101 with minor changes, where dilated convolutional layers are adopted at stages 4 and 5. As shown in Tab. 3, SGSNet improves performance by 5.8% mIoU over the baseline with an additional 0.295M parameters and 11.4G FLOPs. We also list the GFLOPs and parameters of Non-local and CCNet in Tab. 3. SGSNet achieves higher performance while using fewer parameters and FLOPs than CCNet. Non-local uses slightly fewer parameters than SGSNet, but its additional FLOPs are far greater (108G vs. 11.4G), and SGSNet increases mIoU by 1.8% over Non-local. Therefore, SGSNet is more efficient and effective than the other two.
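These measurements mirror the asymptotic costs. As a rough back-of-the-envelope illustration for the 97 × 97 feature map used here (constant factors and channel dimensions are ignored, so the numbers are ratios of pairwise interactions, not FLOPs):

```python
H = W = 97
N = H * W                          # 9409 spatial positions

nonlocal_ops = N * N               # every position attends to every other
ccnet_ops = 2 * N * (H + W - 1)    # two criss-cross passes, H+W-1 supports each
sgsnet_ops = 2 * 2 * N             # two blocks, two linear passes per block

ratio = nonlocal_ops / ccnet_ops   # ~24x more pairwise work for Non-local
```

The quadratic term is what makes Non-local expensive at segmentation resolutions, while the linear-time two-pass scheme keeps SGSNet's overhead small even as the map grows.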
Method  GFLOPs (Δ)  Params (M, Δ)  mIoU (%)

baseline  0  0  75.1 
Nonlocal  108  0.131  79.1 
CCNet  16.5  0.328  79.8 
SGSNet  11.4  0.295  80.9 
Comparisons of SGSNet with CCNet and Non-local. The increments of FLOPs and parameters are estimated for an input of 1 × 3 × 769 × 769 pixels.
4.1.3 Ablation Studies
In order to further understand the behavior of SGSNet, we conduct extensive ablation studies on the Cityscapes validation set. The increments of FLOPs and parameters with different numbers of hierarchical semi-global blocks are listed in Tab. 4. Adding one semi-global block increases mIoU by 4.4% over the baseline network, which indicates the importance of the semi-global block. Adding a second semi-global block hierarchically increases performance by another 1.4% mIoU. These results demonstrate that the proposed hierarchical semi-global blocks capture full-image contextual information and significantly improve performance. We believe that performance could be further improved by adding more semi-global blocks hierarchically. Tab. 4 also shows that a single semi-global block adds only 5.70G FLOPs and 0.295M parameters. The parameter count does not grow when adding the second block because the two blocks share parameters. Qualitative results are given in Fig. 5. The areas (truck, ground and fence) in the red circles of the first-column images are easily misclassified. These areas cannot be classified correctly by adding just one semi-global block, as shown in the second column of Fig. 5. But when we add two semi-global blocks hierarchically, these areas are rectified, as shown in the third column, which explicitly demonstrates the advantages of hierarchical semi-global blocks.
We also explore the influence of the learnable parameters $\lambda_h$ and $\lambda_v$. We use two semi-global blocks and remove $\lambda_h$ and $\lambda_v$ from (5). The results are listed in the fourth row of Tab. 4. The performance drops by 0.7% mIoU (compared with the last row of Tab. 4) due to the absence of $\lambda_h$ and $\lambda_v$.
Hierarchical  GFLOPs (Δ)  Params (M, Δ)  mIoU (%)
baseline  0  0  75.1 
H=1  5.70  0.295  79.5 
H=2 (no $\lambda_h$, $\lambda_v$)  11.4  0.295  80.2
H=2  11.4  0.295  80.9 

4.1.4 Visualization of Attention Module
To further elaborate what SGSNet has learned, we visualize the learned attention maps of SGSNet in the last column of Fig. 6. For each image in the left column, we choose a specific position (marked by a red cross) and show its corresponding SGSNet attention map in the right column. Images in the middle column are the corresponding attention maps of CCNet. Both SGSNet and CCNet use two blocks. We observe that, compared with CCNet, SGSNet captures clear semantic similarity and learns object shapes simultaneously. For example, in the first image, the marked position on the truck obtains almost all of its high responses from other positions on the truck, and the outline of the truck is clearly visible in the corresponding SGSNet attention map. Similar phenomena can be seen at the marked positions (on the building and the traffic sign) of the second and third images. Areas within the same object are activated with high responses, which means that SGSNet is aware of object shapes when capturing long-range dependencies.
To classify a given ambiguous pixel, humans usually look around the pixel, rather than farther pixels, to look for contextual cues that help classify the given pixel correctly. From this perspective, SGSNet is more similar to human behavior than CCNet.
4.2 Experiments on Image Retrieval
4.2.1 Datasets
In this section, we investigate the performance of SGSNet on the large-scale image retrieval task. We train networks on Google Landmarks 18 (GL18) [teichmann2019detect] and test on the Revisited Oxford and Paris datasets [radenovic2018revisiting]. The GL18 dataset is based on the original Kaggle challenge dataset [noh2016image]. It contains more than 1.2 million photos collected from 15,000 landmarks covering a wide range of scenes (historic cities, metropolitan areas, natural scenery, etc.). The Revisited Oxford and Paris datasets are frequently used to evaluate large-scale image retrieval methods. They improve the original Oxford and Paris datasets by dropping wrong annotations and adding new images, resulting in 4,993 and 6,322 images respectively. According to difficulty, the evaluation tasks are divided into three groups: easy, medium, and hard. In each task, the metrics are mean average precision (mAP) and mean precision at rank 10.
Method  Medium  Hard  
ROxf  RPar  ROxf  RPar  
baseline  67.6  80.9  44.9  61.9 
+Nonlocal  68.8  82.0  46.6  64.5 
+CCNet  68.8  81.8  46.8  64.7 
+SGSNet  68.8  82.0  47.1  64.8 
4.2.2 Comparisons with Other Methods
In this section, we compare SGSNet with Non-local and CCNet on image retrieval. We adopt ResNet101-GeM [radenovic2018fine] trained with the triplet loss and second-order similarity (SOS) loss [tian2019sosnet] as our baseline. ResNet101 [he2016deep] contains five fully convolutional blocks, conv1 to conv5_x. For a fair comparison, we add Non-local, CCNet and SGSNet after conv5_x respectively. Inside Non-local, CCNet and SGSNet, the number of channels of the input feature map is reduced for efficient computation, and we do not use batch normalization or ReLU layers. Following [ng2020solar], we report the mAP of these methods on the medium and hard tasks of the Revisited Oxford and Paris datasets, as shown in Tab. 5.
From Tab. 5, we can see that adding SGSNet to the baseline significantly improves image retrieval accuracy. Compared with Non-local and CCNet, SGSNet achieves comparable results on the medium task and superior results on the hard task. The hard task contains many large viewpoint changes and significant occlusions, which hinder the aggregation of contextual information. Since SGSNet captures most contextual information around central nodes, large viewpoint changes and occlusions have relatively less impact on it. On the whole, SGSNet achieves better results while taking less computation, as shown in Sec. 4.1.2.
5 Conclusion
In this work, we propose a Semi-Global Shape-aware Network (SGSNet) which considers both feature similarity and geometric proximity for preserving object shapes when capturing long-range dependencies. In a single block, each position in the feature map captures contextual information in the horizontal and vertical directions according to both similarity and proximity; full-image contextual information is then harvested by adding more semi-global blocks hierarchically. In addition, each of the rows and columns of the given feature map is regarded as a binary tree, and based on these tree structures we present a linear-time algorithm that further improves the efficiency of SGSNet. Extensive ablation studies have been conducted to understand the proposed method in depth. We also show the superiority of SGSNet on semantic segmentation and image retrieval. In the future, we will explore SGSNet in more vision tasks.