Spatial Pyramid Based Graph Reasoning for Semantic Segmentation

03/23/2020 · Xia Li et al. · Peking University

The convolution operation suffers from a limited receptive field, while global modeling is fundamental to dense prediction tasks such as semantic segmentation. In this paper, we apply graph convolution to the semantic segmentation task and propose an improved Laplacian. The graph reasoning is performed directly in the original feature space organized as a spatial pyramid. Unlike existing methods, our Laplacian is data-dependent, and we introduce an attention diagonal matrix to learn a better distance metric. Our formulation gets rid of the projection and re-projection processes, which makes the proposed method a light-weight module that can be easily plugged into current computer vision architectures. More importantly, performing graph reasoning directly in the feature space retains spatial relationships and enables a spatial pyramid to explore multiple long-range contextual patterns from different scales. Experiments on Cityscapes, COCO Stuff, PASCAL Context and PASCAL VOC demonstrate the effectiveness of our proposed method for semantic segmentation. We achieve comparable performance with advantages in computational and memory overhead.


1 Introduction

Convolutional Neural Network (CNN) based architectures have revolutionized a wide range of computer vision tasks [20, 48, 5, 38]. Despite this huge success, convolutional operations suffer from a limited receptive field, so they can only capture local information. Only by stacking many layers into a deep model can convolutional networks aggregate rich global context. However, this is inefficient, since stacking local cues cannot always precisely handle long-range context relationships. Especially for pixel-level classification problems such as semantic segmentation, performing long-range interactions is an important factor for reasoning in complex scenarios [5, 6]. For example, a model is prone to assigning visually similar pixels in a local region to the same category, while pixels that belong to the same object but lie far apart are difficult to link with dependencies.

Several approaches have been proposed to address this problem. Convolutional operations have been reformulated with dilation [51] or learnable offsets [12] to augment the spatial sampling locations. Non-local networks [46] and double attention networks [9] introduce new interaction modules that sense the whole spatial-temporal space. They enlarge the receptive region and enable capturing long-range dependencies within deep neural networks. Recurrent neural networks (RNNs) can also be employed to perform long-range reasoning [16, 43]. However, these methods learn global relationships implicitly and rely on dense computation. Because graph-based propagation has the potential benefit of reasoning with explicit semantic meaning stored in the graph structure, graph convolution [24] has recently been introduced into high-level computer vision tasks [28, 29, 10]. These methods first transform the grid-based CNN features into a graph representation by projection, and then perform graph reasoning with the graph convolution proposed in [24]. Finally, the node features are re-projected back into the original space. The projection and re-projection processes try to build connections between the coordinate space and an interaction space, but they introduce considerable computational overhead and damage the spatial relationships.

As illustrated in Figure 1, in this paper we propose an improved Laplacian formulation for graph reasoning that is performed directly in the original CNN feature space organized as a spatial pyramid. It gets rid of the projection and re-projection processes, making our proposed method a light-weight module jointly optimized with the network training. Performing graph reasoning directly in the original feature space retains the spatial relationships and allows the spatial pyramid to sufficiently exploit long-range semantic context from different scales. We name our proposed method the Spatial Pyramid Based Graph Reasoning (SpyGR) layer.

Figure 1: A diagram of our model with graph reasoning on spatial pyramid for the semantic segmentation task. The graph reasoning is directly performed in the original feature space. Multiple long-range contextual patterns are captured from different scales.

Initially, graph convolution was introduced to extract representations in non-Euclidean spaces, which cannot be handled well by current CNN architectures [2]. It may seem that graph propagation must be performed on graph-structured data, which motivates the construction of a semantic interaction space in [28, 29, 10]. However, we note that image features can be regarded as a special case of data defined on a simple low-dimensional graph [21]. When the graph structure of the input is known, i.e., the Laplacian matrix is given, the graph convolution of [24] essentially performs a special form of Laplacian smoothing on the input, making each new vertex feature the average of itself and its connected neighbors [26]. When the graph structure is not given, as is the case for CNN features, it can be estimated with a similarity matrix computed from the data [21], which achieves a similar goal to the projection process adopted in [28, 29, 10]. Different from their work, where the Laplacian is a learnable data-independent matrix, in this study we make the Laplacian a data-dependent similarity matrix, and introduce a diagonal matrix that performs channel-wise attention on the inner product distance. This Laplacian ensures that the long-range context pattern to learn depends on the input features and is not restricted to a specific one. Our method spares the computation of constructing an interaction space by projection. More importantly, it retains the spatial relationships, which facilitates exploiting long-range context from multi-scale features.

A spatial pyramid contains multi-scale contextual information that is important for dense prediction tasks [51, 56, 35]. For graph-structured data, a multi-scale scheme is also key to building hierarchical representations and making the model invariant to scale changes [49, 32]. Global context comprises multiple long-range contextual patterns that are better captured from features of different sizes: the finer representation carries more detailed long-range context, while the coarser representation provides more global relationships. Because our method performs graph reasoning directly in the original feature space, we can build a spatial pyramid to further extend the long-range contextual patterns that our method can model.

The SpyGR layer is light-weight and can be plugged into CNN architectures easily. It efficiently extracts long-range context without introducing much computational overhead. The contributions in this study are listed as follows:

  • We propose an improved Laplacian formulation that is data-dependent, and introduce a diagonal matrix with position-agnostic attention on the inner product to enable a better distance metric.

  • The Laplacian is able to perform graph reasoning in the original feature space, and makes spatial pyramid possible to capture multiple long-range contextual patterns. We develop a computing scheme that effectively reduces the computational overhead.

  • Experiments on multiple datasets, including PASCAL Context, PASCAL VOC, Cityscapes and COCO Stuff, show the effectiveness of our proposed methods for the semantic segmentation task. We achieve top performance with advantages in computational and memory overhead.

2 Related Work

Semantic segmentation. The fully convolutional network (FCN) [38] has been the basis of semantic segmentation with CNNs. Because details are important for dense classification problems, different methods have been proposed to recover the desired spatial resolution and keep object details. In [40], deconvolution [52] is employed to learn finer representations from low-resolution feature maps, while SegNet [1] achieves this purpose using an encoder-decoder structure. U-Net [41] adds skip connections between the down-sampling and up-sampling paths. RefineNet [34] introduces a multi-path refinement network that further exploits the finer information along the down-sampling path.

Another stream aims to enhance multi-scale contextual information aggregation. In [17], input images are constructed as a Laplacian pyramid and each scale is fed into a deep CNN model. ParseNet [36] introduces image-level features to augment global context. DeepLabv2 [5] proposes the atrous spatial pyramid pooling (ASPP) module that consists of parallel dilated convolutions with variant dilation rates. PSPNet [56] performs spatial pyramid pooling to collect contextual information of different scales. DeepLabv3 [6] employs ASPP module on image-level features to better aggregate global context.

Other methods that model global context include formulating advanced convolution operations [12, 46, 9], relying on attention mechanisms [7, 53, 57, 18], and introducing Conditional Random Field (CRF) [4, 58, 37] or RNN variants [30, 16, 43] to build long-range dependencies. Still, it needs further efforts to explore how to model global context more efficiently, and perform reasoning explicitly with the semantic meanings.

Graph convolution. Graph convolution was initially introduced as a graph analogue of the convolutional operation [2]. Later studies [13, 24] approximate the graph convolution formulation to reduce the computational cost and the number of training parameters. This provides the basis of feature embedding on graph-structured data for semi-supervised learning [24, 26], node or graph classification [44, 49, 54], and molecule prediction [27]. Due to the ability of graph propagation to capture global information, graph reasoning has been introduced into visual recognition tasks [28, 29, 10]. These methods transform the grid-based feature maps into region-based node features via projection. Different from these studies, our method notes that graph reasoning can be performed directly in the original feature space once the learnable Laplacian matrix is made data-dependent. It spares the computation of projection and re-projection, and retains the spatial relationships in the graph reasoning process.

Feature pyramid. The feature pyramid is an effective scheme to capture multi-scale context. It is widely adopted in dense prediction tasks such as semantic segmentation [5, 56] and object detection [35, 19]. Hierarchical representations are also shown to be useful for embedding on graph-structured data [49]. Different from the pyramid pooling module in [5], we build our spatial pyramid simply by down-sampling and up-sampling the final predicting feature maps. We perform graph reasoning directly on each of the scales and aggregate them to capture sufficient long-range contextual relationships in the final prediction.

3 Our Methods

In this section, we first briefly introduce the background of graph convolution, and then develop our method in detail. Finally, we analyze the complexity of our method.

3.1 Graph Reasoning on Graph Structures

Graph convolution was introduced as an analogue of the convolutional operation on graph-structured data. Given a graph $\mathcal{G}$ with adjacency matrix $A$ and degree matrix $D$, the normalized graph Laplacian matrix is defined as $L = I - D^{-1/2} A D^{-1/2}$. It is a symmetric positive semi-definite matrix and has a complete set of eigenvectors $U = [u_1, \dots, u_n]$, where $n$ is the number of vertices. The Laplacian of the graph can be diagonalized as $L = U \Lambda U^\top$. Then we have the graph Fourier transform $\hat{x} = U^\top x$, which transforms a graph signal $x$ into the spectral domain spanned by the basis $U$.

Generalizing the convolution theorem to the structured space of a graph, convolution can be defined by decomposing a graph signal in the spectral domain and then applying a spectral filter [2]. A naive implementation requires explicitly computing the Laplacian eigenvectors. To circumvent this problem, a later study [13] approximated the spectral filter with Chebyshev polynomials up to the $K$-th order, so that the convolution of a graph signal $x$ can be formulated as:

$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x, \qquad \tilde{L} = \frac{2}{\lambda_{\max}} L - I, \tag{1}$$

where $T_k$ is the Chebyshev polynomial of order $k$ and $\theta \in \mathbb{R}^{K+1}$ is a vector of Chebyshev coefficients. In [24], the formulation is further simplified by limiting $K = 1$ and approximating the largest eigenvalue $\lambda_{\max}$ of $L$ by 2. In this way, the convolution becomes:

$$g_\theta \star x = \theta \left( I + D^{-1/2} A D^{-1/2} \right) x, \tag{2}$$

with $\theta$ being the only Chebyshev coefficient left. They further introduce a normalization trick:

$$I + D^{-1/2} A D^{-1/2} \;\rightarrow\; \tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2}, \tag{3}$$

where $\tilde{A} = A + I$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Generalizing the convolution to a graph signal $X \in \mathbb{R}^{n \times C}$ with $C$ channels, the layer-wise propagation rule in a multi-layer graph convolutional network (GCN) is given by [24]:

$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A}\, \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right), \tag{4}$$

where $H^{(l)}$ is the matrix of vertex features at the $l$-th layer (with $H^{(0)} = X$), $W^{(l)}$ is the trainable weight matrix of layer $l$, and $\sigma(\cdot)$ is a non-linear activation function.

Eq. (4) provides the basis for performing convolution on graph-structured data, as adopted in [54, 49]. For visual recognition tasks, in order to overcome the limited receptive field of current CNN architectures, some recent studies transform feature maps into region-based representations by projection, and then perform graph reasoning with Eq. (4) to capture global relationships [28, 29, 10].
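For concreteness, the following is a minimal PyTorch sketch of the propagation rule in Eq. (4), assuming a dense adjacency matrix; the class name `GCNLayer` and the tensor shapes are illustrative, not taken from any released implementation.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer, Eq. (4): H' = sigma(D~^-1/2 A~ D~^-1/2 H W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)        # W^(l)

    def forward(self, h, adj):
        # h: (n, C) vertex features; adj: (n, n) dense adjacency matrix
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # A~ = A + I
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)                   # D~^-1/2 (as a vector)
        norm_adj = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
        return torch.relu(norm_adj @ self.linear(h))                # sigma(. H W)
```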

3.2 Graph Reasoning on Spatial Features

Assuming that the propagation rule in Eq. (4) is applied to CNN features, i.e., $H^{(0)} = X \in \mathbb{R}^{n \times C}$, the only difference between a GCN layer and a convolution layer is the graph Laplacian matrix applied on the left of $H^{(l)}$. In our study, we note that the original grid-based feature space can be deemed a special case of data defined on a simple low-dimensional graph [21]. Besides, the projection process in current methods [28, 29, 10] actually serves a similar purpose to the graph Laplacian matrix: they left-multiply the input feature by a similarity matrix to obtain a global perception among all spatial locations. Therefore, we perform our graph reasoning directly in the original feature space. We save the projection and re-projection processes, and perform left matrix multiplication on the input feature only once.

The Laplacian matrices in most current studies are data-independent parameters to learn. In order to better capture the intra-image spatial structure, we propose an improved Laplacian that ensures the long-range context pattern to learn depends on the input features and is not restricted to a specific one. It is formulated in the symmetric normalized form:

$$\tilde{L} = I - D^{-1/2} A D^{-1/2}, \tag{5}$$

where $D = \mathrm{diag}(d)$ with $d_i = \sum_j A_{ij}$, and $A \in \mathbb{R}^{n \times n}$ is the data-dependent similarity matrix. We set $n = HW$, where $n$ denotes the number of spatial locations of the input feature.

For the similarity matrix $A$, the Euclidean distance can be used to estimate the graph structure, as suggested in [21]. We instead choose the dot-product distance to calculate $A$, because the dot product has a more friendly implementation on current deep learning platforms. The similarity between positions $i$ and $j$ is expressed as:

$$A_{ij} = \phi(x_i)^\top \Lambda\, \phi(x_j), \tag{6}$$

where $\phi(\cdot)$ is a linear embedding followed by a ReLU non-linearity, $d$ is the reduced dimension after the transformation, and $\Lambda \in \mathbb{R}^{d \times d}$ is a diagonal matrix that applies position-agnostic attention to the inner product. It essentially learns a better distance metric for the similarity matrix $A$. Both $\phi$ and $\Lambda$ are data-dependent. Concretely, $\phi$ is implemented as a $1 \times 1$ convolution, and $\Lambda$ is implemented in a similar way to the channel-wise attention proposed in [22]. We calculate $\Lambda$ as:

$$\Lambda = \mathrm{diag}\big( \mathrm{sigmoid}( \psi(x_{\mathrm{gp}}) ) \big), \tag{7}$$

where $x_{\mathrm{gp}}$ is the feature after global pooling, and $\psi$ is another linear embedding, implemented with a $1 \times 1$ convolution, that reduces the dimension from $C$ to $d$. It is followed by the sigmoid function.

The computation procedure of $A$ is shown in Figure 2, and its formulation is as follows:

$$A = \phi(X; W_\phi)\, \Lambda\, \phi(X; W_\phi)^\top, \tag{8}$$

where $W_\phi$ and $W_\psi$ are the learnable parameters of the linear transformations. Because the degree matrix $D$ in Eq. (5) already serves as a normalization, we do not perform softmax on the similarity matrix $A$. We then formulate the graph reasoning in our model as:

$$\tilde{X} = \sigma\big( \tilde{L} X W \big), \tag{9}$$

where $X$ is the input feature, $W$ is a trainable weight matrix, $\sigma$ is the ReLU activation function, and $\tilde{X}$ is the output feature.

Figure 2: The computation procedure of the similarity matrix $A$ from the input feature $X$.
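To make Eqs. (5)-(9) concrete, the sketch below builds the data-dependent similarity matrix densely and applies one graph reasoning step. The naming (`GraphReasoning`, `phi`, `psi`) and shapes are our own assumptions, not the authors' released code; the dense $n \times n$ matrix materialized here is exactly what the computing scheme of Section 3.4 later avoids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoning(nn.Module):
    """Dense sketch of Eqs. (5)-(9): data-dependent Laplacian + one reasoning step."""

    def __init__(self, channels, reduced_dim):
        super().__init__()
        self.phi = nn.Conv2d(channels, reduced_dim, 1)   # linear embedding phi, Eq. (6)
        self.psi = nn.Conv2d(channels, reduced_dim, 1)   # embedding for Lambda, Eq. (7)
        self.weight = nn.Conv2d(channels, channels, 1)   # trainable W in Eq. (9)

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        feat = F.relu(self.phi(x)).view(b, -1, n)                   # phi(X)^T: (b, d, n)
        # Eq. (7): position-agnostic attention from globally pooled features
        lam = torch.sigmoid(self.psi(F.adaptive_avg_pool2d(x, 1)))  # (b, d, 1, 1)
        lam = lam.view(b, -1, 1)                                    # diagonal of Lambda
        # Eq. (8): A = phi(X) Lambda phi(X)^T, a symmetric (b, n, n) matrix
        sim = feat.transpose(1, 2) @ (lam * feat)
        # Eq. (5): symmetric normalization by the degree matrix D
        deg_inv_sqrt = sim.sum(dim=-1).clamp(min=1e-6).pow(-0.5)    # (b, n)
        norm_sim = deg_inv_sqrt[:, :, None] * sim * deg_inv_sqrt[:, None, :]
        xw = self.weight(x).view(b, c, n)                           # X W
        out = xw - xw @ norm_sim                 # (I - D^-1/2 A D^-1/2) X W; A symmetric
        return F.relu(out).view(b, c, h, w)                         # Eq. (9)
```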

3.3 Graph Reasoning on Spatial Pyramid

Although graph reasoning is capable of capturing global context, we note that the same image contains multiple long-range contextual patterns. For example, the finer representation may carry more detailed long-range context, while the coarser representation provides more global dependencies. Since our graph reasoning module operates directly in the original feature space, we organize the input feature as a spatial pyramid to extend the long-range contextual patterns that our method can capture.

As shown in Figure 1, graph reasoning is performed on each scale acquired by down-sampling, and the output features are then combined through up-sampling. It has a similar form to the feature pyramid network in [35], but we apply our method to the final predicting feature, instead of the multi-scale features from the CNN backbone. Our graph reasoning on the spatial pyramid can be expressed as follows:

$$\tilde{X} = \mathcal{G}(X) + \mathcal{U}\Big( \mathcal{G}\big(\mathcal{D}(X)\big) + \mathcal{U}\big( \cdots + \mathcal{U}\big(\mathcal{G}(\mathcal{D}^{s}(X))\big) \big) \Big), \tag{10}$$

where $\mathcal{G}$ denotes the graph reasoning with Eq. (9), $s$ denotes the level of scales, and $\mathcal{U}$ and $\mathcal{D}$ represent the up-sampling and down-sampling operators, respectively. We implement $\mathcal{D}$ using strided max-pooling, and $\mathcal{U}$ simply by bilinear interpolation.
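A minimal sketch of Eq. (10) follows, assuming stride-2 max-pooling for $\mathcal{D}$ and bilinear interpolation for $\mathcal{U}$; the recursive function and its names are ours, and `GraphReasoning` refers to the sketch above.

```python
import torch.nn.functional as F

def pyramid_reasoning(x, gr_layers):
    """Eq. (10): reason at every scale, then merge coarse-to-fine.

    x: (b, c, h, w) feature map; gr_layers: one GraphReasoning module per level.
    """
    if len(gr_layers) == 1:
        return gr_layers[0](x)
    coarse = F.max_pool2d(x, kernel_size=2, stride=2)        # D(X), assumed stride 2
    merged = pyramid_reasoning(coarse, gr_layers[1:])        # recurse on coarser levels
    up = F.interpolate(merged, size=x.shape[-2:],
                       mode='bilinear', align_corners=False) # U(.)
    return gr_layers[0](x) + up
```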

3.4 Complexity Analysis

In region-based graph reasoning studies [28, 29, 10], the grid-based CNN features are transformed into region-based vertices by projection, which reduces the computational overhead of graph reasoning because the number of vertices is usually smaller than the number of spatial locations. It may seem that our method consumes more computation, since we perform graph reasoning directly in the original feature space. In fact, we adopt an efficient computing strategy that substantially reduces the computational complexity. We note that the large computation is caused by the $n \times n$ similarity matrix $A$, so we never calculate $A$ explicitly. Concretely, we calculate the degree matrix $D$ in Eq. (5) as follows:

$$D = \mathrm{diag}\Big( \phi(X)\big( \Lambda\, (\phi(X)^\top \mathbf{1}) \big) \Big), \tag{11}$$

where $\mathbf{1}$ denotes the all-one vector in $\mathbb{R}^n$, and the brackets indicate the order of computation. In this way, each step in Eq. (11) is a multiplication with a vector, which effectively reduces the computational overhead. We then calculate the left product of the Laplacian with the input feature as follows:

$$\tilde{L} X W = X W - D^{-1/2} \Big( \phi(X)\big( \Lambda\, (\phi(X)^\top \bar{X}) \big) \Big) W, \tag{12}$$

where $\bar{X}$ is defined as $D^{-1/2} X$. Correspondingly, we calculate the terms in the inner brackets first. In this way, we circumvent a quadratic order of computation over the spatial locations $n$.
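The reordering in Eqs. (11) and (12) can be written directly as matrix products evaluated inner-bracket first, so the $n \times n$ similarity matrix is never materialized. The sketch below uses our own tensor names and assumes $XW$ is precomputed (left-multiplying by $\tilde{L}$ commutes with the right product by $W$).

```python
import torch

def efficient_reasoning(feat, lam, xw):
    """Eqs. (11)-(12): compute L~ X W without forming the n x n matrix A.

    feat: phi(X)^T with shape (b, d, n); lam: diagonal of Lambda, (b, d, 1);
    xw:   X W with shape (b, n, c).
    """
    b, d, n = feat.shape
    ones = feat.new_ones(b, n, 1)
    # Eq. (11): degree vector, evaluated inner bracket first (mat-vec products only)
    deg = feat.transpose(1, 2) @ (lam * (feat @ ones))              # (b, n, 1)
    deg_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
    # Eq. (12): L~ X W = X W - D^-1/2 phi(X) (Lambda (phi(X)^T (D^-1/2 X W)))
    x_bar = deg_inv_sqrt * xw                                       # D^-1/2 X W, (b, n, c)
    out = xw - deg_inv_sqrt * (feat.transpose(1, 2) @ (lam * (feat @ x_bar)))
    return out   # cost O(n * d * c) instead of O(n^2 * c)
```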

In our experiments, we set $d$ to 64 and $n = HW$. Given input features of height $H$ and width $W$, we calculate the computational and memory cost of our proposed layer and compare with related methods under the same settings. As shown in Table 1, our method on a single-scale input has low computational cost, and building the spatial pyramid over four scales does not drastically increase the computational and memory overheads. Therefore, our SpyGR layer does not introduce unbearable overhead despite performing graph reasoning directly in the original feature space.

Method FLOPs (G) Memory (M)
Nonlocal [46] 14.60 1072
A²-Net [9] 3.11 110
GloRe [10] 3.11 103
SGR [29] 6.24 118
DANet [18] 19.54 1114
SpyGR w/o pyramid 3.11 120
SpyGR 4.12 164
Table 1: Overhead of different modules under the same input feature size. The bottom two rows show the complexity of our model on a single-scale feature and on a spatial pyramid with four scales, respectively.

4 Experiments

4.1 Datasets and Implementation Details

To evaluate our proposed SpyGR layer, we carry out comprehensive experiments on the Cityscapes dataset [11], the PASCAL Context dataset [39], the COCO Stuff dataset [3], and PASCAL VOC. We describe these datasets, together with implementation details and the loss function, as follows.

Method mIoU road sidewalk building wall fence pole traffic light traffic sign vegetation terrain sky person rider car truck bus train motorcycle bicycle

Deeplabv2 [5] 70.4 97.9 81.3 90.3 48.8 47.4 49.6 57.9 67.3 91.9 69.4 94.2 79.8 59.8 93.7 56.5 67.5 57.5 57.7 68.8
RefineNet [34] 73.6 98.2 83.3 91.3 47.8 50.4 56.1 66.9 71.3 92.3 70.3 94.8 80.9 63.3 94.5 64.6 76.1 64.3 62.2 70.0
DUC-HDC [45] 77.6 98.5 85.5 92.8 58.6 55.5 65.0 73.5 77.9 93.3 72.0 95.2 84.8 68.5 95.4 70.9 78.8 68.7 65.9 73.8
SAC [55] 78.1 98.7 86.5 93.1 56.3 59.5 65.1 73.0 78.2 93.5 72.6 95.6 85.9 70.8 95.9 71.2 78.6 66.2 67.7 76.0
DepthSeg [25] 78.2 98.5 85.4 92.5 54.4 60.9 60.2 72.3 76.8 93.1 71.6 94.8 85.2 69.0 95.7 70.1 86.5 75.7 68.3 75.5
PSPNet [56] 78.4 - - - - - - - - - - - - - - - - - - -
AAF [23] 79.1 98.5 85.6 93.0 53.8 59.0 65.9 75.0 78.4 93.7 72.4 95.6 86.4 70.5 95.9 73.9 82.7 76.9 68.7 76.4
DFN [50] 79.3 - - - - - - - - - - - - - - - - - - -
PSANet [57] 80.1 - - - - - - - - - - - - - - - - - - -
DenseASPP [47] 80.6 98.7 87.1 93.4 60.7 62.7 65.6 74.6 78.5 93.6 72.5 95.4 86.2 71.9 96.0 78.0 90.3 80.7 69.7 76.8
GloRe [10] 80.9 - - - - - - - - - - - - - - - - - - -
DANet [18] 81.5 98.6 86.1 93.5 56.1 63.3 69.7 77.3 81.3 93.9 72.9 95.7 87.3 72.9 96.2 76.8 89.4 86.5 72.2 78.2
SpyGR 81.6 98.7 86.9 93.6 57.6 62.8 70.3 78.7 81.7 93.8 72.4 95.6 88.1 74.5 96.2 73.6 88.8 86.3 72.1 79.2
Table 2: Per-class results on the Cityscapes test set. Best results are marked in bold and the second best are underlined. SpyGR achieves the highest overall performance and leads in most categories.

Implementation Details. We use ResNet [20] (pretrained on ImageNet [14]) as our backbone. We use a convolution to reduce the channel number from 2048 to 512, and then stack the SpyGR layer on top. We set $d$ to 64 in all our experiments. Following prior works [56, 5, 6], we employ a polynomial learning rate policy where the initial learning rate is multiplied by $(1 - \frac{\mathrm{iter}}{\mathrm{max\_iter}})^{0.9}$ after each iteration. Momentum and weight decay coefficients are set to 0.9 and 0.0001, respectively, and the base learning rate is set to 0.009 for all datasets. For data augmentation, we apply the common scaling, cropping and flipping strategies, with crop sizes set per dataset. Synchronized batch normalization is adopted in all experiments, together with the multi-grid scheme [6]. For evaluation, we use the mean IoU metric, as is common practice. We down-sample three times, so our pyramid has four levels.
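For reference, the poly schedule above can be expressed with PyTorch's standard `LambdaLR`; this is a sketch assuming the scheduler is stepped once per iteration, with `max_iter` equal to the total number of training iterations.

```python
from torch.optim.lr_scheduler import LambdaLR

def poly_lr_scheduler(optimizer, max_iter, power=0.9):
    # lr(it) = base_lr * (1 - it / max_iter) ** power, stepped once per iteration
    return LambdaLR(optimizer, lambda it: (1 - it / max_iter) ** power)
```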

Loss Function. We employ the standard cross entropy loss on both the final output of our model and the intermediate feature map output from res4b22. We set the weight of the final loss to 1 and that of the auxiliary loss to 0.4, following the settings in PSPNet [56].
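A sketch of this weighted loss, assuming the network returns both the final logits and the auxiliary logits from res4b22; the ignore label 255 is a common Cityscapes convention, not stated in the paper.

```python
import torch.nn.functional as F

def segmentation_loss(final_logits, aux_logits, target, aux_weight=0.4):
    # Cross entropy on the final output plus a 0.4-weighted auxiliary loss;
    # ignore_index=255 is an assumed void label, following common practice
    main = F.cross_entropy(final_logits, target, ignore_index=255)
    aux = F.cross_entropy(aux_logits, target, ignore_index=255)
    return main + aux_weight * aux
```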

4.2 Results on Cityscapes

We first compare our method with existing methods on the Cityscapes test set. To compare fairly with others, we train SpyGR upon ResNet-101 with an output stride of 8. Note that we train only on the fine-annotated data. We adopt the OHEM scheme [42] for the final loss, and train the model for 80K iterations with a mini-batch size of 8. For testing, we adopt multi-scale (0.75, 1.0, 1.25, 1.5, 1.75, 2.0) inference and flipping, and then submit the predictions to the official evaluation server. Results are shown in Table 2. SpyGR shows superiority in most categories. It outperforms GloRe [10], the latest graph convolutional network (GCN) based model, by 0.7 mIoU. Moreover, SpyGR even outperforms DANet, a recently proposed self-attention based model whose computational overhead and memory requirements are much higher than our method's, as shown in Table 1.

4.3 Comparisons with DeepLabV3

DeepLabV3 [6] and DeepLabV3+ [8] report their results on Cityscapes by training on the fine+coarse set. To show the effectiveness of our proposed method over them, we conduct detailed comparisons on both Cityscapes and PASCAL VOC. As shown in Table 3, SpyGR consistently gains at least 1 point of mIoU over DeepLabV3. The advantages of SpyGR over DeepLabV3+ are more significant on PASCAL VOC than on Cityscapes.

4.4 Results on COCO Stuff

For the COCO Stuff dataset, we train SpyGR with an output stride of 8 and a mini-batch size of 12. We train for 30K iterations on the COCO Stuff training set, around 40 epochs, which is much shorter than DANet's 240 epochs. Multi-scale input and flipping are used for testing. The comparison on the COCO Stuff dataset is shown in Table 4. As on the other datasets, SpyGR outperforms the other methods on COCO Stuff. It is comparable with DANet, and shows a significant superiority over SGR.

4.5 Results on PASCAL Context

We carry out experiments on the PASCAL Context dataset to further evaluate our proposed SpyGR. We train our model with a mini-batch size of 16 and an output stride of 16, and run inference with an output stride of 8. To make SpyGR operate with the same stride during both training and inference, we upsample C5 from ResNet-101 and concatenate it with C3, which has an output stride of 8. A convolution is appended over the concatenation of C3 and C5, and our SpyGR layer is added on top. We optimize the whole network on the PASCAL Context training set for 15K iterations, around 48 epochs; by comparison, DANet trains for 240 epochs, around five times as long. For evaluation on the test set, we adopt multi-scale and flipping augmentations. The experimental results on PASCAL Context are shown in Table 5. Even SpyGR with a ResNet-50 backbone achieves performance comparable to SGR on ResNet-101, and outperforms MSCI [33] on ResNet-152. Furthermore, SpyGR on ResNet-101 outperforms SGR+, even though SGR+ is pre-trained on the COCO Stuff dataset. Once again, SpyGR outperforms DANet by a small margin, but with much lower computational overhead and memory cost, and a significantly shorter training schedule.

Method | Cityscapes Val (SS) | Cityscapes Val (MS) | Cityscapes Test (+Coarse) | PASCAL VOC Val (SS) | PASCAL VOC Val (MS) | PASCAL VOC Test (Finetune)
DeepLabV3 78.3 79.3 81.3 78.5 79.8 -
DeepLabV3+ 79.6 80.2 82.1 79.4 80.4 83.3
SpyGR 79.9 80.5 82.3 80.2 81.2 84.2
Table 3: Comparisons with DeepLabV3. SS means single scale, MS denotes multi-scale. +Coarse means training on fine+coarse set. Finetune means finetuning on the trainval set. To be fair, all results of compared methods are tested on their newest implementations.
Method Backbone mIoU (%)
RefineNet [34] ResNet-101 33.6
CCL [15] ResNet-101 35.7
DANet [18] ResNet-50 37.2
DSSPN [31] ResNet-101 37.3
SpyGR ResNet-50 37.5
SGR [29] ResNet-101 39.1
DANet [18] ResNet-101 39.7
SpyGR ResNet-101 39.9
Table 4: The comparison on the COCO Stuff test set.

4.6 Ablation Studies

We conduct ablation studies to explore how each part of SpyGR contributes to the performance gain. We carry out all ablation experiments on Cityscapes over ResNet-50. For inference, we use only single-scale input images. The comparisons are listed in Table 6. We analyze each part of SpyGR as follows.

Simplest GCN. We first consider the case without the attention diagonal matrix. The similarity matrix reduces to:

$$A = \phi(X)\, \phi(X)^\top. \tag{13}$$

Removing the identity in the Laplacian, the propagation rule of graph reasoning in Eq. (9) becomes:

$$\tilde{X} = \sigma\big( D^{-1/2} A D^{-1/2} X W \big). \tag{14}$$

This simplest GCN brings an increase of 1.64 in mIoU.

Method Backbone mIoU (%)
PSPNet [56] ResNet-101 47.8
DANet [18] ResNet-50 50.1
MSCI [33] ResNet-152 50.3
SpyGR ResNet-50 50.3
SGR [29] ResNet-101 50.8
CCL [15] ResNet-101 51.6
EncNet [53] ResNet-101 51.7
SGR+ [29] ResNet-101 52.5
DANet [18] ResNet-101 52.6
SpyGR ResNet-101 52.8
Table 5: The comparison on the test set of PASCAL Context. ‘+’ means pretrained on COCO Stuff.
FCN | GCN | Λ | data-dep. Λ | Identity | Pyramid | mIoU
✓ | - | - | - | - | - | 76.34
✓ | ✓ | - | - | - | - | 77.98
✓ | ✓ | ✓ | - | - | - | 78.58
✓ | ✓ | ✓ | ✓ | - | - | 79.05
✓ | ✓ | ✓ | ✓ | ✓ | - | 79.42
✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 79.93
Table 6: Ablation experiments on the Cityscapes dataset.
Figure 3: Visualization of the similarity matrix for a randomly sampled location, marked with a green cross. The left two columns are input images and ground truths, respectively. The similarity matrices of the different scales in the pyramid are re-scaled to the same size and shown in the right four columns, from the coarsest to the finest (left to right). Multiple long-range contextual patterns are captured at different scales and aggregated at the finest level. Zoom in for a better view.
(a) Image
(b) FCN
(c) ASPP
(d) PSP
(e) SpyGR
(f) Label
Figure 4: Visualization comparison with other methods.

With data-independent $\Lambda$. Corresponding to Eq. (6), we now introduce a diagonal matrix $\Lambda$ into the inner product of $\phi(x_i)$ and $\phi(x_j)$ to obtain a better distance metric. However, we make the diagonal matrix feature-independent, i.e., a vector of parameters to learn. It outperforms the simplest GCN by 0.60. We can see that the diagonal matrix indeed provides a better distance metric with only a few trainable parameters, and leads to higher performance.

With data-dependent $\Lambda$. In this case, we calculate $A$ using Eq. (6), and the attention diagonal matrix becomes data-dependent through Eq. (7). This mechanism works in a way similar to soft attention. As a result, it brings a further gain of 0.47 mIoU over the data-independent case. This demonstrates that the attention diagonal matrix is more representative, and provides a better distance metric conditioned on the distribution of the input features.

Identity. We now recover the identity term in the Laplacian formulation, and calculate $\tilde{L}$ exactly following Eq. (5). The identity term also plays the role of a shortcut connection that facilitates the optimization of graph reasoning. The performance increases further, to 79.42 mIoU.

Spatial Pyramid. Finally, we organize the input feature as a spatial pyramid following Eq. (10), which enables capturing multiple long-range contextual patterns from different scales. It brings a further gain of 0.51 mIoU.

4.7 Analysis

To get a better sense of the effect of our spatial pyramid based graph reasoning, we visualize the similarity matrix at different scales on the Cityscapes dataset. Concretely, as shown in Figure 3, we randomly generate a sampling point, mark it with a green cross, and visualize the corresponding row of the similarity matrix $A$ as a heatmap. The right four columns show the similarity matrix from the coarsest level to the finest level. We observe that different long-range contextual patterns are captured across the spatial pyramid. For sampling points located on a car, the strongest activations of the four scales are distributed over different cars. These different long-range relationships are finally aggregated into the finest level for prediction. The same holds for other categories, such as sidewalk, bus and vegetation. For sampling points located on the boundary between two semantic categories, the interactions at different scales help assign the pixel to the right category. This analysis shows that our proposed spatial pyramid is able to aggregate rich semantic information and capture multiple long-range contextual patterns. We also show a visual comparison with other methods in Figure 4.

5 Conclusion

In this paper, we model long-range context with graph convolution for the semantic segmentation task. Different from current methods, we perform graph reasoning directly in the original feature space, organized as a spatial pyramid. We propose an improved Laplacian that is data-dependent, and introduce an attention diagonal matrix on the inner product to obtain a better distance metric. Our method gets rid of the projection and re-projection processes, and retains the spatial relationships that make the spatial pyramid possible. We adopt a computing scheme that significantly reduces the computational overhead. Our experiments show that each part of our design contributes to the performance gain, and we outperform other methods without introducing more computational or memory consumption.

6 Acknowledgement

Zhouchen Lin is supported by the National Natural Science Foundation (NSF) of China (grant nos. 61625301 and 61731018), the Major Scientific Research Project of Zhejiang Lab (grant nos. 2019KB0AC01 and 2019KB0AB02), the Beijing Academy of Artificial Intelligence, and Qualcomm. Hong Liu is supported by NSF China (grant no. U1613209) and NSF Shenzhen (grant no. JCYJ20190808182209321).

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. TPAMI 39 (12), pp. 2481–2495. Cited by: §2.
  • [2] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2013) Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1, §2, §3.1.
  • [3] H. Caesar, J. Uijlings, and V. Ferrari (2018) Coco-stuff: thing and stuff classes in context. In CVPR, pp. 1209–1218. Cited by: §4.1.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, Cited by: §2.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §1, §2, §2, §4.1, Table 2.
  • [6] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §1, §2, §4.1, §4.3.
  • [7] L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille (2016) Attention to scale: scale-aware semantic image segmentation. In CVPR, pp. 3640–3649. Cited by: §2.
  • [8] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 801–818. Cited by: §4.3.
  • [9] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018) A²-nets: double attention networks. In NIPS, pp. 350–359. Cited by: §1, §2, Table 1.
  • [10] Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis (2018) Graph-based global reasoning networks. arXiv preprint arXiv:1811.12814. Cited by: §1, §1, §2, §3.1, §3.2, §3.4, Table 1, §4.2, Table 2.
  • [11] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §4.1.
  • [12] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §1, §2.
  • [13] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852. Cited by: §2, §3.1.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §4.1.
  • [15] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, pp. 2393–2402. Cited by: Table 4, Table 5.
  • [16] H. Fan, P. Chu, L. J. Latecki, and H. Ling (2018) Scene parsing via dense recurrent neural networks with attentional selection. arXiv preprint arXiv:1811.04778. Cited by: §1, §2.
  • [17] C. Farabet, C. Couprie, L. Najman, and Y. LeCun (2013) Learning hierarchical features for scene labeling. TPAMI 35 (8), pp. 1915–1929. Cited by: §2.
  • [18] J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu (2018) Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983. Cited by: §2, Table 1, Table 2, Table 4, Table 5.
  • [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: §2.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §4.1.
  • [21] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163. Cited by: §1, §3.2, §3.2.
  • [22] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §3.2.
  • [23] T. Ke, J. Hwang, Z. Liu, and S. X. Yu (2018) Adaptive affinity fields for semantic segmentation. In ECCV, pp. 587–602. Cited by: Table 2.
  • [24] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §1, §2, §3.1.
  • [25] S. Kong and C. C. Fowlkes (2018) Recurrent scene parsing with perspective understanding in the loop. In CVPR, pp. 956–965. Cited by: Table 2.
  • [26] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Cited by: §1, §2.
  • [27] R. Li, S. Wang, F. Zhu, and J. Huang (2018) Adaptive graph convolutional neural networks. In AAAI, Cited by: §2.
  • [28] Y. Li and A. Gupta (2018) Beyond grids: learning graph representations for visual recognition. In NIPS, pp. 9245–9255. Cited by: §1, §1, §2, §3.1, §3.2, §3.4.
  • [29] X. Liang, Z. Hu, H. Zhang, L. Lin, and E. P. Xing (2018) Symbolic graph reasoning meets convolutions. In NIPS, pp. 1858–1868. Cited by: §1, §1, §2, §3.1, §3.2, §3.4, Table 1, Table 4, Table 5.
  • [30] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan (2016) Semantic object parsing with local-global long short-term memory. In CVPR, pp. 3185–3193. Cited by: §2.
  • [31] X. Liang, H. Zhou, and E. Xing (2018) Dynamic-structured semantic propagation network. In CVPR, pp. 752–761. Cited by: Table 4.
  • [32] R. Liao, Z. Zhao, R. Urtasun, and R. S. Zemel (2019) LanczosNet: multi-scale deep graph convolutional networks. In ICLR, Cited by: §1.
  • [33] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang (2018) Multi-scale context intertwining for semantic segmentation. In ECCV, pp. 603–619. Cited by: §4.5, Table 5.
  • [34] G. Lin, A. Milan, C. Shen, and I. Reid (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In CVPR, pp. 1925–1934. Cited by: §2, Table 2, Table 4.
  • [35] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §1, §2, §3.3.
  • [36] W. Liu, A. Rabinovich, and A. C. Berg (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §2.
  • [37] Z. Liu, X. Li, P. Luo, C. Loy, and X. Tang (2015) Semantic image segmentation via deep parsing network. In ICCV, pp. 1377–1385. Cited by: §2.
  • [38] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, pp. 3431–3440. Cited by: §1, §2.
  • [39] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In CVPR, pp. 891–898. Cited by: §4.1.
  • [40] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In ICCV, pp. 1520–1528. Cited by: §2.
  • [41] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §2.
  • [42] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR, pp. 761–769. Cited by: §4.2.
  • [43] B. Shuai, Z. Zuo, B. Wang, and G. Wang (2018) Scene segmentation with dag-recurrent neural networks. TPAMI 40 (6), pp. 1480–1493. Cited by: §1, §2.
  • [44] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §2.
  • [45] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell (2018) Understanding convolution for semantic segmentation. In WACV, pp. 1451–1460. Cited by: Table 2.
  • [46] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §1, §2, Table 1.
  • [47] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018) Denseaspp for semantic segmentation in street scenes. In CVPR, pp. 3684–3692. Cited by: Table 2.
  • [48] Y. Yang, Z. Zhong, T. Shen, and Z. Lin (2018-06) Convolutional neural networks with alternately updated clique. In CVPR, Cited by: §1.
  • [49] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec (2018) Hierarchical graph representation learning with differentiable pooling. In NIPS, pp. 4805–4815. Cited by: §1, §2, §2, §3.1.
  • [50] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. In CVPR, pp. 1857–1866. Cited by: Table 2.
  • [51] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In ICLR, Cited by: §1, §1.
  • [52] M. D. Zeiler, G. W. Taylor, R. Fergus, et al. (2011) Adaptive deconvolutional networks for mid and high level feature learning.. In ICCV, Vol. 1, pp. 6. Cited by: §2.
  • [53] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In CVPR, pp. 7151–7160. Cited by: §2, Table 5.
  • [54] M. Zhang, Z. Cui, M. Neumann, and Y. Chen (2018) An end-to-end deep learning architecture for graph classification. In AAAI, Cited by: §2, §3.1.
  • [55] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan (2017) Scale-adaptive convolutions for scene parsing. In ICCV, pp. 2031–2039. Cited by: Table 2.
  • [56] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 2881–2890. Cited by: §1, §2, §2, §4.1, §4.1, Table 2, Table 5.
  • [57] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia (2018) Psanet: point-wise spatial attention network for scene parsing. In ECCV, pp. 267–283. Cited by: §2, Table 2.
  • [58] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr (2015) Conditional random fields as recurrent neural networks. In ICCV, pp. 1529–1537. Cited by: §2.