Strip Pooling: Rethinking Spatial Pooling for Scene Parsing

March 30, 2020 · Qibin Hou et al.

Spatial pooling has been proven highly effective in capturing long-range contextual information for pixel-wise prediction tasks, such as scene parsing. In this paper, going beyond conventional spatial pooling, which usually has a regular shape of N×N, we rethink the formulation of spatial pooling by introducing a new pooling strategy, called strip pooling, which considers a long but narrow kernel, i.e., 1×N or N×1. Based on strip pooling, we further investigate spatial pooling architecture design by (1) introducing a new strip pooling module that enables backbone networks to efficiently model long-range dependencies, (2) presenting a novel building block with diverse spatial pooling as its core, and (3) systematically comparing the performance of the proposed strip pooling with conventional spatial pooling techniques. Both novel pooling-based designs are lightweight and can serve as efficient plug-and-play modules in existing scene parsing networks. Extensive experiments on popular benchmarks (e.g., ADE20K and Cityscapes) demonstrate that our simple approach establishes new state-of-the-art results. Code is made available at https://github.com/Andrew-Qibin/SPNet.

1 Introduction

Scene parsing, also known as semantic segmentation, aims to assign a semantic label to each pixel in an image. As one of the most fundamental vision tasks, it has been applied in a wide range of computer vision and graphics applications [ChengSurveyVM2017], such as autonomous driving [teichmann2018multinet], medical diagnosis [ronneberger2015u], image/video editing [murali2019single, le2019object], salient object detection [BorjiCVM2019], and aerial image analysis [maggiori2017high]. Recently, methods based on fully convolutional networks (FCNs) [long2015fully, chen2017deeplab] have made extraordinary progress in scene parsing thanks to their ability to capture high-level semantics. However, these approaches mostly stack local convolution and pooling operations, and thus can hardly cope well with complex scenes containing many different categories, due to their limited effective fields-of-view [zhao2016pyramid, huang2018ccnet].

One way to improve the modeling of long-range dependencies in CNNs is to adopt self-attention or non-local modules [wang2018non, huang2018ccnet, chen2016attention, ren2017end, hong2016learning, yang2014context, zhao2018psanet, zhang2020dynamic, zhang2019dual, li2019global]. However, these modules notoriously consume huge amounts of memory because they compute a large affinity matrix at each spatial position. Other methods for long-range context modeling include dilated convolutions [chen2017deeplab, chen2018encoder, chen2017rethinking, yu2015multi], which widen the receptive fields of CNNs without introducing extra parameters, and global/pyramid pooling [lazebnik2006beyond, zhao2016pyramid, he2019adaptive, chen2017deeplab, chen2018encoder, yang2018denseaspp], which summarizes global clues of the image. However, a common limitation of these methods, including dilated convolutions and pooling, is that they all probe the input feature map within square windows. This limits their flexibility in capturing the anisotropic context that widely exists in realistic scenes. For instance, the target objects may have a long-range banded structure (e.g., the grassland in Figure 1b) or be distributed discretely (e.g., the pillars in Figure 1a). Using large square pooling windows cannot solve this problem well because they inevitably incorporate contaminating information from irrelevant regions [he2019adaptive].

In this paper, to more efficiently and effectively capture long-range dependencies, we exploit spatial pooling for enlarging the receptive fields of CNNs and collecting informative contexts, and present the concept of strip pooling. As an alternative to global pooling, strip pooling offers two advantages. First, it deploys a long kernel shape along one spatial dimension and hence enables capturing long-range relations of isolated regions, as shown in the top part of Figures 1a and 1c. Second, it keeps a narrow kernel shape along the other spatial dimension, which facilitates capturing local context and prevents irrelevant regions from interfering with the label prediction. Integrating such long but narrow pooling kernels enables scene parsing networks to aggregate both global and local context simultaneously. This is essentially different from traditional spatial pooling, which collects context from a fixed square region.


Figure 1: Illustration of how strip pooling and conventional spatial pooling work differently for scene parsing. From top to bottom: strip pooling; conventional spatial pooling; ground-truth annotations; our results with conventional spatial pooling only; our results with strip pooling. As shown in the top row, compared to conventional spatial pooling (green grids), strip pooling uses a band-shaped kernel (red grids) and can hence capture long-range dependencies between discretely distributed regions (yellow bounding boxes).

Based on the strip pooling operation, we present two pooling-based modules for scene parsing networks. First, we design a Strip Pooling Module (SPM) to effectively enlarge the receptive field of the backbone. More concretely, the SPM consists of two pathways, which encode long-range context along the horizontal and vertical spatial dimensions, respectively. For each spatial location in the pooled map, it encodes global horizontal and vertical information and then uses these encodings to reweight the location for feature refinement. Furthermore, we present a novel add-on residual building block, called the Mixed Pooling Module (MPM), to further model long-range dependencies at a high semantic level. It gathers informative contextual information by exploiting pooling operations with different kernel shapes to probe images with complex scenes. To demonstrate the effectiveness of the proposed pooling-based modules, we present SPNet, which incorporates both modules into the ResNet [He2016] backbone. Experiments show that SPNet establishes new state-of-the-art results on popular scene parsing benchmarks.

The contributions of this work are as follows: (i) We investigate the conventional design of spatial pooling and present the concept of strip pooling, which inherits the merits of global average pooling in collecting long-range dependencies while also focusing on local details. (ii) We design a Strip Pooling Module and a Mixed Pooling Module based on strip pooling. Both modules are lightweight and can serve as efficient add-on blocks that plug into any backbone network to generate high-quality segmentation predictions. (iii) We present SPNet, which integrates the above two pooling-based modules into a single architecture; it achieves significant improvements over the baselines and establishes new state-of-the-art results on widely-used scene parsing benchmarks.

2 Related Work

Current state-of-the-art scene parsing (or semantic segmentation) methods mostly leverage convolutional neural networks (CNNs). However, the receptive fields of CNNs grow slowly when stacking local convolution or pooling operators, which hampers them from taking enough useful contextual information into account. Early techniques for modeling contextual relationships in scene parsing involve conditional random fields (CRFs) [krahenbuhl2012efficient, vemulapalli2016gaussian, arnab2016higher, zheng2015conditional]. CRFs are mostly modeled in the discrete label space and are computationally expensive; they are therefore now less successful at producing state-of-the-art scene parsing results, even though they have been integrated into CNNs.

For learning in the continuous feature space, prior works use multi-scale feature aggregation [long2015fully, chen2017deeplab, lin2016efficient, hariharan2015hypercolumns, noh2015learning, lin2018multi, lin2017refinenet, badrinarayanan2017segnet, peng2017large, bulo2017loss, tian2019decoders, pami20Res2net] to fuse contextual information by probing the incoming features with filters or pooling operations at multiple rates and multiple fields-of-view. DeepLab [chen2017deeplab, chen2017rethinking] and its follow-ups [chen2018encoder, yang2018denseaspp, mehta2018espnet] adopt dilated convolutions and fuse features computed at different dilation rates to increase the receptive field of the network. Aggregating non-local context [liu2017learning, yuan2018ocnet, li2019expectation, ding2018context, chen2016attention, ren2017end, hong2016learning, yang2014context, zhao2018psanet, huang2018ccnet, ding2019semantic] has also proven effective for scene parsing.

Another line of research on enlarging the receptive field is spatial pyramid pooling [zhao2016pyramid, he2019adaptive]. By adopting a set of parallel pooling operations with a unique kernel size at each pyramid level, the network is able to capture large-range context, and this has shown promise on several scene parsing benchmarks. However, its ability to exploit contextual information is limited because only square kernel shapes are applied. Moreover, spatial pyramid pooling is only modularized on top of the backbone network, which makes it inflexible and not directly applicable inside network building blocks for feature learning. In contrast, our proposed strip pooling module and mixed pooling module adopt pooling kernels of size 1×N or N×1, and both can be plugged into and stacked within existing networks. This difference enables the network to exploit rich contextual relationships in each of the proposed building blocks. The proposed modules prove to be considerably more powerful and adaptable than spatial pyramid pooling in our experiments.

3 Methodology

Figure 2: Schematic illustration of the Strip Pooling (SP) module.

In this section, we first give the concept of strip pooling and then introduce two model designs based on strip pooling to demonstrate how it improves scene parsing networks. Finally, we describe the entire architecture of the proposed scene parsing network augmented by strip pooling.

3.1 Strip Pooling

Before describing the formulation of strip pooling, we first briefly review the average pooling operation.

Standard Spatial Average Pooling: Let $x \in \mathbb{R}^{H \times W}$ be a two-dimensional input tensor, where $H$ and $W$ are the spatial height and width, respectively. In an average pooling layer, a spatial extent of the pooling $(h \times w)$ is required. Consider a simple case where $h$ divides $H$ and $w$ divides $W$. Then the output $y$ after pooling is also a two-dimensional tensor, with height $H_o = H/h$ and width $W_o = W/w$. Formally, the average pooling operation can be written as

$$y_{i,j} = \frac{1}{h \times w} \sum_{0 \le m < h} \sum_{0 \le n < w} x_{i \times h + m,\; j \times w + n}, \tag{1}$$

where $0 \le i < H_o$ and $0 \le j < W_o$. In Eqn. (1), each spatial location of $y$ corresponds to a pooling window of size $h \times w$. The above pooling operation has been successfully applied in previous work [zhao2016pyramid, he2019adaptive] to collect long-range context. However, it may unavoidably incorporate many irrelevant regions when processing objects with irregular shapes, as shown in Figure 1.

Strip Pooling: To alleviate the above problem, we present the concept of 'strip pooling', which uses a band-shaped pooling window to perform pooling along either the horizontal or the vertical dimension, as shown in the top row of Figure 1. Mathematically, given the two-dimensional tensor $x \in \mathbb{R}^{H \times W}$, strip pooling requires a spatial extent of pooling of $(H, 1)$ or $(1, W)$. Unlike two-dimensional average pooling, strip pooling averages all the feature values in a row or a column. Thus, the output $y^h \in \mathbb{R}^{H}$ after horizontal strip pooling can be written as

$$y^h_i = \frac{1}{W} \sum_{0 \le j < W} x_{i,j}. \tag{2}$$

Similarly, the output $y^v \in \mathbb{R}^{W}$ after vertical strip pooling can be written as

$$y^v_j = \frac{1}{H} \sum_{0 \le i < H} x_{i,j}. \tag{3}$$

Given the horizontal and vertical strip pooling layers, it is easy to build long-range dependencies between discretely distributed regions and to encode regions with banded shapes, thanks to the long, narrow kernel shape. Meanwhile, strip pooling also captures local details, owing to its narrow kernel along the other dimension. These properties make the proposed strip pooling different from conventional spatial pooling, which relies on square-shaped kernels. In the following, we describe how to leverage strip pooling (Eqns. (2) and (3)) to improve scene parsing networks.
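To make the operation concrete, below is a minimal PyTorch sketch of Eqns. (2) and (3) applied to a batched feature map; the variable names and example shapes are ours for illustration, not taken from the released SPNet code.

```python
import torch
import torch.nn.functional as F

# Hypothetical input feature map in PyTorch's (N, C, H, W) layout.
x = torch.randn(1, 64, 32, 48)
h, w = x.size(2), x.size(3)

# Horizontal strip pooling (Eqn. 2): average over the width,
# producing one value per row -> shape (1, 64, 32, 1).
y_h = F.adaptive_avg_pool2d(x, output_size=(h, 1))

# Vertical strip pooling (Eqn. 3): average over the height,
# producing one value per column -> shape (1, 64, 1, 48).
y_v = F.adaptive_avg_pool2d(x, output_size=(1, w))
```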

3.2 Strip Pooling Module

It has been demonstrated in previous work [chen2018encoder, fu2019dual] that enlarging the receptive field of the backbone network is beneficial to scene parsing. Motivated by this fact, in this subsection we introduce an effective way to help backbone networks capture long-range context by exploiting strip pooling. In particular, we present a novel Strip Pooling Module (SPM), which leverages both horizontal and vertical strip pooling operations to gather long-range context from the two spatial dimensions. Figure 2 depicts the proposed SPM. Let $x \in \mathbb{R}^{C \times H \times W}$ be an input tensor, where $C$ denotes the number of channels. We first feed $x$ into two parallel pathways, each of which contains a horizontal or vertical strip pooling layer followed by a 1D convolutional layer with kernel size 3 that modulates the current location together with its neighbors. This gives $y^h \in \mathbb{R}^{C \times H}$ and $y^v \in \mathbb{R}^{C \times W}$. To obtain an output that contains more useful global priors, we first combine $y^h$ and $y^v$, yielding $y \in \mathbb{R}^{C \times H \times W}$:

$$y_{c,i,j} = y^h_{c,i} + y^v_{c,j}. \tag{4}$$

Then, the output $z$ is computed as

$$z = \mathrm{Scale}\big(x, \sigma(f(y))\big), \tag{5}$$

where $\mathrm{Scale}(\cdot,\cdot)$ refers to element-wise multiplication, $\sigma$ is the sigmoid function, and $f$ is a $1 \times 1$ convolution. It should be noted that there are multiple ways to combine the features extracted by the two strip pooling layers, such as computing the inner product between the two extracted 1D feature vectors. However, to keep the SPM efficient and lightweight, we adopt the operations described above, which we find still work well.

In the above process, each position in the output tensor is allowed to build relationships with a variety of positions in the input tensor. For example, in Figure 2, the square bounded by the black box in the output tensor is connected to all locations that share its horizontal or vertical coordinate (enclosed by the red and purple boxes). Therefore, by repeating the above aggregation process a few times, it is possible to build long-range dependencies over the whole scene. Moreover, benefiting from the element-wise multiplication, the proposed SPM can also be considered an attention mechanism and can be applied directly to any pretrained backbone network without training from scratch.

Compared to global average pooling, strip pooling considers long but narrow ranges instead of the whole feature map, avoiding building most unnecessary connections between locations that are far apart. Compared to attention-based modules [fu2019dual, he2019adaptive] that require a large amount of computation to build relationships between each pair of locations, our SPM is lightweight and can easily be embedded into any building block to improve the capability of capturing long-range spatial dependencies and exploiting inter-channel dependencies. We provide more analysis of our approach against existing attention-based methods later.
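The computation above is compact enough to express directly. The following is a hedged PyTorch sketch of the SPM under our own simplifications (no batch normalization, shared channel counts throughout); the official repository may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StripPoolingModule(nn.Module):
    """Sketch of the SPM (Eqns. 2-5); a simplified reading, not the official code."""

    def __init__(self, channels: int):
        super().__init__()
        # 1D convolutions (kernel size 3) along each strip, one per pathway.
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv_v = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        # The 1x1 convolution f(.) used before the sigmoid gate in Eqn. (5).
        self.f = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        h, w = x.size(2), x.size(3)
        # Horizontal / vertical strip pooling followed by 1D convolutions.
        y_h = self.conv_h(F.adaptive_avg_pool2d(x, (h, 1)))  # (N, C, H, 1)
        y_v = self.conv_v(F.adaptive_avg_pool2d(x, (1, w)))  # (N, C, 1, W)
        # Eqn. (4): broadcasting expands both strips back to an H x W map.
        y = y_h + y_v
        # Eqn. (5): z = Scale(x, sigmoid(f(y))) -- an attention-style gate.
        return x * torch.sigmoid(self.f(y))

# Usage: refine a feature map, e.g. inside a backbone stage.
spm = StripPoolingModule(64)
z = spm(torch.randn(2, 64, 32, 48))  # output has the same shape as the input
```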

3.3 Mixed Pooling Module

The pyramid pooling module (PPM) has proven to be an effective way to enhance scene parsing networks [zhao2016pyramid]. However, the PPM relies heavily on standard spatial pooling operations (albeit with different pooling kernels at different pyramid levels), so it still suffers from the limitations analyzed in Section 3.1. Taking into account the advantages of both standard spatial pooling and the proposed strip pooling, we improve upon the PPM and design a Mixed Pooling Module (MPM), which focuses on aggregating different types of contextual information via various pooling operations to make the feature representations more discriminative.

The proposed MPM consists of two sub-modules that simultaneously capture short-range and long-range dependencies among different locations, both of which we find essential for scene parsing networks. For long-range dependencies, unlike previous works [zhang2018context, zhao2016pyramid, chen2018encoder] that use a global average pooling layer, we propose to gather such clues by employing both horizontal and vertical strip pooling operations; a simplified diagram can be found in Figure 3(b). As analyzed in Section 3.2, strip pooling makes it possible to connect regions distributed discretely over the whole scene and to encode regions with banded structures. However, for cases where semantic regions are distributed closely, spatial pooling is still necessary to capture local contextual information. Taking this into account, as depicted in Figure 3(a), we adopt a lightweight pyramid pooling sub-module for short-range dependency collection. It has two spatial pooling layers followed by convolutional layers for multi-scale feature extraction, plus a 2D convolutional layer that preserves the original spatial information; the two pooled feature maps use different bin sizes. All three sub-paths are then combined by summation.

Based on the above two sub-modules, we nest them into residual blocks [He2016] with a bottleneck structure for parameter reduction and modular design. Specifically, before each sub-module, a 1×1 convolutional layer is first used for channel reduction. The outputs of the two sub-modules are concatenated and then fed into another 1×1 convolutional layer for channel expansion, as done in [He2016]. Note that all convolutional layers, aside from the ones for channel reduction and expansion, have kernel size 3×3, or 3 for the 1D convolutional layers.

It is worth emphasizing that, unlike the spatial pyramid pooling modules [zhao2016pyramid, chen2018encoder], the proposed MPM is a modularized design. The advantage is that it can easily be used sequentially, expanding the role of the long-range dependency collection sub-module. We find that with the same backbone, our network with only two MPMs (around 1/3 of the parameters of the original PPM [zhao2016pyramid]) performs even better than PSPNet. We provide more results and analysis on this in the experiment section, as well as a sketch of the module below.
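For concreteness, here is a hedged PyTorch sketch of one MPM under stated assumptions: the bin sizes (2, 4) are placeholders for the elided values in the text above, and batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedPoolingModule(nn.Module):
    """Sketch of the MPM as a residual bottleneck. Bin sizes and the exact
    fusion order are assumptions; BatchNorm/ReLU placement is simplified
    relative to the paper's Figure 3."""

    def __init__(self, in_c: int, mid_c: int, bins=(2, 4)):
        super().__init__()
        self.bins = bins
        # 1x1 channel reduction in front of each sub-module.
        self.reduce_s = nn.Conv2d(in_c, mid_c, 1)
        self.reduce_l = nn.Conv2d(in_c, mid_c, 1)
        # Short-range branch: identity path + one 3x3 conv per pyramid level.
        self.conv_s = nn.ModuleList(
            nn.Conv2d(mid_c, mid_c, 3, padding=1) for _ in range(len(bins) + 1))
        # Long-range branch: 1D convs after the two strip pooling layers.
        self.conv_h = nn.Conv2d(mid_c, mid_c, (3, 1), padding=(1, 0))
        self.conv_v = nn.Conv2d(mid_c, mid_c, (1, 3), padding=(0, 1))
        # Post-fusion 3x3 convs to reduce aliasing (cf. the Figure 3 caption).
        self.post_s = nn.Conv2d(mid_c, mid_c, 3, padding=1)
        self.post_l = nn.Conv2d(mid_c, mid_c, 3, padding=1)
        # 1x1 channel expansion after concatenating the two branches.
        self.expand = nn.Conv2d(2 * mid_c, in_c, 1)

    def forward(self, x):
        h, w = x.size(2), x.size(3)
        # Short-range dependencies (SRD): pyramid pooling at small bin sizes.
        s = self.reduce_s(x)
        out_s = self.conv_s[0](s)  # path preserving the original resolution
        for conv, b in zip(self.conv_s[1:], self.bins):
            pooled = conv(F.adaptive_avg_pool2d(s, b))
            out_s = out_s + F.interpolate(pooled, (h, w), mode='bilinear',
                                          align_corners=False)
        out_s = self.post_s(F.relu(out_s))
        # Long-range dependencies (LRD): horizontal + vertical strip pooling.
        l = self.reduce_l(x)
        y_h = self.conv_h(F.adaptive_avg_pool2d(l, (h, 1)))
        y_v = self.conv_v(F.adaptive_avg_pool2d(l, (1, w)))
        out_l = self.post_l(F.relu(y_h + y_v))  # broadcast back to H x W
        # Concatenate, expand channels, and add the residual connection.
        return x + self.expand(torch.cat([out_s, out_l], dim=1))
```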

Figure 3: (a) Short-range dependency aggregation sub-module. (b) Long-range dependency aggregation sub-module. Inspired by [lin2017feature, liu2019simple], a convolutional layer is added after the fusion operation in each sub-module to reduce the aliasing effect brought by down-sampling operations.

3.4 Overall Architecture

Based on the proposed SPM and MPM, we introduce the overall architecture, called SPNet, in this subsection. We adopt classic residual networks [He2016] as our backbones. Following [chen2017deeplab, zhao2016pyramid, fu2019dual], we improve the original ResNet with the dilation strategy, so the final feature map size is 1/8 of the input image. SPMs are added after the convolutional layer of the last building block in each stage, as well as after all building blocks of the last stage. All convolutional layers in an SPM have the same number of channels as the input tensor.

For the MPM, we build it directly on top of the backbone network thanks to its modular design. Since the output of the backbone has 2048 channels, we first attach a 1×1 convolutional layer to reduce the output channels from 2048 to 1024 and then add two MPMs. In each MPM, following [He2016], all convolutional layers with kernel size 3×3 or 3 have 256 channels (i.e., a reduction rate of 1/4 is used). A convolutional layer is added at the end to predict the segmentation map.
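Putting the pieces together, the parsing head described above might look like the following sketch; `MixedPoolingModule` is our sketch from Section 3.3, and the class count of 150 (ADE20K) is just an example.

```python
import torch.nn as nn

# Hedged sketch of the SPNet head: 1x1 channel reduction, two stacked MPMs,
# and a final classifier convolution. The SPMs live inside the backbone itself.
num_classes = 150  # e.g., ADE20K
head = nn.Sequential(
    nn.Conv2d(2048, 1024, kernel_size=1),         # 2048 -> 1024 channels
    MixedPoolingModule(1024, 256),                # 1/4 bottleneck reduction
    MixedPoolingModule(1024, 256),
    nn.Conv2d(1024, num_classes, kernel_size=1),  # per-pixel class scores
)
```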

4 Experiments

We evaluate the proposed SPM and MPM on popular scene parsing datasets, including ADE20K [zhou2017scene], Cityscapes [cordts2016cityscapes], and Pascal Context [mottaghi2014role]. We also conduct a comprehensive ablation analysis of the proposed strip pooling on the ADE20K dataset, as done in [zhao2016pyramid].

4.1 Experimental Setup

Our network is implemented with two public toolboxes [semseg2019, encoding2018] and PyTorch [paszke2019pytorch]. We use 4 GPUs for all experiments. The batch size is set to 8 for Cityscapes and 16 for the other datasets during training. Following most previous works [chen2017deeplab, zhao2016pyramid, zhang2018context], we adopt the 'poly' learning rate policy, in which the base learning rate is multiplied by $(1 - \frac{iter}{iter_{max}})^{power}$ during training. The base learning rate is set to 0.004 for the ADE20K and Cityscapes datasets and 0.001 for the Pascal Context dataset; the power is set to 0.9. The training epochs are as follows: ADE20K (120), Cityscapes (180), and Pascal Context (100). Momentum and weight decay are set to 0.9 and 0.0001, respectively. We use synchronized Batch Normalization during training, as done in [zhang2018context, zhao2016pyramid].

For data augmentation, similar to [zhao2016pyramid, zhang2018context], we randomly flip the input images, rescale them by a factor between 0.5 and 2, and finally crop them to a fixed size for Cityscapes and a different fixed size for the other datasets. By default, we report results under the standard evaluation metric, mean Intersection over Union (mIoU). For datasets whose test annotations are not publicly available, we obtain results from the official evaluation servers. For all experiments, we use the cross-entropy loss to optimize the models. Following [zhao2016pyramid], we use an auxiliary loss (connected to the last residual block of the fourth stage) with a loss weight of 0.4. We also report multi-scale results, i.e., averaging the segmentation probability maps from multiple image scales as in [lin2017refinenet, zhao2016pyramid, zhang2018context], to compare fairly with other approaches.
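As a small illustration, the 'poly' schedule above can be reproduced with a LambdaLR wrapper; the stand-in model and the iteration count below are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 150, 1)  # stand-in for SPNet, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.004,  # ADE20K base lr
                            momentum=0.9, weight_decay=1e-4)

max_iter, power = 100_000, 0.9  # hypothetical total iterations; power from the paper
# 'poly' policy: lr = base_lr * (1 - iter / max_iter) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** power)

# Call scheduler.step() once per training iteration, after optimizer.step().
```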

Settings | #Params | SPM | mIoU (%) | Pixel Acc. (%)
Base FCN | 27.7 M | | 37.63 | 77.60
Base FCN + PPM [zhao2016pyramid] | +21.0 M | | 41.68 | 80.04
Base FCN + 1 MPM | +4.4 M | | 40.50 | 79.60
Base FCN + 2 MPM | +8.8 M | | 41.92 | 80.03
Base FCN + 2 MPM | +11.9 M | ✓ | 44.03 | 80.65

Table 1: Ablation analysis of the number of mixed pooling modules (MPMs). 'SPM' indicates whether strip pooling modules are added. As can be seen, using more MPMs yields better results. All results are based on a ResNet-50 backbone and single-model testing; the best result is in the last row.

4.2 ADE20K

The ADE20K dataset [zhou2017scene] is one of the most challenging benchmarks, containing 150 classes and a variety of scenes with 1,038 image-level labels. We follow the official protocol to split the whole dataset. Like most previous works, we use both pixel-wise accuracy (Pixel Acc.) and mean Intersection over Union (mIoU) for evaluation. We also adopt multi-scale testing and use the averaged results for evaluation, following [lin2017refinenet, zhao2016pyramid]. For ablation experiments, we adopt ResNet-50 as the backbone, as done in [zhao2016pyramid]; when comparing with prior works, we use ResNet-101.

Settings | w/ SPM | mIoU (%) | Pixel Acc. (%)
Base FCN | | 37.63 | 77.60
Base FCN + 2 MPM (SRD only) | | 40.50 | 79.34
Base FCN + 2 MPM (LRD only) | | 41.14 | 79.64
Base FCN + 2 MPM (SRD + LRD) | | 41.92 | 80.03
Base FCN + 2 MPM (SRD + LRD) | ✓ | 44.03 | 80.65

Table 2: Ablation analysis of the mixed pooling module (MPM). 'SPM' refers to the strip pooling module. 'SRD' and 'LRD' denote the short-range and long-range dependency aggregation sub-modules, respectively. As can be seen, collecting both short-range and long-range dependencies is essential for better segmentation results. All results are based on single-model testing.

4.2.1 Ablation Studies

Number of MPMs: As stated in Section 3.3, the MPM is built on the bottleneck structure of residual blocks [He2016] and hence can easily be repeated multiple times to expand the role of strip pooling. Here, we investigate how many MPMs are needed to balance performance against runtime cost. Table 1 lists the results for different numbers of MPMs on the ResNet-50 backbone. With no MPM (the base FCN), we achieve 37.63% mIoU. With 1 MPM, we reach 40.50%, i.e., around a 3% improvement. Adding two MPMs to the backbone brings a gain of around 4.3%. However, adding more MPMs gives only a marginal further gain, possibly because the receptive field is already large enough. Considering the runtime cost, we therefore set the number of MPMs to 2 by default.

To show the advantages of the proposed MPM over the PPM [zhao2016pyramid], we also report the result and parameter count of PSPNet in Table 1. The 'Base FCN + 2 MPM' setting already performs better than PSPNet despite having about 12 M fewer parameters. This demonstrates that our modularized MPM design is considerably more effective than the PPM.

Effect of strip pooling in MPMs: As described in Section 3.3, the proposed MPM contains two sub-modules that collect short-range and long-range dependencies, respectively. Here, we ablate the importance of the proposed strip pooling; the corresponding results are shown in Table 2. Collecting long-range dependencies with strip pooling (41.14%) is clearly more effective than collecting only short-range dependencies (40.50%), and gathering both improves the results further (41.92%). To further demonstrate how strip pooling works in the MPM, we visualize feature maps at different positions of the MPM in Figure 5 and segmentation results under different MPM settings in Figure 4. The proposed strip pooling clearly collects long-range dependencies more effectively: for example, the feature map output by the long-range dependency aggregation sub-module (LRD) in the top row of Figure 5 accurately locates the sky. Global average pooling cannot do this because it encodes the whole feature map into a single value.

Effectiveness of SPMs: We empirically find that there is no need to add the proposed SPM to every building block of the backbone, despite its light weight. In this experiment, we consider the four scenarios listed in Table 3, taking the base FCN followed by 2 MPMs as the baseline. First, adding an SPM to the last building block in each stage gives an mIoU of 42.61%. Second, adding SPMs to all building blocks in the last stage yields a slightly lower 42.30%. Adding SPMs at both of the above positions yields an mIoU of 44.03%. However, adding SPMs to all building blocks of the backbone brings almost no further gain. Given these results, we add SPMs to the last building block of each stage and to all building blocks of the last stage by default. In addition, when we take only the base FCN as the baseline and add the proposed SPMs, the mIoU increases from 37.63% to 41.66%, an improvement of nearly 4%. All the above results indicate that adding SPMs to the backbone clearly benefits scene parsing networks.

(Figure 4 panels, left to right: (a) Image, (b) GT, (c) 2 SRD, (d) 2 LRD, (e) 2 MPM.)

Figure 4: Visual comparisons among different settings of the MP module (MPM). ‘2 SRD’ means we use 2 MPMs with only the short-range dependency aggregation module included and ‘2 LRD’ means we use 2 MPMs with only the long-range dependency aggregation module included.

(Figure 5 panels, left to right: (a) Image, (b) GT, (c) After VSP, (d) After HSP, (e) After LRD, (f) After SRD, (g) After MPM, (h) Results.)

Figure 5: Visualization of selected feature maps at different positions of the proposed MP module. VSP: vertical strip pooling; HSP: horizontal strip pooling; SRD: short-range dependency aggregation sub-module (Figure 3a); LRD: long-range dependency aggregation sub-module (Figure 3b); MPM: mixed pooling module.
Settings | SPM Position | #MPM | mIoU (%) | Pixel Acc. (%)
Base FCN | - | 2 | 41.92 | 80.03
Base FCN + SPM | L | 2 | 42.61 | 80.38
Base FCN + SPM | A | 2 | 42.30 | 80.22
Base FCN + SE [hu2018squeeze] | A + L | 2 | 41.34 | 80.05
Base FCN + SPM | A + L | 0 | 41.66 | 79.69
Base FCN + SPM | A + L | 2 | 44.03 | 80.65

Table 3: Ablation analysis of the strip pooling module (SPM). L: last building block in each stage. A: all building blocks in the last stage. As can be seen, SPMs alone largely improve the performance of the base FCN, from 37.63 to 41.66 mIoU.
Settings | Multi-Scale + Flip | mIoU (%) | Pixel Acc. (%)
SPNet-50 | | 44.03 | 80.65
SPNet-50 | ✓ | 45.03 | 81.32
SPNet-101 | | 44.52 | 81.37
SPNet-101 | ✓ | 45.60 | 82.09

Table 4: More ablation experiments with different backbone networks.

Strip pooling vs. global average pooling: To demonstrate the advantages of the proposed strip pooling over global average pooling, we replace the strip pooling operations in the SPM with global average pooling. Taking the base FCN followed by 2 MPMs as the baseline, adding SPMs raises the performance from 41.92% to 44.03%. However, replacing strip pooling with global average pooling, as done in [hu2018squeeze], drops the performance from 41.92% to 41.34%, even worse than the baseline, as shown in Table 3. This may be because directly fusing a feature map into a single 1D vector loses too much spatial information and hence introduces ambiguity, as pointed out in previous work [zhao2016pyramid].

More experimental analysis: Here we show the influence of different experimental settings on the performance, including the depth of the backbone network and multi-scale testing with flipping. As listed in Table 4, multi-scale testing with flipping largely improves the results for both backbones. Moreover, a deeper backbone also benefits the performance (ResNet-50: 45.03% vs. ResNet-101: 45.60%).

Visualization: In Figure 6, we show some visual results under different settings of the proposed approach. Obviously, adding either MPM or SPM to the base FCN can effectively improve the segmentation results. When both MPM and SPM are considered, the quality of the segmentation maps can be further enhanced.

Method | Backbone | mIoU (%) | Pixel Acc. (%) | Score
RefineNet [lin2017refinenet] | ResNet-152 | 40.70 | - | -
PSPNet [zhao2016pyramid] | ResNet-101 | 43.29 | 81.39 | 62.34
PSPNet [zhao2016pyramid] | ResNet-269 | 44.94 | 81.69 | 63.32
SAC [zhang2017scale] | ResNet-101 | 44.30 | 81.86 | 63.08
EncNet [zhang2018context] | ResNet-101 | 44.65 | 81.69 | 63.17
DSSPN [liang2018dynamic] | ResNet-101 | 43.68 | 81.13 | 62.41
UperNet [xiao2018unified] | ResNet-101 | 42.66 | 81.01 | 61.84
PSANet [zhao2018psanet] | ResNet-101 | 43.77 | 81.51 | 62.64
CCNet [huang2018ccnet] | ResNet-101 | 45.22 | - | -
APNB [zhu2019asymmetric] | ResNet-101 | 45.24 | - | -
APCNet [he2019adaptive] | ResNet-101 | 45.38 | - | -
SPNet (Ours) | ResNet-50 | 45.03 | 81.32 | 63.18
SPNet (Ours) | ResNet-101 | 45.60 | 82.09 | 63.85

Table 5: Comparison with the state of the art on the ADE20K validation set [zhou2017scene]. We report both mIoU and Pixel Acc. on this benchmark; the best results are those of SPNet (ResNet-101).

4.2.2 Comparison with the State-of-the-Arts

Here, we compare the proposed approach with previous state-of-the-art methods. The results can be found in Table 5. As can be seen, our approach with ResNet-50 as backbone reaches an mIoU score of 45.03% and pixel accuracy of 81.32%, which are already better than most of the previous methods. When taking ResNet-101 as our backbone, we achieve new state-of-the-art results in terms of both mIoU and pixel accuracy.

(Figure 6 panels, left to right: (a) Image, (b) GT, (c) Base FCN, (d) 1 MPM only, (e) 2 MPM only, (f) SPM only, (g) SPNet.)

Figure 6: Visual results of the proposed approach under different model settings.

4.3 Cityscapes

Cityscapes [cordts2016cityscapes] is another popular scene parsing dataset, containing 19 classes in total. It consists of 5K high-quality, pixel-annotated images collected from 50 cities in different seasons, all with a resolution of 2048×1024 pixels. Following previous work, the dataset is split into 2,975 training, 500 validation, and 1,525 test images.

For a fair comparison, we adopt ResNet-101 as the backbone network and compare our approach with existing methods on the test set. Following previous work [fu2019dual], we train our network using only the finely annotated data and submit the results to the official server. The results can be found in Table 6; the proposed approach clearly outperforms all other methods.

Method | Publication | Backbone | Test mIoU
SAC [zhang2017scale] | ICCV'17 | ResNet-101 | 78.1%
DUC-HDC [wang2018understanding] | WACV'18 | ResNet-101 | 80.1%
DSSPN [liang2018dynamic] | CVPR'18 | ResNet-101 | 77.8%
DepthSeg [kong2018recurrent] | CVPR'18 | ResNet-101 | 78.2%
DFN [yu2018learning] | CVPR'18 | ResNet-101 | 79.3%
DenseASPP [yang2018denseaspp] | CVPR'18 | DenseNet-161 | 80.6%
BiSeNet [yu2018bisenet] | ECCV'18 | ResNet-101 | 78.9%
PSANet [zhao2018psanet] | ECCV'18 | ResNet-101 | 80.1%
DANet [fu2019dual] | CVPR'19 | ResNet-101 | 81.5%
SPGNet [cheng2019spgnet] | ICCV'19 | ResNet-101 | 81.1%
APNB [zhu2019asymmetric] | ICCV'19 | ResNet-101 | 81.3%
CCNet [huang2018ccnet] | ICCV'19 | ResNet-101 | 81.4%
SPNet (Ours) | - | ResNet-101 | 82.0%

Table 6: Comparison with the state of the art on the Cityscapes test set [cordts2016cityscapes].

4.4 Pascal Context

The Pascal Context dataset [mottaghi2014role] has 59 categories and 10,103 densely annotated images, divided into 4,998 images for training and 5,015 for testing. Quantitative results can be found in Table 7. As can be seen, our approach performs considerably better than the other methods.

Method | Publication | Backbone | mIoU (%)
CRF-RNN [zheng2015conditional] | ICCV'15 | VGGNet | 39.3
BoxSup [dai2015boxsup] | ICCV'15 | VGGNet | 40.5
Piecewise [lin2016efficient] | CVPR'16 | VGGNet | 43.3
DeepLab-v2 [chen2017deeplab] | PAMI'17 | ResNet-101 | 45.7
RefineNet [lin2017refinenet] | CVPR'17 | ResNet-152 | 47.3
CCL [ding2018context] | CVPR'18 | ResNet-101 | 51.6
EncNet [zhang2018context] | CVPR'18 | ResNet-101 | 52.6
DANet [fu2019dual] | CVPR'19 | ResNet-101 | 52.6
SVCNet [ding2019semantic] | CVPR'19 | ResNet-101 | 53.2
EMANet [li2019expectation] | ICCV'19 | ResNet-101 | 53.1
APNB [zhu2019asymmetric] | ICCV'19 | ResNet-101 | 52.8
BFP [ding2019boundary] | ICCV'19 | ResNet-101 | 53.6
SPNet (Ours) | - | ResNet-101 | 54.5

Table 7: Comparison with the state of the art on the Pascal Context dataset [mottaghi2014role].

5 Conclusions

In this paper, we have presented a new type of spatial pooling operation, strip pooling. Its long but narrow pooling window allows the model to collect the rich global contextual information that is essential for scene parsing networks. Based on both strip and conventional spatial pooling operations, we designed a novel strip pooling module to enlarge the receptive field of the backbone network and a mixed pooling module built on the classic residual block with bottleneck structure. Experiments on several widely-used datasets demonstrate the effectiveness of the proposed approach.

Acknowledgement. This research was partially supported by AI.SG R-263-000-D97-490, NUS ECRA R-263-000-C87-133, MOE Tier-II R-263-000-D17-112, NSFC (61922046), the national youth talent support program, and Tianjin Natural Science Foundation (17JCJQJC43700).
