Semantic segmentation, as a step towards scene understanding [56, 31, 69, 55], is a challenging problem in computer vision. It refers to the task of assigning semantic labels, such as person and sky, to every pixel within an image. Recently, Deep Convolutional Neural Networks (DCNNs) [32, 30] have significantly improved the performance of semantic segmentation systems.
In particular, DCNNs, deployed in a fully convolutional manner [50, 40], have attained remarkable results on several semantic segmentation benchmarks [16, 14, 77]. We observe two key design components shared among state-of-the-art semantic segmentation systems. First, a multi-scale context module enriches local features by exploiting information from a large spatial extent. Typical examples include DeepLab, which adopts several parallel atrous convolutions [22, 44] with different rates, and PSPNet, which performs pooling operations at different grid scales. Recently, SENets and GENets employ the ‘squeeze-and-excite’ (Figure 1 (a)) or the more general ‘gather-and-excite’ framework (Figure 1 (b)) and obtain remarkable results on the image classification task. Motivated by this, we propose a simple yet effective attention module, called Semantic Prediction Guidance (SPG), which learns to re-weight the local feature map values via guidance from the pixel-wise semantic prediction. Unlike the ‘gather-and-excite’ module [25, 24] (where context information is gathered from a large spatial extent and local features are excited accordingly), our SPG module adopts a ‘supervise-and-excite’ framework (Figure 1 (c)). Specifically, we inject semantic supervision into the feature maps, followed by a simple convolution with a sigmoid activation function (i.e., the ‘supervise’ step). The resulting feature maps, called “Guided Attention”, are used as guidance to re-weight the other transformed feature maps correspondingly (i.e., the ‘excite’ step). We further add an ‘identity’ mapping in the module, similar to the residual block. Additionally, our learned “Guided Attention” allows us to visually explain the “re-weighting” mechanism in our SPG module.
Another important design component is the encoder-decoder structure, where high-level semantic information is captured in the encoder path while the detailed low-level boundary information is recovered in the decoder path. The systems [43, 48, 3, 36, 46, 45, 1, 74, 65, 11, 62] employing the single-stage encoder-decoder structure (i.e., the encoder-decoder structure is stacked only once) have demonstrated outstanding performance on several semantic segmentation benchmarks. On the other hand, the multi-stage encoder-decoder models [61, 5, 41, 42, 68, 28, 33], also known as stacked hourglass networks, iteratively refine the keypoint estimation by propagating information across stages for the task of human pose estimation. Interestingly, we observe that the multi-stage encoder-decoder structure is seldom explored in the context of semantic segmentation, except in [19, 51]. In this work, we revisit multi-stage encoder-decoder networks on the Cityscapes dataset. We find that by carefully selecting features across stages, a two-stage encoder-decoder network coupled with our proposed SPG module can significantly outperform its one-stage counterpart with similar parameters and computations.
On the Cityscapes test set, our proposed SPGNet outperforms the strong baseline DenseASPP when only exploiting the ‘fine’ annotations. Our overall mIoU is slightly behind the concurrent work DANet, but the detailed class-wise mIoU reveals that our model is better than DANet in 14 out of 19 semantic classes. Furthermore, our SPGNet requires only a fraction of the computation of DANet.
To summarize our main contributions:
- We propose a simple yet effective attention module, called SPG, which adopts a ‘supervise-and-excite’ framework.
- We explore multi-stage encoder-decoder networks for the semantic segmentation task. Incorporating our proposed SPG module into the multi-stage encoder-decoder networks further improves the performance.
- We provide detailed ablation studies along with visualizations of our learned attention maps. We also discuss the effectiveness of employing multi-stage encoder-decoder networks for semantic segmentation.
2 Related Works
Semantic Segmentation: The detailed object boundary information is usually missing due to the pooling or strided convolution operations within the network. To alleviate the problem, one could apply atrous convolution [22, 50, 44, 7] to extract dense feature maps. However, it is computationally expensive to extract output feature maps that are 8 or even 4 times smaller than the input resolution using state-of-the-art network backbones [30, 52, 54, 20]. On the other hand, encoder-decoder structures [43, 48, 3, 36, 46, 45, 1, 19, 74, 65, 11, 62, 27] capture context information in the encoder path and recover high-resolution features in the decoder path. Additionally, contextual information has also been explored. ParseNet exploits global context information, while PSPNet uses spatial pyramid pooling at several grid scales. DeepLab [8, 9, 38, 67] uses several parallel atrous convolutions with different rates in its Atrous Spatial Pyramid Pooling module, while DPC applies neural architecture search for the context module. Finally, our proposed Semantic Prediction Guidance (SPG) bears a similarity to the Layer Cascade method, which treats each pixel differently. Instead of classifying easy pixels in the early stages within the network, our SPG module weights each pixel according to the predictions in the first stage of our stacked network.
Multi-Stage Networks: Multi-stage networks [61, 5, 41, 42, 68, 28, 33, 53, 59] have been widely used and explored in human pose estimation. Multi-stage networks aim to iteratively refine the estimation. To maximally utilize the capacity of each stage, CPM and Stacked Hourglass not only propagate features to the next stage, but also remap predicted heatmaps into feature space via a 1x1 convolution and concatenate them with the feature maps. MSPN further optimizes the feature flow across stages by propagating intermediate encoder-decoder features of the previous stage to the next stage, and demonstrates superior performance over its single-stage counterpart with similar parameters and computations. On the other hand, Stacked Deconvolutional Network uses multiple deconvolution networks for semantic segmentation. However, it only passes features across stages and neglects the predictions of each stage. Additionally, Zhou et al. propose a cascade segmentation module. In this work, we find that predictions can serve as a special attention to propagate useful features across stages.
Attention Module: Attention mechanisms have recently been widely used in multiple computer vision tasks. Chen et al. learn an attention module to merge multi-scale features. Kong and Fowlkes propose a gating module that adaptively selects features pooled with different field sizes. Recently, the self-attention module has been explored by several works [23, 60, 18, 13, 26] for computer vision tasks. In contrast, our proposed SPG module is more similar to the works that employ the ‘squeeze-and-excite’ or ‘gather-and-excite’ framework. In particular, Squeeze-and-Excitation Networks (SENets) squeeze the features across spatial dimensions to aggregate the information for re-weighting feature channels. Hu et al. generalize SENets with ‘gather-and-excite’ operations, where long-range spatial information is gathered to re-weight (or ‘excite’) the local features. Motivated by this, our proposed SPG module employs the ‘supervise-and-excite’ framework, where our local features are guided by the semantic supervision. Additionally, EncNet also adds supervision to its global feature; however, our supervision is pixel-level instead of image-level.
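As a point of reference, the ‘squeeze-and-excite’ operation that the SPG module is contrasted with above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the tensor sizes, the reduction ratio, and the use of a ReLU in the excitation MLP are assumptions following the common SE design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_and_excite(x, w1, w2):
    """Re-weight the channels of x (H, W, C) with an SE-style gate.

    w1: (C, C // r) and w2: (C // r, C) form the small excitation MLP;
    the reduction ratio r is a hyperparameter (illustrative here).
    """
    z = x.mean(axis=(0, 1))                  # 'squeeze': global average pool -> (C,)
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)  # 'excite': MLP + sigmoid gate -> (C,)
    return x * s                             # broadcast the gate over spatial dims

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4)) * 0.1
w2 = rng.standard_normal((4, 16)) * 0.1
y = squeeze_and_excite(x, w1, w2)
```

Because the gate lies in (0, 1), each channel of the output is a damped copy of the input; the SPG module replaces this unsupervised gate with one derived from supervised semantic predictions.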
3.1 Overall Architecture
Figure 2 shows our proposed SPGNet, which consists of multiple stages, each based on an encoder-decoder architecture: the encoder produces dense feature maps at multiple scales as well as an image-level feature vector using global average pooling (GAP). The decoder starts with this feature vector and gradually recovers spatial resolution by combining the corresponding encoder feature maps using an upsample module, described in Sec. 3.3.
Our SPGNet stacks multiple stages, where the earlier decoder output is fed into a semantic prediction guidance (SPG) module (detailed in Sec. 3.4) to generate the input features for the next stage. In addition, we employ Cross Stage Feature Aggregation to enhance latter-stage encoders by taking advantage of earlier-stage encoder/decoder features, as shown in Figure 2(c). The decoder output in the final stage is bilinearly upsampled to the input image resolution, generating per-pixel semantic predictions.
The multi-stage design of SPGNet is inspired by Stacked Hourglass for human pose estimation. Our method differs from Stacked Hourglass in that 1) we carefully design the encoder-decoder architecture in each stage instead of using a symmetric hourglass network, and 2) the latter-stage input is generated by the SPG module rather than by simply passing the features combined with predictions from the previous stage.
3.2 Encoder / Decoder Design
Hourglass networks assign equal computation to both encoder and decoder, making it impossible to use weights pre-trained on ImageNet. In contrast, Feature Pyramid Networks (FPN) use well-designed classification networks for the encoder and a simple decoder consisting of only nearest-neighbor interpolations to upsample decoder feature maps. Our encoder-decoder design principles follow FPN (e.g., all the feature maps in the decoder contain 256 channels), but we employ two more components to make it more efficient and effective. First, we incorporate global average pooling after the encoder output to generate image-level features, followed by another convolution to transform its feature channels to 256. Second, instead of using a single nearest-neighbor interpolation, we design an efficient upsample module, as described in the next section.
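The image-level branch described above can be sketched as follows. This is an illustrative NumPy fragment, not the authors' code: the encoder channel count, toy spatial size, and the modeling of the channel-transform convolution as a 1x1 projection (a per-pixel matmul) are assumptions; the 256 decoder channels follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
enc_out = rng.standard_normal((8, 8, 512))   # deepest encoder feature map (toy sizes)

# Image-level feature via global average pooling, then a 1x1-style projection
# to the 256 decoder channels (a 1x1 convolution is a matmul over channels).
gap = enc_out.mean(axis=(0, 1))              # (512,)
w_proj = rng.standard_normal((512, 256)) * 0.05
img_feat = gap @ w_proj                      # (256,)

# The decoder starts from this vector, viewed as a 1x1 'feature map' that the
# upsample modules then grow back to higher resolutions.
dec_init = np.broadcast_to(img_feat, (1, 1, 256))
```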
3.3 Upsample module
As illustrated in Figure 2(a, b), our decoder adopts the upsample module to recover feature map resolution step by step. Specifically, each module in the decoder takes two input feature maps, one from the encoder and one from the previous layer's output. The input from the encoder is first transformed by a residual block to reduce its dimension to the decoder output channels. Then, the input from the previous layer's output is bilinearly upsampled and added to the transformed encoder output. Instead of passing this merged feature directly to the next upsample module, we add another residual block to better fuse the features from the two different sources.
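The data flow of the upsample module can be sketched as below. This is a simplified illustration under stated assumptions: the two residual blocks are replaced by plain 1x1 projections, the channel counts are toy values, and align-corners-style bilinear sampling is assumed.

```python
import numpy as np

def bilinear_upsample_2x(x):
    """Bilinearly upsample x (H, W, C) by a factor of 2 (align-corners style)."""
    h, w, _ = x.shape
    ys = np.linspace(0, h - 1, 2 * h)
    xs = np.linspace(0, w - 1, 2 * w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    a = x[y0[:, None], x0[None, :]]; b = x[y0[:, None], x1[None, :]]
    c = x[y1[:, None], x0[None, :]]; d = x[y1[:, None], x1[None, :]]
    return a*(1-wy)*(1-wx) + b*(1-wy)*wx + c*wy*(1-wx) + d*wy*wx

def upsample_module(enc_feat, prev_feat, w_in, w_fuse):
    """enc_feat: (2H, 2W, Ce) from the encoder; prev_feat: (H, W, C) from the decoder.

    The two 1x1 projections stand in for the residual blocks in the paper.
    """
    transformed = enc_feat @ w_in                    # reduce encoder channels to C
    merged = transformed + bilinear_upsample_2x(prev_feat)
    return merged @ w_fuse                           # fuse features from both sources

rng = np.random.default_rng(0)
enc_feat = rng.standard_normal((8, 8, 64))
prev_feat = rng.standard_normal((4, 4, 32))
out = upsample_module(enc_feat, prev_feat,
                      rng.standard_normal((64, 32)) * 0.1,
                      rng.standard_normal((32, 32)) * 0.1)
```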
3.4 Semantic Prediction Guidance
Using contextual information to re-weight feature channels [25, 24] has brought significant improvements to the image classification task. This process usually includes a ‘gather’ step which collects information over a large spatial region. In contrast, in our multi-stage encoder-decoder network, the output features generated from each stage already contain information from multiple scales. This inspires us to design a simple yet effective SPG module (Figure 3) which treats the features from the earlier stage as ‘gathered’ information. Specifically, the previous-stage decoder output feature (of size H × W × D, where H and W are the decoder output height and width and D is the number of channels used in the decoder) is first fed into a convolution to produce per-class logits of size H × W × C, where C is the number of semantic classes in the dataset. We then produce a per-pixel, per-channel Guided Attention mask from these logits via a simple convolution followed by a sigmoid activation. This Guided Attention is element-wise multiplied with a transformed decoder feature map, generated by a convolution on top of the decoder output, resulting in an attention-augmented feature map. Similar to the residual block, this feature map is added back to the decoder output feature, followed by another convolution to produce the input feature for the next-stage encoder. During training, we minimize the loss on the last-stage semantic prediction and on the per-class logits in all previous stages.
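The SPG data flow above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: each convolution is modeled as a 1x1 projection (a per-pixel matmul), and all shapes are toy values; only the wiring (supervise, excite, identity path, output projection) follows the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spg_module(dec_feat, w_pred, w_att, w_t, w_out):
    """'supervise-and-excite': dec_feat is the previous-stage decoder output (H, W, D).

    w_pred: (D, C) produces per-class logits (supervised during training);
    w_att:  (C, D) maps logits to the per-pixel, per-channel Guided Attention;
    w_t:    (D, D) transforms the decoder feature before excitation;
    w_out:  (D, D) produces the input feature for the next-stage encoder.
    """
    logits = dec_feat @ w_pred                    # 'supervise': per-class logits
    attention = sigmoid(logits @ w_att)           # Guided Attention in (0, 1)
    excited = attention * (dec_feat @ w_t)        # 'excite': re-weight the features
    return (dec_feat + excited) @ w_out, logits   # identity path + output projection

rng = np.random.default_rng(0)
dec_feat = rng.standard_normal((8, 8, 32))
next_in, logits = spg_module(
    dec_feat,
    rng.standard_normal((32, 19)) * 0.1,   # 19 Cityscapes classes
    rng.standard_normal((19, 32)) * 0.1,
    rng.standard_normal((32, 32)) * 0.1,
    rng.standard_normal((32, 32)) * 0.1)
```

During training, a segmentation loss would be applied to `logits`, which is what distinguishes this ‘supervise’ step from an unsupervised ‘gather’.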
Our proposed SPG module differs from SENets and GENets in using supervised semantic predictions to guide the ‘excite’ step. We further verify that having explicit supervision improves model performance. The benefits of our proposed SPG module are twofold: the ‘gather’ step is implicitly folded into the encoder-decoder architecture, which allows the SPG module to be computationally efficient (only a small increase in FLOPs) with a small memory footprint (slightly higher peak memory usage). Meanwhile, using the semantic prediction makes the SPG module more explainable. See Section 4.5 for visualization.
We perform experiments on the Cityscapes dataset, which contains 19 classes. There are 5,000 images with high-quality annotations (called “fine”), divided into 2,975/500/1,525 images for training, validation, and testing. We only use the “fine” annotations in this paper.
4.2 Implementation Details
Networks. We employ ResNet in the encoder module. The “Stem” in Figure 2 consists of a strided convolution followed by a strided max pooling. We replace BatchNorm layers with synchronized Inplace-ABN, and adopt bilinear interpolation in all the upsampling operations.
Training settings. We use a mini-batch SGD momentum optimizer with batch size 8, initial learning rate 0.01, momentum 0.9, and weight decay 0.0001. Following prior works, we use the “poly” learning rate schedule, where the learning rate is scaled by (1 − iter/max_iter)^0.9. For data augmentation, we employ random scaling between [0.5, 2.0] with a step size of 0.25, random flipping, and random cropping. We train the model on the “train” set for the ablation studies. To evaluate our model on the “test” set, we train the model on the concatenation of the “train” and “val” sets.
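The “poly” schedule can be written in a few lines. The exponent 0.9 is the conventional choice in the prior works the paper follows (e.g., DeepLab); treat it as an assumption if this paper used a different power.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """'poly' learning rate schedule: base_lr * (1 - step / max_steps) ** power."""
    return base_lr * (1.0 - step / max_steps) ** power

# The learning rate decays smoothly from 0.01 at step 0 to 0 at the final step.
lrs = [poly_lr(0.01, s, 1000) for s in range(0, 1001, 250)]
```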
4.3 Comparison with State-of-the-Arts
#FLOPs takes all matrix multiplications into account.
In Table 1, we report our Cityscapes “test” set results. We only use the “fine” annotations and thus compare with other state-of-the-art models that adopt the same setting. Similar to other models, we use multi-scale inputs during inference. We also report the model parameters and computation FLOPs (w.r.t. a single input size).
Our best SPGNet variant employs a 2-stage encoder-decoder structure with ResNet-50 as the encoder backbone and the decoder channel setting selected in our ablation studies. Our model outperforms most top-performing approaches on Cityscapes with much less computation. Notably, most state-of-the-art methods are based on systems using atrous convolutions to preserve feature map resolution, which however require a large amount of computation (as indicated by #FLOPs in Table 1). On the contrary, our proposed SPGNet, built on top of an efficient encoder-decoder structure, strikes a better trade-off between accuracy and speed.
To be concrete, the computation of our SPGNet is almost half that of DenseASPP, the previously published state-of-the-art model using only fine annotations, while our performance is better in mIoU. We also compare our SPGNet with another concurrent work, DANet. Our computation is a fraction of DANet's with only a slight mIoU degradation.
We further compare per-class results with the top-2 performing approaches in Table 2. Surprisingly, our SPGNet outperforms DenseASPP in 15 out of 19 classes and DANet in 14 out of 19 classes. The main degradation of our overall mIoU comes from the “truck” class, which is worse in IoU than both DenseASPP and DANet. We think this is because there are only a few “truck” annotations in Cityscapes and our SPGNet requires supervision for learning the guided attention.
4.4 Ablation Studies
Here, we provide ablation studies on Cityscapes val set.
Effect of SPG module.
We perform ablation studies on the SPG design in Table 3. The baseline is a simple 2-stage encoder-decoder network that directly passes the 1st-stage decoder features to the 2nd-stage encoder. This baseline model uses Cross-Stage Feature Aggregation (CSFA), which is slightly better (by 0.18%) than the case without CSFA. We first verify whether passing the semantic prediction together with the decoder features to the next stage is helpful. We transform the predictions from the 1st-stage decoder output by applying a convolution. The sum of the transformed predictions and the 1st-stage decoder output is passed to the next stage (denoted as SPG (sum)). It achieves a higher mIoU than the baseline. Additionally, our proposed SPG module uses the transformed semantic predictions to ‘excite’ the decoder features. We explore two ways for excitation: one applies softmax over the spatial dimension (SPG (softmax)) and the other uses sigmoid (SPG (sigmoid)). The SPG (softmax) scheme improves over the baseline, while the SPG (sigmoid) scheme achieves the best mIoU. Comparing the SPG (sigmoid) scheme with the SPG (sum) scheme shows the importance of using ‘excite’ to re-weight the features. Finally, we investigate the effect of adding the identity mapping path and the supervision in the SPG module. Dropping the identity mapping path in Figure 3 degrades the performance, while removing the supervision on learning the guided attention also decreases the performance, in which case our SPG module degenerates to a special case of ‘gather-and-excite’ (where the features are ‘gathered’ from the 1st-stage decoder output).
SPG module vs SE/GE module.
To demonstrate that the gain of the SPG module comes from supervision, we compare the SPG module with its unsupervised counterparts, i.e., the SE and GE modules. Using the SE and GE modules achieves 77.09 mIoU and 77.22 mIoU respectively; both results are better than the baseline (76.31 mIoU), and GE is slightly better than SE, which is consistent with prior findings. However, they are still worse than our proposed supervise-and-excite (i.e., SPG with 77.67 mIoU). The additional gain mainly comes from adding supervision in supervise-and-excite.
Effect of the number of stages.
We experiment with the effect of using more stages; the results are shown in Table 5. Similar to pose estimation, the performance saturates as the number of stages increases, but in our case it saturates very quickly and is optimal with 2 stages. It is possible that by carefully balancing the loss weights among stages the performance might be better for models with more than 2 stages. However, for simplicity, we focus on models with only 2 stages in this paper.
Effect of encoder combination.
|Encoder combination||mIoU (%)||#Params||#FLOPs|
|ResNet- + ResNet-||M||B|
|ResNet- + ResNet-||M||B|
|ResNet- + ResNet-||M||B|
|ResNet- + ResNet-||M||B|
Our two-stage SPGNet can potentially employ a different backbone in each encoder module. As shown in Table 6, although employing ResNet-18 + ResNet-50 (i.e., ResNet-18 in the 1st encoder and ResNet-50 in the 2nd encoder) and ResNet-50 + ResNet-18 have similar parameters and computation, using the deeper model in the first stage outperforms the other. We think it is crucial to “encode” the features in the early stage with a stronger backbone. Adopting R-+R- achieves the best performance. For simplicity, we adopt the same network backbone in all the encoder modules in this paper.
In Table 7, we study the effect of adopting different backbones in the encoder module(s). We observe that using a deeper encoder improves the results, and that using ResNet-50 in a 2-stage SPGNet achieves a good trade-off between #Params, #FLOPs, and performance.
Hard example mining.
We study the effects of on-line hard example (or pixel) mining (OHEM) [63, 4, 67] in Table 8. We apply OHEM to all stages (i.e., the decoder output in each stage) of our SPGNet. As shown in the table, using OHEM consistently improves the performance.
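The core idea of hard pixel mining can be sketched as keeping only the largest per-pixel losses when averaging. This is an illustrative fragment under stated assumptions: the keep ratio below is a hypothetical value, and real OHEM implementations typically also apply a loss threshold and a minimum kept count.

```python
import numpy as np

def ohem_loss(pixel_losses, keep_ratio=0.25):
    """Average only the hardest pixels (the largest per-pixel losses)."""
    flat = np.sort(pixel_losses.ravel())[::-1]   # sort losses in descending order
    k = max(1, int(len(flat) * keep_ratio))      # number of hard pixels to keep
    return flat[:k].mean()

losses = np.array([[4.0, 3.0], [2.0, 1.0]])
hard = ohem_loss(losses, keep_ratio=0.5)  # mean of the two largest losses
```

Gradients then flow only through the selected hard pixels, focusing training on the difficult regions.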
Effect of decoder channels.
We experiment with the effect of the number of decoder channels in Table 9. Employing ResNet-50 as the encoder backbone with a sufficient number of decoder channels achieves the best validation mIoU.
Flip and multi-scale test. We further add flip and multi-scale testing to the best model (ResNet-50 with 2 stages, in Table 9). With flip and multi-scale inference, the performance further improves from 80.91 to 81.86.
4.5 Visualization of Guided Attention
In this section, we visualize the learned Guided Attention in our best model variant (a stack of two encoder-decoder structures with ResNet-50 as the encoder backbone). The Guided Attention maps (with 256 channels) are obtained by applying a convolution with sigmoid activation to the prediction in the 1st-stage decoder output. Therefore, we have a convolution weight matrix of size C × 256 (Figure 4, top-right), where C is the number of semantic classes in the dataset. To visualize the attention for a class c, we would like to know which of the 256 channels in the Guided Attention maps the class contributes to most. Therefore, for class c, we extract the corresponding convolution weight vector (the red row in the matrix of Figure 4). We then select the indices of the largest weights in this vector (the yellow elements in the vector of Figure 4), which are used to index the corresponding channels in the Guided Attention maps (the yellow slices of the purple Guided Attention maps in Figure 4), i.e., those channels in the Guided Attention maps have the largest responses for class c. Then, we visualize the attention by taking the norm of the selected channels.
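The visualization procedure can be sketched as follows. This is an illustrative fragment: the number of selected channels `k` and the use of the Euclidean norm are assumptions where the paper leaves them unstated, and the shapes are toy values.

```python
import numpy as np

def visualize_class_attention(w_att, attention, cls, k=8):
    """w_att: (C, 256) conv weights mapping class logits to attention channels;
    attention: (H, W, 256) Guided Attention maps; cls: class index.

    Select the k attention channels that class `cls` contributes to most,
    then reduce them to a single (H, W) map by taking a norm over channels.
    """
    top_idx = np.argsort(w_att[cls])[::-1][:k]   # indices of the largest weights
    selected = attention[:, :, top_idx]          # (H, W, k) channel slices
    return np.linalg.norm(selected, axis=-1)     # (H, W) visualization map

rng = np.random.default_rng(0)
attention = rng.uniform(0, 1, size=(16, 16, 256))
w_att = rng.standard_normal((19, 256))           # 19 Cityscapes classes
vis = visualize_class_attention(w_att, attention, cls=0)
```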
General classes. We visualize the learned Guided Attention for four representative classes in Figure 5. ‘Car’ and ‘Person’ are the most common ‘thing’ classes in the Cityscapes dataset. ‘Building’ is a common ‘stuff’ class and ‘Pole’ is a common thin ‘stuff’ class in Cityscapes. The activations are normalized between 0 (blue color) and 1 (red color).
From Figure 5, we observe several interesting behaviors:
The guided attention learns to localize objects. The activations for ‘thing’ classes align quite well with the actual positions of those objects.
The guided attention focuses on object co-occurrence. For example, ‘Car’ and ‘Person’ objects are usually on the road, and the attention for these classes learns to focus on both the corresponding instances and the road.
The guided attention can find small objects. For example, there are multiple thin ‘Poles’ in the third row of Figure 5, and the guided attention finds most of them.
Semantically similar classes. We find the guided attention is also capable of differentiating semantically similar classes. In Figure 6, we visualize the attention for two semantically similar classes: ‘Person’ and ‘Rider’. The attention for ‘Rider’ mainly fires for the rider instance on the right, and it does not fire for the two person instances on the left of the image. Through the injected supervision, our guided attention makes the features passed to the next stage more discriminative for semantically similar classes, allowing our SPGNet to achieve better results on both the ‘Person’ and ‘Rider’ classes than other state-of-the-art models, as shown in Table 2.
Failure cases. Our SPGNet confuses ‘Truck’, ‘Bus’ and ‘Train’. We visualize the attentions for these classes in Figure 7. We observe that the Guided Attention maps for these classes usually activate together on the same object. This potentially produces features that are less discriminative for those classes, resulting in our worse performance on ‘Truck’, ‘Bus’ and ‘Train’, as shown in Table 2.
4.6 Generalization to Other Datasets
|Method||Extra data||Multi-scale||mIoU (%)|
|Liang et al. ||✗||✓||63.57|
|Xia et al. ||✓||✓||64.39|
|Fang et al. ||✓||✓||67.60|
To demonstrate that our model generalizes to other datasets, we perform further experiments on PASCAL VOC 2012 and PASCAL Person-Part. For both datasets, we follow the settings in prior work to train the model with a batch size of 28 for 30,000 iterations.
PASCAL VOC 2012: The SPGNet with a stack of 2 ResNet-50s achieves 77.33 mIoU. The performance of SPGNet is comparable to the current state-of-the-art ResNet-101 DeepLabV3+, which achieves 77.37 mIoU with encoder stride = 32 (for a fair comparison).
PASCAL Person-Part: Table 10 shows comparison with state-of-the-art results on Pascal Person-Part. Our SPGNet with a stack of 2 ResNet-50 achieves 67.23 mIoU with a single scale input, and 68.36 mIoU with multi-scale inputs. Note that our SPGNet does not require extra MPII training data , as used in [64, 17].
We have proposed SPGNet, which demonstrates state-of-the-art performance for semantic segmentation on Cityscapes. Our proposed SPG module employs the ‘supervise-and-excite’ framework, where the local features are re-weighted via guidance from the semantic prediction. The Guided Attention maps within the SPG module allow us to visually interpret the corresponding re-weighting mechanism. Our experimental results show that a two-stage encoder-decoder network paired with our SPG module can significantly outperform its one-stage counterpart with similar parameters and computations. Finally, we plan to explore a more computationally efficient encoder-decoder structure for semantic segmentation in the future.
This work is in part supported by IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM AI Horizons Network and Intelligence Advanced Research Projects Activity (IARPA) via contract D17PC00341, ARC DECRA DE190101315. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. The authors thank Samuel Rota Bulò and Peter Kontschieder for the valuable discussion about the global pooling kernel size.
Appendix A Extra Ablation Studies
We provide extra ablation studies on Cityscapes val set.
Effect of upsample module.
|Upsample Module||mIoU (%)||#Params||#FLOPs|
We perform experiments to demonstrate the effectiveness of our proposed upsample module. We compare the decoder equipped with our proposed upsample module against one using an FPN-style upsample module (i.e., bilinear upsampling + residual blocks vs. nearest-neighbor upsampling + single convolutions). In these experiments, we use ResNet-18 as the encoder and do not use global average pooling. For a fair comparison, we follow the original FPN design to implement the FPN decoder module and only use the largest-resolution feature maps for prediction. We also add synchronized Inplace-ABN after all convolutions in our FPN implementation. The number of decoder channels is the same in both cases. Results are shown in Table 11. The FPN-style upsample module and our proposed module have similar parameters, but our upsample module requires fewer FLOPs than the FPN-style module, thanks to the bottleneck design in the residual blocks. Furthermore, using our upsample module, the performance is notably better than with the FPN-style upsample module.
Effect of global pooling.
|GAP||Test Strategy||mIoU (%)||#Params||#FLOPs|
We experiment with the effect of Global Average Pooling (GAP) using a single-stage encoder-decoder with ResNet-18 as the encoder backbone. The GAP operation is deployed on the encoder features. The decoder module uses 128 channels.
We compare three strategies during inference:
GAP: Use global average pooling during inference on the full image.
TILED: Crop overlapping patches within the image that have the same size as the training crop size, and use overlap between patches (e.g., an overlap of 256 pixels).
AP: Replace global average pooling with an average pooling whose kernel size equals the training crop size divided by the stride of the feature maps.
As shown in Table 12, we observe that using global average pooling (GAP) improves the performance only slightly (by 1.3%) due to the asymmetric setting between training and inference (i.e., training on crops but inference on the full image). The TILED strategy resolves this problem by employing the same pooling kernel size during training and inference. However, it introduces extra computation since it requires processing redundant pixels within the overlapped regions among patches. Furthermore, it requires some heuristics to resolve the conflicts within the overlapped regions (e.g., averaging the predictions in the overlapped regions), which may lead to sub-optimal merging. On the other hand, the AP strategy is more efficient than the TILED strategy and performs slightly better, since no overlapped regions are processed.
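The AP strategy amounts to replacing the single global average with fixed-kernel average pooling over the feature map. A minimal sketch of such pooling via an integral image is below; the kernel and stride values are illustrative, not the paper's (there, the kernel would be the training crop size divided by the feature stride).

```python
import numpy as np

def avg_pool2d(x, k, s):
    """Valid average pooling over x (H, W, C) with kernel k and stride s,
    computed via an integral image (summed-area table)."""
    h, w, c = x.shape
    ii = np.zeros((h + 1, w + 1, c))
    ii[1:, 1:] = x.cumsum(axis=0).cumsum(axis=1)   # ii[i, j] = sum of x[:i, :j]
    ys = np.arange(0, h - k + 1, s)[:, None]
    xs = np.arange(0, w - k + 1, s)[None, :]
    window_sums = (ii[ys + k, xs + k] - ii[ys + k, xs]
                   - ii[ys, xs + k] + ii[ys, xs])
    return window_sums / (k * k)

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = avg_pool2d(x, k=2, s=2)   # non-overlapping 2x2 windows
```

With the kernel matched to the training crop, each pooled value sees the same spatial extent at inference as GAP saw during training, which is what removes the train/test asymmetry.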
-  Md Amirul Islam, Mrigank Rochan, Neil DB Bruce, and Yang Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017.
-  Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
-  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 2017.
-  Samuel Rota Bulò, Gerhard Neuhold, and Peter Kontschieder. Loss maxpooling for semantic image segmentation. In CVPR, 2017.
-  Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  Liang-Chieh Chen, Maxwell D. Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jonathon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NIPS, 2018.
-  Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
-  Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 2017.
-  Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
-  Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
-  Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
-  Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, pages 1971–1978, 2014.
-  Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-Nets: Double attention networks. In NIPS, 2018.
-  Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
-  Hao-Shu Fang, Guansong Lu, Xiaolin Fang, Jianwen Xie, Yu-Wing Tai, and Cewu Lu. Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. arXiv preprint arXiv:1805.04310, 2018.
-  Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
-  Jun Fu, Jing Liu, Yuhang Wang, Jin Zhou, Changyong Wang, and Hanqing Lu. Stacked deconvolutional network for semantic segmentation. IEEE TIP, 2019.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Xuming He, Richard S. Zemel, and Miguel Á. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
-  Matthias Holschneider, Richard Kronland-Martinet, Jean Morlet, and Ph Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. 1989.
-  Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In CVPR, 2018.
-  Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. In NIPS, 2018.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
-  Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In ICCV, 2019.
-  Jianbo Jiao, Yunchao Wei, Zequn Jie, Honghui Shi, Rynson WH Lau, and Thomas S Huang. Geometry-aware distillation for indoor semantic segmentation. In CVPR, 2019.
-  Lipeng Ke, Ming-Ching Chang, Honggang Qi, and Siwei Lyu. Multi-scale structure-aware network for human pose estimation. In ECCV, 2018.
-  Shu Kong and Charless C Fowlkes. Recurrent scene parsing with perspective understanding in the loop. In CVPR, 2018.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Ľubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, and Philip HS Torr. What, where and how many? combining object detectors and crfs. In ECCV, 2010.
-  Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
-  Wenbo Li, Zhicheng Wang, Binyi Yin, Qixiang Peng, Yuming Du, Tianzi Xiao, Gang Yu, Hongtao Lu, Yichen Wei, and Jian Sun. Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148, 2019.
-  Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR, 2017.
-  Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structure-evolving lstm. In CVPR, 2017.
-  Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
-  Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. arXiv preprint arXiv:1901.02985, 2019.
-  Wei Liu, Andrew Rabinovich, and Alexander C Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE TPAMI, 2015.
-  Alejandro Newell, Zhiao Huang, and Jia Deng. Associative embedding: End-to-end learning for joint detection and grouping. In NIPS, 2017.
-  Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
-  Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In ICCV, 2015.
-  George Papandreou, Iasonas Kokkinos, and Pierre-Andre Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015.
-  Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In CVPR, 2017.
-  Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
-  Lorenzo Porzi, Samuel Rota Bulo, Aleksander Colovic, and Peter Kontschieder. Seamless scene segmentation. In CVPR, 2019.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer, 2015.
-  Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of dnns. In CVPR, 2018.
-  Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
-  Sohil Shah, Pallabi Ghosh, Larry S Davis, and Tom Goldstein. Stacked u-nets: a no-frills approach to natural image segmentation. arXiv preprint arXiv:1804.10343, 2018.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  Yuhang Song, Chao Yang, Yeji Shen, Peng Wang, Qin Huang, and C-C Jay Kuo. Spg-net: Segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356, 2018.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
-  Joseph Tighe and Svetlana Lazebnik. Finding things: Image parsing with regions and per-exemplar detectors. In CVPR, 2013.
-  Zhuowen Tu, Xiangrong Chen, Alan L Yuille, and Song-Chun Zhu. Image parsing: Unifying segmentation, detection, and recognition. IJCV, 2005.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
-  Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. In WACV, 2018.
-  Tiantian Wang, Ali Borji, Lihe Zhang, Pingping Zhang, and Huchuan Lu. A stagewise refinement model for detecting salient objects in images. In ICCV, 2017.
-  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
-  Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
-  Zbigniew Wojna, Vittorio Ferrari, Sergio Guadarrama, Nathan Silberman, Liang-Chieh Chen, Alireza Fathi, and Jasper Uijlings. The devil is in the decoder: Classification, regression and gans. IJCV, pages 1–13, 2019.
-  Zifeng Wu, Chunhua Shen, and Anton van den Hengel. Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885, 2016.
-  Fangting Xia, Peng Wang, Xianjie Chen, and Alan L Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017.
-  Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
-  Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.
-  Tien-Ju Yang, Maxwell D Collins, Yukun Zhu, Jyh-Jing Hwang, Ting Liu, Xiao Zhang, Vivienne Sze, George Papandreou, and Liang-Chieh Chen. Deeperlab: Single-shot image parser. arXiv preprint arXiv:1902.05093, 2019.
-  Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In ICCV, 2017.
-  Jian Yao, Sanja Fidler, and Raquel Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
-  Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
-  Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018.
-  Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
-  Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. In ICCV, 2017.
-  Zhenli Zhang, Xiangyu Zhang, Chao Peng, Xiangyang Xue, and Jian Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.
-  Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
-  Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, 2018.
-  Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.
-  Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.