Context Encoding for Semantic Segmentation

03/23/2018 ∙ by Hang Zhang, et al. ∙ 6

Recent work has made significant progress in improving spatial resolution for pixelwise labeling with the Fully Convolutional Network (FCN) framework by employing Dilated/Atrous convolution, utilizing multi-scale features and refining boundaries. In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent featuremaps. The proposed Context Encoding Module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. Our approach has achieved new state-of-the-art results of 51.7% mIoU on PASCAL-Context and 85.9% mIoU on PASCAL VOC 2012. Our single model achieves a final score of 0.5567 on the ADE20K test set, which surpasses the winning entry of the COCO-Place Challenge in 2017. In addition, we also explore how the Context Encoding Module can improve the feature representation of relatively shallow networks for image classification on the CIFAR-10 dataset. Our 14-layer network has achieved an error rate of 3.45%, which is comparable with state-of-the-art approaches that have over 10 times more layers. The source code for the complete system is publicly available.




1 Introduction

Figure 1: Labeling a scene with accurate per-pixel labels is a challenge for semantic segmentation algorithms. Even humans find the task challenging. However, narrowing the list of probable categories based on scene context makes labeling much easier. Motivated by this, we introduce the Context Encoding Module, which selectively highlights the class-dependent featuremaps and makes semantic segmentation easier for the network. (Examples from ADE20K [61].)


Semantic segmentation assigns per-pixel predictions of object categories for the given image, which provides a comprehensive scene description including the information of object category, location and shape. State-of-the-art semantic segmentation approaches are typically based on the Fully Convolutional Network (FCN) framework [37]. The adaptation of Deep Convolutional Neural Networks (CNNs) [29] benefits from the rich information of object categories and scene semantics learned from a diverse set of images [10]. CNNs are able to capture informative representations with global receptive fields by stacking convolutional layers with non-linearities and downsampling. To overcome the loss of spatial resolution associated with downsampling, recent work uses the Dilated/Atrous convolution strategy to produce dense predictions from pre-trained networks [4, 54]. However, this strategy also isolates the pixels from the global scene context, leading to misclassified pixels. For example, in the third row of Figure 21, the baseline approach classifies some pixels in the windowpane as door.

Recent methods have achieved state-of-the-art performance by enlarging the receptive field using multi-resolution pyramid-based representations. For example, PSPNet adopts Spatial Pyramid Pooling, which pools the featuremaps into different sizes and concatenates them after upsampling [59], and DeepLab proposes an Atrous Spatial Pyramid Pooling that employs large rate dilated/atrous convolutions [5]. While these approaches do improve performance, the context representations are not explicit, leading to the question: Is capturing contextual information the same as increasing the receptive field size? Consider labeling a new image for a large dataset (such as ADE20K [61], containing 150 categories) as shown in Figure 1. Suppose we have a tool allowing the annotator to first select the semantic context of the image (e.g. a bedroom). Then, the tool could provide a much smaller sublist of relevant categories (e.g. bed, chair, etc.), which would dramatically reduce the search space of possible categories. Similarly, if we can design an approach to fully utilize the strong correlation between scene context and the probabilities of categories, semantic segmentation becomes easier for the network.

Classic computer vision approaches have the advantage of capturing the semantic context of the scene. For a given input image, hand-engineered features are densely extracted using SIFT [38] or filter bank responses [30, 48]. Then a visual vocabulary (dictionary) is often learned and the global feature statistics are described by classic encoders such as Bag-of-Words (BoW) [26, 8, 46, 13], VLAD [25] or Fisher Vector. The classic representations encode global contextual information by capturing feature statistics. While the hand-crafted features were improved upon greatly by CNN methods, the overall encoding process of traditional methods was convenient and powerful. Can we leverage the context encoding of classic approaches with the power of deep learning? Recent work has made great progress in generalizing traditional encoders in a CNN framework [1, 58]. Zhang et al. introduce an Encoding Layer that integrates the entire dictionary learning and residual encoding pipeline into a single CNN layer to capture orderless representations. This method has achieved state-of-the-art results on texture classification [58]. In this work, we extend the Encoding Layer to capture global feature statistics for understanding semantic context.

Figure 2: Overview of the proposed EncNet. Given an input image, we first use a pre-trained CNN to extract dense convolutional featuremaps. We build a Context Encoding Module on top, including an Encoding Layer to capture the encoded semantics and predict scaling factors that are conditional on these encoded semantics. These learned factors selectively highlight class-dependent featuremaps (visualized in colors). In another branch, we employ Semantic Encoding Loss (SE-loss) to regularize the training, which lets the Context Encoding Module predict the presence of the categories in the scene. Finally, the representation of the Context Encoding Module is fed into the last convolutional layer to make per-pixel predictions. (Notation: FC: fully connected layer, Conv: convolutional layer, Encode: Encoding Layer [58], ⊗: channel-wise multiplication.)

As the first contribution of this paper, we introduce a Context Encoding Module incorporating Semantic Encoding Loss (SE-loss), a simple unit to leverage the global scene context information. The Context Encoding Module integrates an Encoding Layer to capture global context and selectively highlight the class-dependent featuremaps. For intuition, consider that we would want to de-emphasize the probability of a vehicle appearing in an indoor scene. The standard training process employs only a per-pixel segmentation loss, which does not strongly utilize the global context of the scene. We introduce Semantic Encoding Loss (SE-loss) to regularize the training, which lets the network predict the presence of the object categories in the scene to enforce network learning of semantic context. Unlike the per-pixel loss, SE-loss gives an equal contribution to big and small objects, and we find that the performance on small objects is often improved in practice. The proposed Context Encoding Module and Semantic Encoding Loss are conceptually straightforward and compatible with existing FCN-based approaches.

The second contribution of this paper is the design and implementation of a new semantic segmentation framework, the Context Encoding Network (EncNet). EncNet augments a pre-trained Deep Residual Network (ResNet) [17] by including a Context Encoding Module as shown in Figure 2. We use the dilation strategy [54, 4] on the pre-trained networks. The proposed Context Encoding Network achieves state-of-the-art results of 85.9% mIoU on PASCAL VOC 2012 and 51.7% on PASCAL-Context. Our single model of EncNet-101 has achieved a score of 0.5567, which surpasses the winning entry of the COCO-Place Challenge 2017 [61]. In addition to semantic segmentation, we also study the power of our Context Encoding Module for visual recognition on the CIFAR-10 dataset [28], where the performance of a shallow network is significantly improved using the proposed Context Encoding Module. Our network has achieved an error rate of 3.96% using only 3.5M parameters. We release the complete system including state-of-the-art approaches together with our implementation of synchronized multi-GPU Batch Normalization [23] and memory-efficient Encoding Layer [58].

Figure 3: Dilation strategy and losses. Each cube denotes a different network stage. We apply the dilation strategy to stages 3 and 4. The Semantic Encoding Losses (SE-loss) are added to both stages 3 and 4 of the base network. (D denotes the dilation rate, Seg-loss represents the per-pixel segmentation loss.)

2 Context Encoding Module

We refer to the new CNN module as Context Encoding Module and the components of the module are illustrated in Figure 2.

Context Encoding

Understanding and utilizing contextual information is very important for semantic segmentation. For a network pre-trained on a diverse set of images [10], the featuremaps encode rich information about what objects are in the scene. We employ the Encoding Layer [58] to capture the feature statistics as a global semantic context. We refer to the output of the Encoding Layer as encoded semantics. To utilize the context, a set of scaling factors is predicted to selectively highlight the class-dependent featuremaps. The Encoding Layer learns an inherent dictionary carrying the semantic context of the dataset and outputs the residual encoders with rich contextual information. We briefly describe the prior work on the Encoding Layer for completeness.

The Encoding Layer considers an input featuremap with the shape of $C \times H \times W$ as a set of $C$-dimensional input features $X = \{x_1, \dots, x_N\}$, where $N$ is the total number of features given by $H \times W$. It learns an inherent codebook $D = \{d_1, \dots, d_K\}$ containing $K$ codewords (visual centers) and a set of smoothing factors of the visual centers $S = \{s_1, \dots, s_K\}$. The Encoding Layer outputs the residual encoder by aggregating the residuals with soft-assignment weights, $e_k = \sum_{i=1}^{N} e_{ik}$, where

$$e_{ik} = \frac{\exp(-s_k \|r_{ik}\|^2)}{\sum_{j=1}^{K} \exp(-s_j \|r_{ij}\|^2)} \, r_{ik}, \tag{1}$$

and the residuals are given by $r_{ik} = x_i - d_k$. We apply aggregation to the encoders instead of concatenation. That is, $e = \sum_{k=1}^{K} \phi(e_k)$, where $\phi$ denotes Batch Normalization with ReLU activation. This avoids imposing an order on the $K$ independent encoders and also reduces the dimensionality of the feature representations.
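The residual encoding described above can be illustrated with a minimal sketch in plain Python. It assumes features, codewords and smoothing factors are given as nested lists, and it stops at the per-codeword encoders: the final aggregation through Batch Normalization and ReLU is omitted. The function name `encoding_layer` is hypothetical and is not taken from the released code.

```python
import math

def encoding_layer(features, codewords, smoothing):
    """Aggregate residuals r_ik = x_i - d_k with soft-assignment weights.

    features:  N feature vectors x_i, each of length C
    codewords: K codewords d_k, each of length C
    smoothing: K smoothing factors s_k
    Returns K residual encoders e_k, each of length C.
    """
    num_codewords = len(codewords)
    channels = len(codewords[0])
    encoders = [[0.0] * channels for _ in range(num_codewords)]
    for x in features:
        # Residual of this feature with respect to every codeword.
        residuals = [[xc - dc for xc, dc in zip(x, d)] for d in codewords]
        # Scaled negative squared distances, one per codeword.
        logits = [-s * sum(r * r for r in res)
                  for s, res in zip(smoothing, residuals)]
        # Softmax over codewords, shifted by the max for numerical stability.
        top = max(logits)
        exps = [math.exp(value - top) for value in logits]
        total = sum(exps)
        for k in range(num_codewords):
            weight = exps[k] / total
            for c in range(channels):
                encoders[k][c] += weight * residuals[k][c]
    return encoders
```

A feature that coincides with a codeword contributes a zero residual to that codeword, so only the other codewords receive a (down-weighted) contribution from it.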

Featuremap Attention

To make use of the encoded semantics captured by the Encoding Layer, we predict scaling factors of featuremaps as a feedback loop to emphasize or de-emphasize class-dependent featuremaps. We use a fully connected layer on top of the Encoding Layer and a sigmoid as the activation function, which outputs predicted featuremap scaling factors $\gamma = \delta(W e)$, where $W$ denotes the layer weights, $e$ is the output of the Encoding Layer, and $\delta$ is the sigmoid function. Then the module output is given by $Y = X \otimes \gamma$, a channel-wise multiplication $\otimes$ between the input featuremaps $X$ and the scaling factors $\gamma$. This feedback strategy is inspired by prior work in style transfer [57, 22] and a recent work, SE-Net [20], that tune featuremap scale or statistics. As an intuitive example of the utility of the approach, consider emphasizing the probability of an airplane in a sky scene, but de-emphasizing that of a vehicle.
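The channel-wise scaling can be sketched as follows, assuming each channel of the featuremap is stored as a flat list of activations and the fully connected layer has no bias term (an assumption made here for brevity). The name `featuremap_attention` is illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def featuremap_attention(featuremaps, encoded, weights):
    """Scale every channel by a factor predicted from the encoded semantics.

    featuremaps: C channels, each a flat list of H*W activations
    encoded:     encoded semantics vector of length D
    weights:     C rows of length D (fully connected layer, bias omitted)
    Returns the scaling factors and the rescaled featuremaps.
    """
    # One sigmoid-activated scaling factor per channel.
    gammas = [sigmoid(sum(w * e for w, e in zip(row, encoded)))
              for row in weights]
    # Channel-wise multiplication: each channel is scaled by its own factor.
    scaled = [[g * v for v in channel]
              for g, channel in zip(gammas, featuremaps)]
    return gammas, scaled
```

Because the factors pass through a sigmoid they lie strictly between 0 and 1, so a channel can be attenuated toward zero or passed almost unchanged, but never amplified.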

Semantic Encoding Loss

In the standard training process of semantic segmentation, the network is learned from isolated pixels (per-pixel cross-entropy loss for a given input image and ground truth labels). The network may have difficulty understanding context without global information. To regularize the training of the Context Encoding Module, we introduce Semantic Encoding Loss (SE-loss), which forces the network to understand the global semantic information at very small extra computation cost. We build an additional fully connected layer with a sigmoid activation function on top of the Encoding Layer to make individual predictions for the presence of object categories in the scene, and learn with a binary cross entropy loss. Unlike the per-pixel loss, SE-loss considers big and small objects equally. In practice, we find that the segmentation of small objects is often improved. In summary, the Context Encoding Module shown in Figure 2 captures the semantic context to predict a set of scaling factors that selectively highlight the class-dependent featuremaps for semantic segmentation.
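As a rough illustration of the SE-loss, the sketch below computes a binary cross entropy between per-category presence scores and 0/1 presence labels. It works on raw scores (before the sigmoid) for numerical stability, and it averages over categories; the reduction is an assumption, since the text does not specify one. The name `se_loss` is hypothetical.

```python
import math

def se_loss(logits, present):
    """Binary cross entropy between presence scores and 0/1 presence labels.

    logits:  one raw score per category (the sigmoid is folded into the loss)
    present: one 0/1 label per category
    Returns the loss averaged over categories (assumed reduction).
    """
    total = 0.0
    for z, t in zip(logits, present):
        # Numerically stable form of -[t*log(p) + (1-t)*log(1-p)], p = sigmoid(z).
        total += max(z, 0.0) - z * t + math.log(1.0 + math.exp(-abs(z)))
    return total / len(logits)
```

Each category contributes one term regardless of how many pixels it covers, which is the sense in which big and small objects are treated equally.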

2.1 Context Encoding Network (EncNet)

Figure 21: Understanding contextual information of the scene is important for semantic segmentation. For example, the baseline FCN classifies sand as earth without knowing the context, as in the first example. Building, house and skyscraper are hard to distinguish without the semantics, as in the second and fourth rows. In the third example, FCN identifies windowpane as door due to classifying isolated pixels without a global sense/view. (Panels: (a) Image, (b) Ground Truth, (c) FCN (baseline), (d) EncNet (ours), (e) Legend. Visual examples from the ADE20K dataset.)

With the proposed Context Encoding Module, we build a Context Encoding Network (EncNet) with a pre-trained ResNet [17]. We follow prior work in using the dilated network strategy on the pre-trained network [6, 59, 55] at stages 3 and 4 (we refer to the stage with original featuremap size 1/16 as stage 3 and size 1/32 as stage 4), as shown in Figure 3. We build our proposed Context Encoding Module on top of the convolutional layers right before the final prediction, as shown in Figure 2. To further improve the performance and regularize the training of the Context Encoding Module, we make a separate branch to minimize the SE-loss, which takes the encoded semantics as input and predicts the presence of the object classes. As the Context Encoding Module and SE-loss are very lightweight, we build another Context Encoding Module on top of stage 3 to minimize the SE-loss as an additional regularization, similar to but much cheaper than the auxiliary loss of PSPNet [59]. The ground truths of SE-loss are directly generated from the ground-truth segmentation mask without any additional annotations.

Our Context Encoding Module is differentiable and can be inserted into the existing FCN pipeline without any extra training supervision or modification of the framework. In terms of computation, the proposed EncNet introduces only marginal extra computation to the original dilated FCN network.


Figure 22: Ablation study of SE-loss and number of codewords. Left: mIoU and pixAcc as a function of SE-loss weight $\alpha$. Empirically, the SE-loss works best with $\alpha = 0.2$. Right: mIoU and pixAcc as a function of the number of codewords $K$ in the Encoding Layer; $K = 0$ denotes using global average pooling. The results are tested using single scale evaluation. (Note: the axes are different on the left and right sides.)

2.2 Relation to Other Approaches

Segmentation Approaches

CNNs have become the de facto standard in computer vision tasks including semantic segmentation. Early approaches generate segmentation masks by classifying region proposals [14, 15]. The Fully Convolutional Network (FCN) pioneered the era of end-to-end segmentation [37]. However, recovering detailed information from downsampled featuremaps is difficult due to the use of pre-trained networks that were originally designed for image classification. To address this difficulty, one way is to learn the upsampling filters, i.e. fractionally-strided convolutions or decoders [3, 41]. The other path is to apply the Atrous/Dilated convolution strategy to the network [4, 54], which preserves the large receptive field and produces dense predictions. Prior work adopts a dense CRF taking FCN outputs to refine the segmentation boundaries [5, 7], and CRF-RNN achieves end-to-end learning of CRF with FCN [60]. Recent FCN-based work dramatically boosts performance by increasing the receptive field with larger rate atrous convolution or global/pyramid pooling [6, 35, 59]. However, these strategies have to sacrifice the efficiency of the model: for example, PSPNet [59] applies convolutions on flat featuremaps after Pyramid Pooling and upsampling, and DeepLab [5] employs large rate atrous convolution that will degenerate to 1×1 convolution in extreme cases. We propose the Context Encoding Module to efficiently leverage global context for semantic segmentation, which requires only marginal extra computation cost. In addition, the proposed Context Encoding Module, as a simple CNN unit, is compatible with all existing FCN-based approaches.

Featuremap Attention and Scaling

The strategy of channel-wise featuremap attention is inspired by some pioneering work. Spatial Transformer Network [24] learns an in-network transformation conditional on the input, which provides a spatial attention to the featuremaps without extra supervision. Batch Normalization [23] makes the normalization of the data mean and variance over the mini-batch part of the network, which successfully allows a larger learning rate and makes the network less sensitive to the initialization method. Recent work in style transfer manipulates the featuremap mean and variance [11, 22] or second order statistics to enable in-network style switch [57]. A very recent work, SE-Net, explores the cross channel information to learn a channel-wise attention and has achieved state-of-the-art performance in image classification [20]. Inspired by these methods, we use encoded semantics to predict scaling factors of featuremap channels, which provides a mechanism to assign saliency by emphasizing or de-emphasizing individual featuremaps conditioned on scene context.

3 Experimental Results

| Method | BaseNet | Encoding | SE-loss | MS | pixAcc% | mIoU% |
|---|---|---|---|---|---|---|
| FCN | Res50 | | | | 73.4 | 41.0 |
| EncNet | Res50 | ✓ | | | 78.1 | 47.6 |
| EncNet | Res50 | ✓ | ✓ | | 79.4 | 49.2 |
| EncNet | Res101 | ✓ | ✓ | | 80.4 | 51.7 |
| EncNet | Res101 | ✓ | ✓ | ✓ | 81.2 | 52.6 |

Table 1: Ablation study on PASCAL-Context dataset. Encoding represents the Context Encoding Module, SE-loss is the proposed Semantic Encoding Loss, MS means multi-size evaluation. Notably, applying the Context Encoding Module introduces only marginal extra computation, but the performance is significantly improved. (PixAcc and mIoU calculated on 59 classes w/o background.)

In this section, we first provide implementation details for EncNet and the baseline approach, then we conduct a complete ablation study on the PASCAL-Context dataset [40], and finally we report the performance on the PASCAL VOC 2012 [12] and ADE20K [61] datasets. In addition to semantic segmentation, we also explore how the Context Encoding Module can improve the image classification performance of a shallow network on the CIFAR-10 dataset in Sec 3.5.

3.1 Implementation Details

Our experiment system, including pre-trained models, is based on the open-source toolbox PyTorch [42]. We apply the dilation strategy to stages 3 and 4 of the pre-trained networks, giving an output size of 1/8 [4, 54]. The output predictions are upsampled 8 times using bilinear interpolation for calculating the loss [6]. We follow prior work [59, 5] and use the learning rate schedule $lr = baselr \times (1 - \frac{iter}{total\_iter})^{power}$. The base learning rate is set to 0.01 for the ADE20K dataset and 0.001 for the others, and the power is set to 0.9. The momentum is set to 0.9 and the weight decay to 0.0001. The networks are trained for 50 epochs on PASCAL-Context [40] and PASCAL VOC 2012 [12], and 120 epochs on ADE20K [61]. We randomly shuffle the training samples and discard the last mini-batch. For data augmentation, we randomly flip the image and scale it between 0.5 and 2, then randomly rotate it between -10 and 10 degrees, and finally crop it to a fixed size, using zero padding if needed. For evaluation, we average the network predictions over multiple scales following [59, 35, 45].
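The polynomial ("poly") learning rate schedule used in the prior work cited above, with a base learning rate and a power of 0.9, can be written in a few lines. This is a minimal sketch; the function and argument names are chosen here for illustration.

```python
def poly_lr(base_lr, iteration, total_iters, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / total_iters) ** power."""
    return base_lr * (1.0 - iteration / total_iters) ** power

# Example: decay from the ADE20K base rate of 0.01 over 1000 iterations.
schedule = [poly_lr(0.01, it, 1000) for it in (0, 250, 500, 750, 1000)]
```

The rate starts at the base value, decreases monotonically, and reaches zero at the final iteration.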

In practice, a larger crop size typically yields better performance for semantic segmentation, but also consumes more GPU memory, which leads to a much smaller working batch size for Batch Normalization [23] and degrades the training. To address this difficulty, we implement Synchronized Cross-GPU Batch Normalization in PyTorch using the NVIDIA CUDA & NCCL toolkit, which increases the working batch size to the global mini-batch size (discussed in Appendix A). We use a mini-batch size of 16 during training. For comparison with our work, we use dilated ResNet FCN as the baseline approach. For training EncNet, we use 32 codewords in the Encoding Layers. The ground truth labels for SE-loss are generated by a "unique" operation that finds the categories present in the given ground-truth segmentation mask. The final loss is given by a weighted sum of the per-pixel segmentation loss and the SE-loss.
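The label generation and the loss combination can be sketched as below. Two details are assumptions made for illustration: the value used for unlabeled pixels (`ignore_label`) and the placement of the weight on the SE-loss term only. The helper names are hypothetical.

```python
def presence_targets(mask, num_classes, ignore_label=-1):
    """Return a 0/1 vector marking which categories occur in the mask.

    mask: 2D list of integer category labels; pixels equal to ignore_label
          (an assumed convention for unlabeled pixels) are skipped.
    """
    # The "unique" operation: the set of distinct labels in the mask.
    present = {label for row in mask for label in row if label != ignore_label}
    return [1 if c in present else 0 for c in range(num_classes)]

def total_loss(seg_loss, se_loss_value, se_weight):
    """Weighted sum of the two losses; the weight is applied to the SE term."""
    return seg_loss + se_weight * se_loss_value
```

No annotation beyond the segmentation mask is needed: the presence vector is a pure function of the mask.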

Evaluation Metrics. We use the standard evaluation metrics of pixel accuracy (pixAcc) and mean Intersection over Union (mIoU). For object segmentation on the PASCAL VOC 2012 dataset, we use the official evaluation server, which calculates mIoU considering the background as one of the categories. For the whole scene parsing datasets PASCAL-Context and ADE20K, we follow the standard competition benchmark [61] and calculate mIoU by ignoring background pixels.
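For concreteness, the two metrics can be computed as in the sketch below, which operates on flattened prediction and ground-truth label lists. Skipping pixels with a designated `ignore_label` and averaging IoU only over classes that actually occur are assumptions of this sketch, not a description of the official evaluation servers.

```python
def segmentation_metrics(pred, target, num_classes, ignore_label=-1):
    """Compute (pixAcc, mIoU) from flat lists of predicted and true labels."""
    correct = 0
    labeled = 0
    inter = [0] * num_classes  # pixels where prediction and truth agree on class c
    union = [0] * num_classes  # pixels where prediction or truth is class c
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixels do not count toward either metric
        labeled += 1
        if p == t:
            correct += 1
            inter[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            union[p] += 1
    # Average IoU over classes that appear in the prediction or the truth.
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return correct / labeled, sum(ious) / len(ious)
```

In a full evaluation the intersection and union counts would be accumulated over the whole dataset before dividing.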

3.2 Results on PASCAL-Context

| Method | BaseNet | mIoU% |
|---|---|---|
| FCN-8s [37] | | 37.8 |
| CRF-RNN [60] | | 39.3 |
| ParseNet [35] | | 40.4 |
| BoxSup [9] | | 40.5 |
| HO_CRF [2] | | 41.3 |
| Piecewise [32] | | 43.3 |
| VeryDeep [51] | | 44.5 |
| DeepLab-v2 [5] | Res101-COCO | 45.7 |
| RefineNet [31] | Res152 | 47.3 |
| EncNet (ours) | Res101 | 51.7 |

Table 2: Segmentation results on PASCAL-Context dataset. (Note: mIoU on 60 classes w/ background.)

The PASCAL-Context dataset [40] provides dense semantic labels for the whole scene, with 4,998 images for training and 5,105 for testing. We follow prior work [40, 5, 31] and use the semantic labels of the 59 most frequent object categories plus background (60 classes in total). We use the pixAcc and mIoU for 59 classes as evaluation metrics in the ablation study of EncNet. For comparison with prior work, we also report the mIoU using 60 classes in Table 2 (considering the background as one of the classes).

Figure 39: Visual examples in PASCAL-Context dataset. EncNet produces more accurate predictions. (Panels: (a) Image, (b) Ground Truth, (c) FCN, (d) EncNet (ours).)

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN [37] | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1 | 62.2 |
| DeepLabv2 [4] | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7 | 74.1 | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7 | 71.6 |
| CRF-RNN [60] | 87.5 | 39.0 | 79.7 | 64.2 | 68.3 | 87.6 | 80.8 | 84.4 | 30.4 | 78.2 | 60.4 | 80.5 | 77.8 | 83.1 | 80.6 | 59.5 | 82.8 | 47.8 | 78.3 | 67.1 | 72.0 |
| DeconvNet [41] | 89.9 | 39.3 | 79.7 | 63.9 | 68.2 | 87.4 | 81.2 | 86.1 | 28.5 | 77.0 | 62.0 | 79.0 | 80.3 | 83.6 | 80.2 | 58.8 | 83.4 | 54.3 | 80.7 | 65.0 | 72.5 |
| GCRF [49] | 85.2 | 43.9 | 83.3 | 65.2 | 68.3 | 89.0 | 82.7 | 85.3 | 31.1 | 79.5 | 63.3 | 80.5 | 79.3 | 85.5 | 81.0 | 60.5 | 85.5 | 52.0 | 77.3 | 65.1 | 73.2 |
| DPN [36] | 87.7 | 59.4 | 78.4 | 64.9 | 70.3 | 89.3 | 83.5 | 86.1 | 31.7 | 79.9 | 62.6 | 81.9 | 80.0 | 83.5 | 82.3 | 60.5 | 83.2 | 53.4 | 77.9 | 65.0 | 74.1 |
| Piecewise [32] | 90.6 | 37.6 | 80.0 | 67.8 | 74.4 | 92.0 | 85.2 | 86.2 | 39.1 | 81.2 | 58.9 | 83.8 | 83.9 | 84.3 | 84.8 | 62.1 | 83.2 | 58.2 | 80.8 | 72.3 | 75.3 |
| ResNet38 [52] | 94.4 | 72.9 | 94.9 | 68.8 | 78.4 | 90.6 | 90.0 | 92.1 | 40.1 | 90.4 | 71.7 | 89.9 | 93.7 | 91.0 | 89.1 | 71.3 | 90.7 | 61.3 | 87.7 | 78.1 | 82.5 |
| PSPNet [59] | 91.8 | 71.9 | 94.7 | 71.2 | 75.8 | 95.2 | 89.9 | 95.9 | 39.3 | 90.7 | 71.7 | 90.5 | 94.5 | 88.8 | 89.6 | 72.8 | 89.6 | 64.0 | 85.1 | 76.3 | 82.6 |
| EncNet (ours) | 94.1 | 69.2 | 96.3 | 76.7 | 86.2 | 96.3 | 90.7 | 94.2 | 38.8 | 90.7 | 73.3 | 90.0 | 92.5 | 88.8 | 87.9 | 68.7 | 92.6 | 59.0 | 86.4 | 73.4 | 82.9 |
| With COCO Pre-training | | | | | | | | | | | | | | | | | | | | | |
| CRF-RNN [60] | 90.4 | 55.3 | 88.7 | 68.4 | 69.8 | 88.3 | 82.4 | 85.1 | 32.6 | 78.5 | 64.4 | 79.6 | 81.9 | 86.4 | 81.8 | 58.6 | 82.4 | 53.5 | 77.4 | 70.1 | 74.7 |
| Dilation8 [54] | 91.7 | 39.6 | 87.8 | 63.1 | 71.8 | 89.7 | 82.9 | 89.8 | 37.2 | 84.0 | 63.0 | 83.3 | 89.0 | 83.8 | 85.1 | 56.8 | 87.6 | 56.0 | 80.2 | 64.7 | 75.3 |
| DPN [36] | 89.0 | 61.6 | 87.7 | 66.8 | 74.7 | 91.2 | 84.3 | 87.6 | 36.5 | 86.3 | 66.1 | 84.4 | 87.8 | 85.6 | 85.4 | 63.6 | 87.3 | 61.3 | 79.4 | 66.4 | 77.5 |
| Piecewise [32] | 94.1 | 40.7 | 84.1 | 67.8 | 75.9 | 93.4 | 84.3 | 88.4 | 42.5 | 86.4 | 64.7 | 85.4 | 89.0 | 85.8 | 86.0 | 67.5 | 90.2 | 63.8 | 80.9 | 73.0 | 78.0 |
| DeepLabv2 [5] | 92.6 | 60.4 | 91.6 | 63.4 | 76.3 | 95.0 | 88.4 | 92.6 | 32.7 | 88.5 | 67.6 | 89.6 | 92.1 | 87.0 | 87.4 | 63.3 | 88.3 | 60.0 | 86.8 | 74.5 | 79.7 |
| RefineNet [31] | 95.0 | 73.2 | 93.5 | 78.1 | 84.8 | 95.6 | 89.8 | 94.1 | 43.7 | 92.0 | 77.2 | 90.8 | 93.4 | 88.6 | 88.1 | 70.1 | 92.9 | 64.3 | 87.7 | 78.8 | 84.2 |
| ResNet38 [52] | 96.2 | 75.2 | 95.4 | 74.4 | 81.7 | 93.7 | 89.9 | 92.5 | 48.2 | 92.0 | 79.9 | 90.1 | 95.5 | 91.8 | 91.2 | 73.0 | 90.5 | 65.4 | 88.7 | 80.6 | 84.9 |
| PSPNet [59] | 95.8 | 72.7 | 95.0 | 78.9 | 84.4 | 94.7 | 92.0 | 95.7 | 43.1 | 91.0 | 80.3 | 91.3 | 96.3 | 92.3 | 90.1 | 71.5 | 94.4 | 66.9 | 88.8 | 82.0 | 85.4 |
| DeepLabv3 [6] | 96.4 | 76.6 | 92.7 | 77.8 | 87.6 | 96.7 | 90.2 | 95.4 | 47.5 | 93.4 | 76.3 | 91.4 | 97.2 | 91.0 | 92.1 | 71.3 | 90.9 | 68.9 | 90.8 | 79.3 | 85.7 |
| EncNet (ours) | 95.3 | 76.9 | 94.2 | 80.2 | 85.2 | 96.5 | 90.8 | 96.3 | 47.9 | 93.9 | 80.0 | 92.4 | 96.6 | 90.5 | 91.5 | 70.8 | 93.6 | 66.5 | 87.7 | 80.8 | 85.9 |

Table 3: Per-class results on the PASCAL VOC 2012 test set. EncNet outperforms existing approaches and achieves 82.9% and 85.9% mIoU w/o and w/ pre-training on the COCO dataset. (Note: entries using extra data beyond COCO are not included [39, 6, 50].)

Ablation Study.

To evaluate the performance of EncNet, we conduct experiments with different settings as shown in Table 1. Compared to the baseline FCN, simply adding a Context Encoding Module on top yields results of 78.1/47.6 (pixAcc and mIoU), which introduces only around 3%-5% extra computation but dramatically outperforms the baseline results of 73.4/41.0. To study the effect of SE-loss, we test different SE-loss weights $\alpha \in \{0.0, 0.1, 0.2, 0.4, 0.8\}$, and we find that $\alpha = 0.2$ yields the best performance, as shown in Figure 22 (left). We also study the effect of the number of codewords $K$ in the Encoding Layer in Figure 22 (right); we use $K = 32$ because the improvement saturates ($K = 0$ means using global average pooling instead). A deeper pre-trained network provides better feature representations: EncNet gets an additional 2.5% improvement in mIoU by employing ResNet101. Finally, multi-size evaluation yields our final scores of 81.2% pixAcc and 52.6% mIoU, which is 51.7% mIoU including background. Our proposed EncNet outperforms previous state-of-the-art approaches [5, 31] without using COCO pre-training or a deeper model (ResNet152) (see results in Table 2 and Figure 39).

3.3 Results on PASCAL VOC 2012

We also evaluate the performance of the proposed EncNet on the PASCAL VOC 2012 dataset [12], one of the gold standard benchmarks for semantic segmentation. Following [37, 9, 6], we use the augmented annotation set [16], consisting of 10,582, 1,449 and 1,456 images in the training, validation and test sets. The models are trained on the train+val set and then fine-tuned on the original PASCAL training set. EncNet has achieved 82.9% mIoU, outperforming all previous work without COCO data and achieving superior performance in many categories, as shown in Table 3. For comparison with state-of-the-art approaches, we follow the procedure of pre-training on the MS-COCO dataset [33]. From the training set of MS-COCO, we select the images that contain any of the 20 classes shared with the PASCAL dataset with more than 1,000 labeled pixels, resulting in 6.5K images. All the other classes are marked as background. Our model is pre-trained using a base learning rate of 0.01 and then fine-tuned on the PASCAL dataset using the aforementioned setting. EncNet achieves the best result of 85.9% mIoU, as shown in Table 3. Compared to the state-of-the-art approaches PSPNet [59] and DeepLabv3 [6], EncNet has lower computational complexity.

| Method | BaseNet | pixAcc% | mIoU% |
|---|---|---|---|
| FCN [37] | | 71.32 | 29.39 |
| SegNet [3] | | 71.00 | 21.64 |
| DilatedNet [54] | | 73.55 | 32.31 |
| CascadeNet [61] | | 74.52 | 34.90 |
| RefineNet [31] | Res152 | - | 40.7 |
| PSPNet [59] | Res101 | 81.39 | 43.29 |
| PSPNet [59] | Res269 | 81.69 | 44.94 |
| FCN (baseline) | Res50 | 74.57 | 34.38 |
| EncNet (ours) | Res50 | 79.73 | 41.11 |
| EncNet (ours) | Res101 | 81.69 | 44.65 |

Table 4: Segmentation results on ADE20K validation set.

| Rank | Team | Final Score |
|---|---|---|
| - | (EncNet-101, single model ours) | 0.5567 |
| 1 | CASIA_IVA_JD | 0.5547 |
| 2 | WinterIsComing | 0.5544 |
| - | (PSPNet-269, single model) [59] | 0.5538 |

Table 5: Results on the ADE20K test set, with ranks in the COCO-Place challenge 2017. Our single model surpasses PSP-Net-269 (1st place in 2016) and the winning entry of the COCO-Place challenge 2017 [61].

3.4 Results on ADE20K

The ADE20K dataset [61] is a recent scene parsing benchmark containing dense labels of 150 stuff/object categories. The dataset includes 20K/2K/3K images for the training, validation and test sets. We train our EncNet on the training set and evaluate it on the validation set using pixAcc and mIoU. Visual examples are shown in Figure 21. The proposed EncNet significantly outperforms the baseline FCN. EncNet-101 achieves comparable results with the state-of-the-art PSPNet-269 using a much shallower base network, as shown in Table 4. We fine-tune EncNet-101 for an additional 20 epochs on the train-val set and submit the results on the test set. EncNet achieves a final score of 0.5567 (evaluation provided by the ADE20K organizers), which surpasses PSP-Net-269 (1st place in 2016) and all entries in the COCO-Place Challenge 2017 (shown in Table 5).

3.5 Image Classification Results on CIFAR-10

In addition to semantic segmentation, we also conduct studies of Context Encoding Module for image recognition on CIFAR-10 dataset [28] consisting of 50K training images and 10K test images in 10 classes. State-of-the-art methods typically rely on very deep and large models [21, 17, 19, 53]. In this section, we explore how much Context Encoding Module will improve the performance of a relatively shallow network, a 14-layer ResNet [17].

Implementation Details.

For comparison with our work, we first implement a wider version of pre-activation ResNet [19] and a recent work, Squeeze-and-Excitation Networks (SE-Net) [20], as our baseline approaches. ResNet consists of a 3×3 convolutional layer with 64 channels, followed by 3 stages with 2 basicblocks in each stage, and ends with a global average pooling and a 10-way fully-connected layer. The basicblock consists of two 3×3 convolutional layers with an identity shortcut. We downsample twice, at stages 2 and 3, and the featuremap channels are doubled when downsampling happens. We implement SE-Net [20] by adding a Squeeze-and-Excitation unit on top of each basicblock of ResNet (to form an SE-Block), which uses the cross channel information as a feedback loop. We follow the original paper in using a reduction factor of 16 in the SE-Block. For EncNet, we build a Context Encoding Module on top of each basicblock in ResNet, which uses the global context to predict the scaling factors of the residuals to preserve the identity mapping along the network. For the Context Encoding Module, we first use a 1×1 convolutional layer to reduce the channels by 4 times, then apply an Encoding Layer with concatenation of encoders, followed by an L2 normalization.

| Method | Depth | Params | Error |
|---|---|---|---|
| ResNet (pre-act) [19] | 1001 | 10.2M | 4.62 |
| Wide ResNet 28×10 [56] | 28 | 36.5M | 3.89 |
| ResNeXt-29 16×64d [53] | 29 | 68.1M | 3.58 |
| DenseNet-BC (k=40) [21] | 190 | 25.6M | 3.46 |
| ResNet 64d (baseline) | 14 | 2.7M | 4.93 |
| Se-ResNet 64d (baseline) | 14 | 2.8M | 4.65 |
| EncNet 16k64d (ours) | 14 | 3.5M | 3.96 |
| EncNet 32k128d (ours) | 14 | 16.8M | 3.45 |

Table 6: Comparison of model depth, number of parameters (M), test errors (%) on CIFAR-10. d denotes the dimensions/channels at network stage-1, and k denotes the number of codewords in the Encoding Layer.

For training, we adopt the MSRA weight initialization [18] and use Batch Normalization [23] with weighted layers. We use a weight decay of 0.0005 and momentum of 0.9. The models are trained with a mini-batch size of 128 on two GPUs using a cosine learning rate schedule [21] for 600 epochs. We follow the standard data augmentation [17] for training, which pads the image by 4 pixels along each border and takes random crops of size 32×32. During the training of EncNet, we collect the statistics of the scaling factors $s_k$ of the Encoding Layers and find that they tend to be 0.5 with small variance. In practice, applying a dropout [47]/shakeout [27]-like regularization to $s_k$ can improve the training to reach a better optimum: we randomly assign the scaling factors $s_k$ in the Encoding Layer during the forward and backward passes of training, drawing them from a uniform distribution between 0 and 1, and set $s_k = 0.5$ for evaluation.
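The randomized scaling factors described above can be sketched as a small helper that returns uniform samples during training and the fixed value 0.5 at evaluation time. The function name and the optional `rng` argument are introduced here for illustration only.

```python
import random

def encoding_scale_factors(num_codewords, training, rng=random):
    """Return one scaling factor per codeword.

    Training:   independent draws from a uniform distribution on [0, 1].
    Evaluation: the fixed value 0.5 for every codeword.
    """
    if training:
        return [rng.uniform(0.0, 1.0) for _ in range(num_codewords)]
    return [0.5] * num_codewords
```

The evaluation value equals the mean of the training distribution, in the same spirit as dropout replacing random masks by their expectation at test time.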


Figure 40: Train and validation curves of EncNet-32k64d and the baseline Se-ResNet-64d on CIFAR-10 dataset, plotting error rate as a function of epochs.

We find that our training process (more training epochs with a cosine lr schedule) is likely to improve the performance of all approaches. EncNet outperforms the baseline approaches with similar model complexity. The experimental results demonstrate that the Context Encoding Module improves the feature representations of the network at an early stage using global context, which is hard to learn for a standard network architecture consisting only of convolutional layers, non-linearities and downsampling. Our experiments show that a shallow network of 14 layers with the Context Encoding Module achieves a 3.45% error rate on the CIFAR-10 dataset, as shown in Table 6, which is comparable with state-of-the-art approaches [21, 53].

4 Conclusion

To capture and utilize the contextual information for semantic segmentation, we introduce a Context Encoding Module, which selectively highlights the class-dependent featuremaps and "simplifies" the problem for the network. The proposed Context Encoding Module is conceptually straightforward, lightweight and compatible with existing FCN-based approaches. The experimental results have demonstrated superior performance of the proposed EncNet. We hope the strategy of Context Encoding and our state-of-the-art implementation (including baselines, Synchronized Cross-GPU Batch Normalization and Encoding Layer) can be beneficial to scene parsing and semantic segmentation work in the community.


The authors would like to thank Sean Liu from Amazon Lab 126, Sheng Zha and Mu Li from Amazon AI for helpful discussions and comments. We thank Amazon Web Service (AWS) for providing free EC2 access.


Appendix A Implementation Details on Synchronized Cross-GPU Batch Normalization

We implement synchronized cross-GPU batch normalization (SyncBN) in PyTorch [42] using the NVIDIA NCCL Toolkit. Concurrent work also implements SyncBN by first calculating the global mean and then the variance, which requires synchronizing twice in each iteration [34, 43]. Instead, our implementation requires synchronizing only once by applying a simple strategy: for $N$ given input samples $X = \{x_1, \dots, x_N\}$, the variance can be represented by

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2}{N} - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N^2}, \tag{2}$$

where $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$. We first calculate $\sum x_i$ and $\sum x_i^2$ individually on each device, then the global sums are calculated by applying an all-reduce operation. The global mean and variance are calculated using Equation 2 and the normalization is performed for each sample, $y_i = \gamma \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$ [23]. Similarly, we synchronize once for the gradients of $\sum x_i$ and $\sum x_i^2$ during back-propagation.
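The single-synchronization strategy can be illustrated with a small simulation in which each "device" is a plain Python list and the all-reduce is an ordinary sum over the per-device partial results. This is a sketch of the arithmetic only; it does not use NCCL or PyTorch, and the function name is hypothetical.

```python
def synced_batch_stats(device_batches):
    """Compute the global mean and variance from per-device partial sums.

    device_batches: one list of scalar activations per device.
    Each device contributes (count, sum, sum of squares); the three values
    are summed across devices in a single reduction step.
    """
    # Local statistics, computed independently on every device.
    local = [(len(b), sum(b), sum(v * v for v in b)) for b in device_batches]
    # Stand-in for the all-reduce: element-wise sum over devices.
    count = sum(item[0] for item in local)
    total = sum(item[1] for item in local)
    total_sq = sum(item[2] for item in local)
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance
```

Because the sum and the sum of squares are exchanged together, one communication round is enough, whereas computing the mean first and the variance afterwards needs two.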