OCNet achieves the state-of-the-art scene parsing performance on both Cityscapes and ADE20K.
Context is essential for various computer vision tasks. State-of-the-art scene parsing methods have exploited context defined at the image level. Such context carries a mixture of objects belonging to different categories. Since the label of each pixel P is defined as the category of the object it belongs to, we propose the pixel-wise Object Context, which consists of the objects belonging to the same category as pixel P. The representation of pixel P's object context is the aggregation of the features of all pixels sharing the same category with P. Since the ground-truth objects that pixel P belongs to are unavailable, we employ the self-attention method to approximate the objects by learning a pixel-wise similarity map. We further propose the Pyramid Object Context and Atrous Spatial Pyramid Object Context to capture context at multiple scales. Based on the object context, we introduce OCNet and show that it achieves state-of-the-art performance on both the Cityscapes and ADE20K benchmarks. The code of OCNet will be made available at https://github.com/PkuRainBow/OCNet.
Scene parsing is a fundamental topic in computer vision and is critical for various challenging tasks such as autonomous driving and virtual reality. The goal is to predict the label of each pixel, i.e., the category label of the object that the pixel belongs to.
Various techniques based on deep convolutional neural networks have been developed for scene parsing since the pioneering fully convolutional network approach. There are two main paths to tackle the segmentation problem. The first path is to raise the resolution of response maps for improving the spatial precision, e.g., through dilated convolutions [2, 31]. The second path is to exploit the context [2, 31, 34] for improving the labeling robustness, which our work belongs to.
Existing representative works mainly exploit the context formed from spatially nearby or sampled pixels. For instance, the pyramid pooling module in PSPNet partitions the feature maps into multiple regions, and the pixels lying within each region are regarded as the context of the pixels belonging to that region. The atrous spatial pyramid pooling module (ASPP) in DeepLabv3 regards spatially regularly sampled pixels at different atrous rates as the context of the center pixel. Such spatial context is a mixture of pixels that might belong to different object categories, so the representations obtained from context aggregation are of limited reliability for label prediction.
Motivated by the fact that the label of a pixel in an image is the category of the object that the pixel belongs to, we present a so-called object context for each pixel, which is the set of pixels that belong to the same object category as that pixel. We propose a novel object context pooling (OCP) to aggregate the information according to the object context. We compute a similarity map for each pixel, where each similarity score indicates the degree to which the corresponding pixel and that pixel belong to the same category. We call this similarity map the object context map, which serves as a surrogate of the true object context. Figure 1 shows several examples of object context maps.
We exploit the object context to update the representation for each pixel. The implementation of object context pooling, inspired by the self-attention approach [14, 23], computes the weighted summation of the representations of all the pixels contained in the object context, with the weights from the object context map.
We further present two extensions: (i) pyramid object context, which performs object context pooling in each region of the spatial pyramid and follows the pyramid design introduced in PSPNet; (ii) atrous spatial pyramid object context, which combines ASPP and object context pooling. We demonstrate our proposed approaches with state-of-the-art performance on two challenging scene parsing datasets, Cityscapes and ADE20K, and the challenging human parsing dataset LIP.
There exist two main challenges, (i) resolution: there exists a huge gap between the output feature map’s resolution and the input image’s resolution. (e.g., the output feature map of ResNet- is or of the input image’s size when we use dilated convolution  or not.) (ii) multi-scale: there exist objects of various scales, especially in the urban scene images such as Cityscapes . Most of the recent works are focused on solving these two challenges.
To handle the problem of resolution, we adopt the dilated convolution within OCNet by following the same settings of PSPNet and DeepLabv3. Besides, it is important to capture information of multiple scales to alleviate the problem caused by multi-scale objects. PSPNet applies PPM (pyramid pooling module) while DeepLabv3 employs the image-level feature augmented ASPP (atrous spatial pyramid pooling). OCNet captures the multi-scale context information by employing object context pooling over regions of multiple scales.
Context plays an important role in various computer vision tasks and takes various forms, such as global scene context, geometric context, relative location, 3D layout and so on. Context has been investigated for both object detection [5, 17] and part detection.
The importance of context for semantic segmentation is also verified in recent works [16, 34, 3]. In the literature of semantic segmentation, we define the context as a set of pixels. In particular, we can partition the conventional context into two kinds: (i) nearby spatial context: ParseNet treats all the pixels over the whole image as the context, and PSPNet employs pyramid pooling over sub-regions of four pyramid scales, where all the pixels within the same sub-region are treated as the context for the pixels belonging to that sub-region. (ii) sampled spatial context: DeepLabv3 employs multiple atrous convolutions with different atrous rates to capture spatial pyramid context information and regards these spatially regularly sampled pixels as the context. Both kinds of context are defined over rigid rectangular regions and carry pixels belonging to various object categories.
Different from the conventional context, object context is defined as the set of pixels belonging to the same object category.
Attention is widely used for various tasks such as machine translation, visual question answering and video classification. The self-attention [14, 23] method calculates the context at one position as a weighted sum of all positions in a sentence. Wang et al. further proposed the non-local neural network  for vision tasks such as video classification, object detection and instance segmentation based on the self-attention method.
Our work is inspired by the self-attention approach, and we mainly employ the self-attention method to learn the object context map recording the similarities between all the pixels and the associated pixel. The concurrent DANet also exploits the self-attention method for segmentation. OCNet outperforms DANet on the test set of Cityscapes, and DANet is not evaluated on the ADE20K and LIP benchmarks.
Besides, the concurrent work PSANet  is also different from our method. The PSANet constructs the pixel-wise attention map based on each pixel independently while OCNet constructs the object context map by considering the pair-wise similarities among all the pixels.
Given an image , the goal of scene parsing is to assign a label to each pixel, where the label is the category of the object the pixel belongs to, outputting a segmentation (or label) map .
Pipeline. Our approach feeds the input image to a fully convolution network (e.g., a part of a ResNet), outputting a feature map of size , then lets the feature map go through an object context module, yielding an updated feature map , next predicts the label for each pixel according to the updated feature map, and up-samples the label map for times at last. The whole structure is called OCNet, and our key contribution to scene parsing lies in the object context module. The pipeline is given in Figure 2 (a).
The intuition of the object context is to represent a pixel by exploiting the representations of other pixels lying in the object that belongs to the same category.
Object context pooling. The object context pooling includes two main steps: object context estimation and object context aggregation.
(i) Object context estimation. The object context for each pixel is defined as a set of pixels that belong to the same object category as the pixel. We compute an object context map for the pixel, denoted in a vector form (we use the vector form to represent the 2D map for description convenience), indicating the degree to which each other pixel and the pixel belong to the same object category. The object context map is a surrogate for the true object context. The computation of object context map is given as follows,
where and are the representation vectors of the pixels and . The normalization number is a summation of all the similarities: , where . and are the query transform function and the key transform function.
(ii) Object context aggregation. We construct the object context representation of the pixel by aggregating the representations of the pixels according to the object context map as below,
where is the value transform function following the self-attention.
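The two steps above can be illustrated with a minimal NumPy sketch. It assumes the similarity is the exponentiated dot product of the query- and key-transformed features, normalized over all pixels, as in standard self-attention; the linear transforms `w_q`, `w_k`, `w_v` and all sizes are hypothetical stand-ins rather than the paper's actual layers.

```python
import numpy as np

def object_context_pooling(x, w_q, w_k, w_v):
    """x: (N, C) features of N = H*W pixels; w_q, w_k: (C, D); w_v: (C, Dv)."""
    q = x @ w_q                                   # query transform
    k = x @ w_k                                   # key transform
    v = x @ w_v                                   # value transform
    sim = q @ k.T                                 # (N, N) pair-wise similarities
    sim = sim - sim.max(axis=1, keepdims=True)    # numerical stability only
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ v, weights                   # aggregated context, context map

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
height, width, channels, dim = 4, 5, 8, 4
feat = rng.normal(size=(height * width, channels))
context, context_map = object_context_pooling(
    feat,
    rng.normal(size=(channels, dim)),
    rng.normal(size=(channels, dim)),
    rng.normal(size=(channels, channels)),
)
```

Row `p` of `context_map` plays the role of the object context map of pixel `p`, and row `p` of `context` is its aggregated representation.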
Base object context. We employ an object context pooling to aggregate the object context information according to the object context map of each pixel, and concatenate the output feature map by OCP with the input feature map as the output. We call the resulting method as Base-OC. More details are illustrated in Figure 2 (b).
Pyramid object context. We partition the image into regions using four pyramid scales: region, regions, regions, and regions, which is similar to PSPNet , and we update the feature maps for each scale by feeding the feature map of each region into the object context pooling separately, then we combine the four updated feature maps together. The pyramid object context module has the capability of purifying the object context map by removing spatially far but appearance similar pixels that belong to different object categories. Finally, we concatenate the multiple pyramid object context representations with the input feature map. We call the resulting method as Pyramid-OC. More details are illustrated in Figure 2 (c).
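As a rough illustration of the region-wise pooling described above, the following sketch splits a feature map into a grid for each pyramid scale, runs an attention-style pooling inside every region separately, and concatenates the results with the input. The scales `(1, 2)`, the identity query/key/value transforms, and the function names are assumptions made for brevity, not the paper's configuration.

```python
import numpy as np

def attention_pool(x):
    """Softmax-weighted sum over the rows of x (N, C), using identity transforms."""
    sim = x @ x.T
    sim = sim - sim.max(axis=1, keepdims=True)
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

def pyramid_object_context(feat, scales=(1, 2)):
    """feat: (H, W, C). Returns (H, W, C * (1 + len(scales)))."""
    h, w, c = feat.shape
    outputs = []
    for s in scales:
        out = np.zeros_like(feat)
        row_edges = np.linspace(0, h, s + 1).astype(int)
        col_edges = np.linspace(0, w, s + 1).astype(int)
        for i in range(s):
            for j in range(s):
                r0, r1 = row_edges[i], row_edges[i + 1]
                c0, c1 = col_edges[j], col_edges[j + 1]
                # Pixels only attend to pixels inside the same region.
                region = feat[r0:r1, c0:c1].reshape(-1, c)
                pooled = attention_pool(region)
                out[r0:r1, c0:c1] = pooled.reshape(r1 - r0, c1 - c0, c)
        outputs.append(out)
    # Concatenate the input with the per-scale context representations.
    return np.concatenate([feat] + outputs, axis=-1)

feat = np.random.default_rng(0).normal(size=(6, 8, 4))
fused = pyramid_object_context(feat)
```

Restricting attention to a region is what removes spatially distant but similar-looking pixels from a pixel's context map.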
Combination with ASPP. The atrous spatial pyramid pooling (ASPP) consists of five branches: an image-level pooling branch, a convolution branch and three dilated convolution branches with dilation rates being , and , respectively over the feature map with output stride of. We connect the four branches other than the image-level pooling branch in parallel with our object context pooling, resulting in a method which we name ASP-OC. More details are illustrated in Figure 2 (d).
Backbone. We use the ResNet- pretrained over the ImageNet dataset as the backbone, and make some modifications by following PSPNet: replace the convolutions within the last two blocks by dilated convolutions with dilation rates being and , respectively, so that the output stride becomes .
Object context module. We construct the Base-OC module, Pyramid-OC module and ASP-OC module by employing an extra convolution on the output feature map of Base-OC, Pyramid-OC and ASP-OC.
The detailed architecture of Base-OC module is given as follows. Before feeding the feature map into the OCP, we employ a dimension reduction module (a convolution) to reduce the channels of the feature maps output from the backbone from to . Then we feed the updated feature map into the OCP and concatenate the output feature map of the OCP with the input feature map to the OCP. We further employ a convolution to decrease the channels of the concatenated feature map from to .
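The data flow of the Base-OC module (reduce channels, pool, concatenate, reduce again) can be sketched as follows. The channel counts are not recoverable from the text, so the sizes 16 and 8 are hypothetical; the sketch also assumes pointwise (1x1) convolutions with random weights and identity attention transforms purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise_conv(x, c_out):
    """A 1x1 convolution on (N, C_in) pixel features is a per-pixel linear map."""
    weight = rng.normal(scale=0.1, size=(x.shape[1], c_out))
    return x @ weight

def self_attention(x):
    """Stand-in for object context pooling with identity transforms."""
    sim = x @ x.T
    sim = sim - sim.max(axis=1, keepdims=True)
    weights = np.exp(sim)
    return (weights / weights.sum(axis=1, keepdims=True)) @ x

def base_oc_module(backbone_feat, reduced=8, out_channels=8):
    x = pointwise_conv(backbone_feat, reduced)        # dimension reduction
    context = self_attention(x)                       # object context pooling
    fused = np.concatenate([context, x], axis=1)      # OCP output + OCP input
    return pointwise_conv(fused, out_channels)        # final channel reduction

module_out = base_oc_module(rng.normal(size=(30, 16)))
```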
For the Pyramid-OC module, we also employ a convolution to reduce the channels from to in advance, then we feed the dimension reduced feature map to the Pyramid-OC and employ four different pyramid partitions ( region, regions, regions, and regions) on the input feature map, and we concatenate the four different output object context feature maps output by the four parallel OCPs. Each one of the four object context feature maps has channels. We employ a convolution to increase the channel of the input feature map from to and concatenate it with all the four object context feature maps. Lastly, we employ a convolution on the concatenated feature map with channels and produce the final feature map with channels.
For the ASP-OC module, we only employ the dimension reduction within the object context pooling branch, where we employ a convolution to reduce the channel from to . The output feature map from object context pooling module has channels. For the other four branches, we exactly follow the original ASPP module and employ a convolution within the second above branch and dilated convolution with different dilation rates (, , ) in the remained three parallel branches except that we change the output channel from to in all of these four branches. To ensure the fairness of our experiments, we also increase the channel dimension from to within the original ASPP in all of our experiments. Lastly, we concatenate these five parallel output feature maps and employ a convolution to decrease the channel of the concatenated feature map from to .
Dataset. The Cityscapes dataset targets urban scene understanding. It contains classes, and only classes of them are used for scene parsing evaluation. The dataset contains high quality pixel-level finely annotated images and coarsely annotated images. The finely annotated images are divided into images for training, validation and testing.
Training settings. We set the initial learning rate as and weight decay as by default. The original image size is and we choose crop size as following PSPNet. Unless otherwise specified, all the baseline experiments use only the train-fine images as the training set. The batch size is 8, and we choose InPlaceABNSync to synchronize the mean and standard deviation of BN across multiple GPUs in all the experiments. We employ K training iterations, which take about hours with 4 P100 GPUs.
Similar to the previous works , we employ the "poly" learning rate policy, where the learning rate is multiplied by . For data augmentation, we only apply random horizontal flipping and random scaling in the range of .
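For reference, the "poly" policy is commonly written as the base learning rate scaled by `(1 - iter / max_iter) ** power`. The sketch below uses that common form; the exponent `0.9` and the example base rate are assumptions, since the exact values are not recoverable from the text.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Learning rate at a given iteration under the common 'poly' schedule."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# Hypothetical base rate and schedule length, for illustration only.
schedule = [poly_lr(0.01, it, 100) for it in range(0, 101, 25)]
```

The rate starts at `base_lr` and decays smoothly to zero at the final iteration.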
Loss function. We employ class-balanced cross entropy loss on both the final output of OCNet and the intermediate feature map output from resb, where the weight over the final loss is and the auxiliary loss is following the original settings proposed in PSPNet .
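A minimal sketch of combining a main loss with an auxiliary (deep supervision) loss is given below. It assumes "class-balanced" means a per-class weighted cross entropy; the class weights and the loss weights `1.0` and `0.4` are hypothetical placeholders, since the actual values are missing from the text.

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_weights):
    """logits: (N, K); labels: (N,) ints; class_weights: (K,). Weighted mean loss."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(labels)), labels]   # log-prob of true class
    weights = class_weights[labels]
    return float(-(weights * picked).sum() / weights.sum())

def total_loss(main_logits, aux_logits, labels, class_weights,
               main_weight=1.0, aux_weight=0.4):
    """Weighted sum of the final-output loss and the intermediate-output loss."""
    main = weighted_cross_entropy(main_logits, labels, class_weights)
    aux = weighted_cross_entropy(aux_logits, labels, class_weights)
    return main_weight * main + aux_weight * aux

labels = np.array([0, 2, 1, 1])
uniform = np.zeros((4, 3))                      # uniform predictions over 3 classes
loss = total_loss(uniform, uniform, labels, np.array([1.0, 2.0, 0.5]))
```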
|Method||Train. mIoU ()||Val. mIoU ()|
|ResNet- + GP |
|ResNet- + PPM |
|ResNet- + ASPP |
|ResNet- + Base-OC|
|ResNet- + Pyramid-OC|
|ResNet- + ASP-OC|
|OHEM||Ms + Flip||w/ Val||Fine-tuning||Test. mIoU ()|
Training with only the train-fine datasets.
Training with both the train-fine and val-fine datasets.
Object context vs. PPM and ASPP. To evaluate the effectiveness of OCNet, we conduct a set of baseline experiments on Cityscapes. In particular, we run all of the experiments three times and report the mean and the variance to ensure that our results are reliable. ResNet- + GP denotes employing the global average pooling based context following ParseNet; ResNet- + PPM denotes PSPNet, which applies the pyramid pooling module on feature maps of multiple scales; and ResNet- + ASPP follows DeepLabv3, which incorporates the image-level global context into the ASPP module, except that we increase the output channel of ASPP from to in all of our experiments to ensure fairness.
We compare these three methods with the object context module based methods: Base-OC, Pyramid-OC and ASP-OC. The related experimental results are reported in Table 1, where all the results are based on single scale testing. The performance of both PSPNet and DeepLabv3 is comparable with the numbers in the original papers.
According to the performance on the validation set, we find that our basic method ResNet- + Base-OC can outperform the previous state-of-the-art methods such as PSPNet and DeepLabv3. We can further improve the performance with the ASP-OC module. For example, the ResNet- + ASP-OC achieves about on the validation set based on single scale testing and improves about point over DeepLabv3 and points over PSPNet.
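The mIoU metric used throughout these comparisons can be computed from a confusion matrix, as in the short sketch below. This is the standard definition (per-class intersection over union, averaged over classes that appear), not code from the OCNet release; the toy labels are made up.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # conf[t, p] counts pixels of true class t predicted as class p.
    conf = np.bincount(target * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0
    return float((inter[valid] / union[valid]).mean())

score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy case the per-class IoUs are 1/2 and 2/3, so `score` is their mean, 7/12.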
Ablation study. Based on the ResNet- + ASP-OC method (mIoU= on the Val./Test. set), we adopt the online hard example mining (OHEM), multi-scale (Ms), left-right flipping (Flip) and training with validation set (w/ Val) to further improve the performance on the test set. All the related results are reported in Table 2.
OHEM: Following the previous works , the hard pixels are defined as the pixels associated with probabilities smaller than over the correct classes. Besides, we need to keep at least pixels within each mini-batch when few pixels are hard pixels. For example, we set and on Cityscapes, which improves mIoU on the validation set and mIoU on the test set.
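The selection rule described above can be sketched as a mask over pixels. The threshold and minimum-kept values are missing from the text, so the numbers in the example are hypothetical, and keeping the lowest-probability pixels as the fallback is one common way to realize the "keep at least" rule, assumed here.

```python
import numpy as np

def ohem_mask(true_class_prob, threshold, min_kept):
    """Boolean mask of pixels that contribute to the loss."""
    prob = np.asarray(true_class_prob, dtype=float).ravel()
    hard = prob < threshold
    if hard.sum() < min_kept:
        # Too few hard pixels: keep the min_kept pixels with the lowest probability.
        keep = np.argsort(prob)[:min_kept]
        hard = np.zeros_like(hard)
        hard[keep] = True
    return hard

probs = [0.1, 0.9, 0.8, 0.95]
mask_fallback = ohem_mask(probs, threshold=0.7, min_kept=2)
mask_plain = ohem_mask(probs, threshold=0.7, min_kept=1)
```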
Ms + Flip: We further apply the left-right flipping and multiple scales including to improve the performance from to on the test set.
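Multi-scale and flip testing is typically implemented by averaging class scores over rescaled and mirrored copies of the input. The sketch below assumes that form; the scale set, the nearest-neighbor resizing, and the toy `model` are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, ...) array."""
    rows = np.arange(out_h) * arr.shape[0] // out_h
    cols = np.arange(out_w) * arr.shape[1] // out_w
    return arr[rows][:, cols]

def multi_scale_flip_predict(model, image, scales=(0.5, 1.0, 2.0)):
    """Average per-pixel class scores over scales and left-right flips."""
    h, w = image.shape[:2]
    total = 0.0
    for s in scales:
        scaled = resize_nearest(image, max(1, int(h * s)), max(1, int(w * s)))
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            scores = model(inp)
            if flip:
                scores = scores[:, ::-1]          # undo the flip on the output
            total = total + resize_nearest(scores, h, w)
    return total / (2 * len(scales))

def toy_model(img):
    """Hypothetical per-pixel scorer producing two class scores."""
    gray = img.mean(axis=-1)
    return np.stack([gray, 1.0 - gray], axis=-1)

image = np.random.default_rng(0).random((4, 6, 3))
averaged = multi_scale_flip_predict(toy_model, image)
labels = averaged.argmax(axis=-1)
```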
Training w/ validation set: We can further improve the performance on test set by employing the validation set for training. We train the OCNet for 80K iterations on the mixture of training set and validation set and improve the performance from to on the test set.
|Method||mIoU ()||Pixel Acc ()|
|ResNet- + GP |
|ResNet- + PPM |
|ResNet- + ASPP |
|ResNet- + Base-OC|
|ResNet- + Pyramid-OC|
|ResNet- + ASP-OC|
Results. We compare the OCNet with the current state-of-the-art methods on the Cityscapes. The results are illustrated in Table 3 and we can see that our method achieves better performance over all the previous methods based on ResNet-. OCNet without using the validation set achieves even better performance than most methods that employ the validation set. Through employing the validation set and fine-tuning strategies, OCNet achieves new state-of-the-art performance of on the test set and outperforms the DenseASPP based on DenseNet- by over point.
Dataset. The ADE20K dataset  is used in ImageNet scene parsing challenge 2016, which contains classes and diverse scenes with image-level labels. The dataset is divided into K/K/K images for training, validation and testing.
Training setting. We set the initial learning rate as and weight decay as by default. Because the images in ADE20K are of various sizes, the input image is resized to a length randomly chosen from the set . The batch size is 8, and we also synchronize the mean and standard deviation of BN across multiple GPUs. We employ 100K training iterations, which take about hours with ResNet- and hours with ResNet- based on 4 P100 GPUs.
The experiments on ADE20K are based on the open-source implementation . Following the previous works [34, 3], we employ the same "poly" learning rate policy and data augmentation methods, and employ deep supervision on the intermediate feature map output from resb.
Object context vs. PPM and ASPP. We follow the same settings as the previous comparison experiments on Cityscapes. We also re-run all of the experiments for three times and report the mean and the variance. We compare the ResNet- + GP, ResNet- + PPM and ResNet- + ASPP with ResNet- + Base-OC, ResNet- + Pyramid-OC and ResNet- + ASP-OC. The related experimental results on ADE20K are reported in Table 4, where all the results are based on single scale testing.
The performance of both PSPNet and DeepLabv3 is comparable with the numbers reported in the original paper. We can see that both ResNet- + Pyramid-OC and ResNet- + ASP-OC achieve better performance compared with the ResNet- + Base-OC, which verifies the effectiveness of considering the multi-scale context information. Especially, ResNet- + Pyramid-OC improves the ResNet- + PPM by about point while ResNet- + ASP-OC improves the ResNet- + ASPP by about points.
Results. To compare with the state-of-the-art, we replace the ResNet- with ResNet- and further employ the multi-scale, left-right flipping strategies to improve the performance. According to the reported results in Table 5, OCNet improves the previous ResNet- based state-of-the-art method EncNet by about points, and OCNet also improves the PSPNet based on ResNet- by about points.
Dataset. The LIP (Look into Person) dataset  is employed in the LIP challenge 2016 for single human parsing task, which contains images with classes ( semantic human part classes and background class).
Training setting. We set the initial learning rate as and weight decay as following the CE2P . The original images are of various sizes and we resize all the images to . The batch size is and we also employ the InPlaceABNSync. We employ 110K training iterations, which take about hours with 4 P100 GPUs. We also employ the same (i) "poly" learning rate policy, (ii) data augmentation methods and (iii) deep supervision on the intermediate feature map output from resb, following the experiments on Cityscapes and ADE20K.
Results. We evaluate the OCNet (ResNet- + ASP-OC) on the LIP benchmark and report the related results in Table 6. We can observe that the OCNet improves points over the previous state-of-the-art methods on the validation set of LIP. Notably, the human parsing task differs from the previous two scene parsing tasks, as it is about labeling each pixel with the part category that it belongs to. The state-of-the-art results verify that OCNet generalizes well to part-level semantic segmentation tasks.
We randomly choose some examples from the validation set of Cityscapes and visualize the object context map learned within OCNet in the first five rows of Figure 3, where each object context map corresponds to the pixel marked with red ✙ in both the original images and ground-truth segmentation maps.
As illustrated in Figure 3, we can find that the estimated object context maps for most classes capture the object context that mainly consists of pixels of the same categories. Take the image on the row as an example, it can be seen that the object context map corresponding to the pixel on the object bus distributes most of the weights on the pixels lying on the object bus and thus the object bus’s context information can help the pixel-wise classification.
Besides, we also illustrate some examples from the ADE20K and LIP in the middle three rows and the last three rows of Figure 3. It can be seen that most of the weights within each object context map are focused on the objects or parts that the selected pixel belongs to.
In this work, we present the concept of object context and propose the object context pooling (OCP) scheme to construct more robust context information for semantic segmentation tasks. We verify that the predicted object context maps within OCP distribute most of the weights on the true object context by visualizing multiple examples. We further demonstrate the advantages of OCNet with state-of-the-art performance on three challenging benchmarks including Cityscapes, ADE20K and LIP.