Consensus Feature Network for Scene Parsing

07/29/2019 · Tianyi Wu, et al. · Institute of Computing Technology, Chinese Academy of Sciences; Baidu, Inc.

Scene parsing is challenging as it aims to assign one of the semantic categories to each pixel in scene images. Thus, pixel-level features are desired for scene parsing. However, classification networks are dominated by the discriminative portion, so directly applying classification networks to scene parsing will result in inconsistent parsing predictions within one instance and among instances of the same category. To address this problem, we propose two transform units to learn pixel-level consensus features. One is an Instance Consensus Transform (ICT) unit that learns instance-level consensus features by aggregating features within the same instance. The other is a Category Consensus Transform (CCT) unit that pursues category-level consensus features by keeping the consensus of features among instances of the same category in scene images. The proposed ICT and CCT units are lightweight, data-driven, and end-to-end trainable. The features learned by the two units are more coherent at both the instance level and the category level. Furthermore, we present the Consensus Feature Network (CFNet) based on the proposed ICT and CCT units. Experiments on four scene parsing benchmarks, including Cityscapes, Pascal Context, CamVid, and COCO Stuff, show that the proposed CFNet learns pixel-level consensus features and obtains consistent parsing results.


1 Introduction

Figure 1: Visualization using t-SNE [27] to illustrate features learned by FCN (ResNet-101) and the proposed CFNet. (a) Input image, in which regions B and C are within the same instance, and regions A and C belong to the same category. (b) Ground truth. (c) Features of A, B and C learned by FCN are far apart. (d) FCN produces inconsistent parsing predictions within one instance and among instances of the same category. (e) Features of A, B and C learned by the proposed CFNet are coherent and indistinguishable. (f) The proposed CFNet produces consistent parsing predictions within and across instances of the same category. (Best viewed in color)

Scene parsing has been an essential component of scene understanding and can play a crucial role in applications such as autonomous driving, automatic navigation, and virtual reality. The goal of scene parsing is to label each pixel with one of the semantic categories, including not only discrete objects (e.g., car, bicycle, people) but also stuff (e.g., road, sky, bench).

Deep Convolutional Neural Networks (DCNNs) have achieved remarkable progress in semantic segmentation and scene parsing. Currently, most successful methods for scene parsing are based on classification networks [36, 17, 19]. However, there are limitations in taking classification networks as the feature extractor for scene parsing. Classification networks tend to learn an image-level representation of the whole input. Moreover, previous works [49, 53, 18] show that the image-level representation is often dominated by the discriminative portion of the foreground or predominant objects, e.g., the horse's head or the dog's face. However, scene parsing aims to parse both discrete objects and stuff, so pixel-level features are desired. Therefore, directly applying classification networks to scene parsing results in two drawbacks, as shown in Fig. 1(d): (1) The intra-class features at different spatial positions of dominant objects are not consistent, leading to inconsistent parsing predictions within one instance. (2) The inter-class features of non-discriminative regions (e.g., subordinate objects and stuff) are easily confused, resulting in inconsistent predictions among instances of the same category.

To address the above problem, we aim to learn pixel-level consensus features for scene parsing. The consensus features are inspired by neighborhood consensus [37, 54, 34, 8, 30, 31], which finds reliable dense correspondences between a pair of images for object matching. In this work, we aim to learn consensus features that are indistinguishable for pixels within an instance or a category. The consensus features have two aspects: instance-level and category-level. As shown in Fig. 1(a), (1) features of regions in the same instance (e.g., B and C) should keep the instance-level consensus, and (2) features of regions in different instances of the same category (e.g., A and C) should maintain the category-level consensus.

To learn the consensus features, we propose two consensus transform units: the Instance Consensus Transform (ICT) unit and the Category Consensus Transform (CCT) unit. The ICT unit is expected to learn instance-level consensus features. Specifically, we introduce a lightweight local network (abbreviated as LN) to generate instance-level transform parameters for each pixel using surrounding contextual information. Then we apply the instance-level transform parameters to aggregate features within the same instance. On the other hand, since there are usually multiple instances of the same category in scene images, we employ the CCT unit to pursue category-level consensus features. Specifically, we introduce a lightweight global network (abbreviated as GN) to generate the category-level transform parameters. Different from LN, GN aims to model the interaction of each location with respect to all other locations. The proposed two units are learned in a data-driven manner without any extra supervision. We update the features at all positions with these two units. For each position, the two units adaptively strengthen the information from relevant locations (regarded as foreground) and suppress irrelevant locations (regarded as background). Thus, the consensus features are indistinguishable within the foreground and invariant to background variations. Compared with FCN in Fig. 1(c), the features learned by the two units are more coherent at the instance level and the category level, as shown in Fig. 1(e). Meanwhile, the inconsistent parsing predictions in Fig. 1(d) are corrected by the proposed method, as shown in Fig. 1(f).

Based on the proposed ICT and CCT units, we present a new scene parsing framework, called the Consensus Feature Network (CFNet), to learn pixel-level consensus features and obtain consistent parsing results. We demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance on four challenging scene parsing datasets: Cityscapes [12], PASCAL Context [28], CamVid [6], and COCO Stuff [7].

2 Related Work

In 2015, Long et al. proposed FCN [26], the first approach to adapt classification networks for dense prediction and end-to-end training. Since then, how to better adjust classification networks for scene parsing has attracted more and more attention. Hence, we review several lines of research related to this work.

Contextual information plays a vital role in scene understanding [3, 40]. Recent works [10, 13, 47, 50] have shown that contextual information helps models make better local decisions. One direction is to append context aggregation modules to learn contextual information. Liu et al. proposed ParseNet [25], which uses global context to augment the feature at each location. Chen et al. [9] introduced atrous spatial pyramid pooling to learn contextual information. Zhao et al. [55] introduced a pyramid pooling module to exploit global information from different subregions. Zhang et al. [51] proposed to use context to refine inconsistent parsing results iteratively. More recently, Ding et al. [13] proposed a context contrasted local feature that not only leverages the informative context but also spotlights the local information in contrast to the context. Zhang et al. [50] introduced a Context Encoding Module that captures global context and selectively highlights class-dependent feature maps. Zhao et al. [56] proposed to relax the local neighborhood constraint to enhance information flow. Fu et al. [14] and Yuan et al. [48] proposed self-attention-based position modules to learn the global interdependencies of features. In contrast to these methods, we exploit surrounding contextual information and long-range dependencies to generate the parameters of consensus transforms.

Figure 2: An overview of the Consensus Feature Network (CFNet). (a) Network architecture. We take ResNet-101 as the backbone; the ICT and CCT units are inserted into ResNet-101 after Res3 and Res4 for learning the consensus features. (b) Components of the Instance Consensus Transform (ICT) unit. (c) Components of the Category Consensus Transform (CCT) unit. The ICT and CCT units are applied to pursue instance-level and category-level consensus, respectively. A residual connection is employed in both units, which improves gradient propagation.

Another relevant line of related work is how to suppress responses from the background. Earlier works relied on handcrafted features to achieve a similar property. For example, Trulls et al. [39] proposed an embedding method in which the Euclidean distance measures how likely it is that two pixels belong to the same region. Harley et al. [15] designed an embedding space for estimating pair-wise semantic similarity and used a contrastive side loss to train the "embedding" branch. Following this, segmentation-aware convolution [16] was proposed to attend to inputs according to local masks. In these works, the embeddings are defined in a handcrafted manner, or a specific loss function is required to guide training. Instead, we use a neural network to learn the desired transforms automatically without adding any extra supervision, and the learned transforms are adaptive to test examples.

Neighborhood consensus is a strategy for match filtering, introduced to decide whether a match is correct or not. Zhang et al. [54] proposed to analyze the patterns of distances between neighboring matches. In similar work, Schmid et al. [34] analyzed the patterns of angles between neighboring matches. Later, the number of locally consistent matches [4] was proposed for measuring neighborhood consensus. More recently, Rocco et al. [31] developed a neighborhood consensus network to learn neighborhood consensus constraints, which analyzes the full set of dense matches between a pair of images and learns patterns of locally consistent correspondences. Motivated by this idea, we propose the consensus transforms, which analyze pixel-wise feature matches and transform features within each instance or among instances of the same category.

3 Approach

In this section, we present the details of the proposed Consensus Feature Network (CFNet) for scene parsing. First, we will introduce the general framework of the proposed method. Then, we will present the ICT and CCT units which are employed to achieve instance-level and category-level consensus, respectively.

3.1 Overview

The network architecture is illustrated in Fig. 2(a). An input image is fed into a classification network (ResNet-101) pre-trained on ImageNet, which is adapted into a fully convolutional fashion [26]. Similar to previous works [55, 50], dilated convolutions are employed in Res3 and Res4 so that the output feature maps are 1/8 of the input size. Previous works [49, 47] have shown that the network encodes finer spatial information in the lower stages and learns richer semantic features in the higher stages. Therefore, we choose semantic-level features to conduct the consensus transforms. To perform the consensus transforms smoothly, we need to reach instance-level consensus before pursuing category-level consensus, so the ICT and CCT units are added after Res3 and Res4, respectively. Then we predict the label for each pixel from the transformed feature maps and finally up-sample the label map by a factor of 8. The proposed ICT and CCT units are described and formulated in detail below.
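For concreteness, the following is a minimal PyTorch-style sketch of this pipeline. The stage attributes (res1 through res4), the prediction heads, and the channel counts are placeholders reflecting the description above rather than the authors' released implementation; the auxiliary head on Res3 follows the loss setup described later in Sec. 4.2.

```python
import torch.nn as nn
import torch.nn.functional as F

class CFNet(nn.Module):
    """Sketch of CFNet: dilated ResNet-101 backbone (output stride 8) with an
    ICT unit after Res3 and a CCT unit after Res4, plus main and auxiliary heads."""
    def __init__(self, backbone, ict_unit, cct_unit, num_classes,
                 res3_channels=1024, res4_channels=2048):
        super().__init__()
        self.backbone = backbone        # assumed to expose res1 ... res4 stages
        self.ict = ict_unit             # Instance Consensus Transform (Sec. 3.2)
        self.cct = cct_unit             # Category Consensus Transform (Sec. 3.3)
        self.aux_head = nn.Conv2d(res3_channels, num_classes, 1)  # auxiliary loss branch
        self.cls_head = nn.Conv2d(res4_channels, num_classes, 1)  # main prediction head

    def forward(self, x):
        size = x.shape[2:]
        x = self.backbone.res2(self.backbone.res1(x))
        x3 = self.ict(self.backbone.res3(x))     # instance-level consensus features
        x4 = self.cct(self.backbone.res4(x3))    # category-level consensus features
        main = F.interpolate(self.cls_head(x4), size, mode='bilinear', align_corners=False)
        aux = F.interpolate(self.aux_head(x3), size, mode='bilinear', align_corners=False)
        return main, aux
```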

Figure 3: Illustration of the Instance Consensus Transform. A response can be reconstructed from the features within the surrounding window. The length of the arrows in the left subfigure indicates the interaction intensity.

3.2 Instance Consensus Transform Unit

To achieve instance-level consensus, we propose the ICT unit to learn instance-level consensus features. We do not employ an object detector pre-trained on an additional dataset to find each object; instead, we approximate this process by conducting the transform within a surrounding window (for a given position). Fig. 3 illustrates this transform: given a target point, its response can be reconstructed from the features in the surrounding window. Specifically, features belonging to the same instance are enhanced, while features of other instances are weakened. From the right subfigure of Fig. 3, we can observe that the transformed feature is more coherent.

The ICT unit uses surrounding contextual information to generate the transform parameters for each spatial location. As shown in Fig. 2(b), for a local feature map $X \in \mathbb{R}^{C \times H \times W}$, with $C$ being the feature dimension and $H \times W$ the spatial size, the ICT unit first applies one convolution layer with $C'$ filters on $X$ to reduce the dimension and save computation, obtaining the feature map $X' \in \mathbb{R}^{C' \times H \times W}$, where $C'$ is the channel number of the reduced feature maps and is less than $C$.

After obtaining the feature map $X'$, the ICT unit further employs a lightweight local network (LN) to generate the parameters of the ICT, $W \in \mathbb{R}^{r^2 \times H \times W}$, where $r$ represents the size of the local region centered on the current spatial position (typically $r = 5$ in our experiments). The size of $W$ varies with the local region size $r$. We expect LN to generate the transform parameters for each pixel from the corresponding surrounding contextual information. We instantiate LN with two convolutional layers: the first convolutional layer captures surrounding contextual information, and its output is fed into the second convolutional layer to generate the parameters of the ICT. The transform parameters are then reshaped into $W' \in \mathbb{R}^{HW \times r^2}$. Meanwhile, an unfold operation is applied to the feature map $X'$ to extract sliding local feature blocks, which are reshaped into the feature map $X'_u \in \mathbb{R}^{HW \times C' \times r^2}$. We define the function $f(\cdot, \cdot)$ as the element-wise multiplication of the tensors $W'$ and $X'_u$ (broadcast over the channel dimension), followed by a summation over the last dimension. The new feature map $Y'$ is generated by

$$Y' = f(W', X'_u) \qquad (1)$$

Next, we reshape $Y'$ to $Y \in \mathbb{R}^{C' \times H \times W}$. In particular, any feature vector $y_i$ in $Y$ at position $i$ is the product of the associated neighbours of $i$ in $X'$ and the corresponding instance-level consensus transform parameters $w_i$, where $\Omega_i$ is an $r \times r$ square window centered at $i$. So the transform operator at each location can be formulated as:

$$y_i = \sum_{j \in \Omega_i} w_{i,j} \, x'_j \qquad (2)$$

where $x'_j$ denotes the feature of $X'$ at position $j \in \Omega_i$. Eqn. (2) encapsulates the transforms of various handcrafted filters in a generalized way. For the bilateral filter [38], $w_{i,j}$ is a Gaussian that jointly captures the RGB and geometric distance between pixels $i$ and $j$. For the mean filter [33], $w_{i,j} = 1/r^2$.

After obtaining the feature map $Y$, we apply one convolution layer with $C$ filters for dimension expansion, so that the output dimension matches the dimension of the input $X$, forming a residual connection.
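A minimal PyTorch sketch of the ICT unit described above is given below. The LN kernel sizes (3x3 followed by 1x1), the ReLU between them, the 1x1 kernels for the reduction and expansion layers, and the reduced channel count are assumptions, since those values are not specified here; the unfold-multiply-sum pattern implements Eqns. (1)-(2).

```python
import torch.nn as nn
import torch.nn.functional as F

class ICTUnit(nn.Module):
    """Sketch of the Instance Consensus Transform: a local network (LN) predicts
    r*r aggregation weights per pixel, which are applied to the unfolded r x r
    neighbourhood of the reduced feature map X' (Eqns. (1)-(2)), followed by
    dimension expansion and a residual connection."""
    def __init__(self, in_channels, reduced_channels, r=5):
        super().__init__()
        self.r = r
        self.reduce = nn.Conv2d(in_channels, reduced_channels, 1)   # C -> C'
        # LN: two conv layers; the 3x3 + 1x1 kernel choice here is an assumption
        self.local_net = nn.Sequential(
            nn.Conv2d(reduced_channels, reduced_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced_channels, r * r, 1),
        )
        self.expand = nn.Conv2d(reduced_channels, in_channels, 1)   # C' -> C

    def forward(self, x):
        b, _, h, w = x.shape
        xr = self.reduce(x)                                     # B x C' x H x W
        weights = self.local_net(xr)                            # B x r^2 x H x W
        weights = weights.view(b, 1, self.r * self.r, h * w)    # broadcast over channels
        patches = F.unfold(xr, self.r, padding=self.r // 2)     # B x (C'*r^2) x HW
        patches = patches.view(b, -1, self.r * self.r, h * w)   # B x C' x r^2 x HW
        y = (weights * patches).sum(dim=2)                      # Eqn. (2): sum over the window
        return x + self.expand(y.view(b, -1, h, w))             # residual connection
```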

3.3 Category Consensus Transform Unit

It is very useful for high-quality scene segmentation to achieve category-level consensus, since there are usually multiple objects of the same class in scene images. For example, in the Cityscapes [12] dataset, there are 7 humans and 14 vehicles per image on average. Therefore, we propose the Category Consensus Transform (CCT) unit to pursue category-level consensus features.

The structure of the CCT unit is illustrated in Fig. 2(c). We deploy a global network (GN) to generate the category-level consensus transform parameters $W_g \in \mathbb{R}^{HW \times H \times W}$, i.e., $H \times W$ parameters for each spatial position. We expect GN to "see" the whole input feature map $X' \in \mathbb{R}^{C' \times H \times W}$ and to model the interaction between each location and all other locations across the whole input feature maps. Natural solutions are to employ a fully connected layer, a global convolution, or stacked large-kernel convolutions. However, these solutions are not very efficient, since they introduce a huge number of parameters or large memory usage.

Inspired by [41], which introduces recurrent neural networks to model region-wise dependency, we instantiate GN with two bidirectional LSTMs (BiLSTMs) and one convolutional layer. We use the first BiLSTM to scan the feature maps in the bottom-up and top-down directions, as shown in Fig. 2(c). It takes a row-wise feature slice as input at each time step and updates its hidden state. A typical LSTM unit contains an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, an output state $h_t$, and an internal memory cell state $c_t$. The rule of scanning can be formulated as follows:

$$(h_t, c_t) = \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1}) \qquad (3)$$

The detailed computation is described as follows:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T \begin{pmatrix} x_t \\ h_{t-1} \end{pmatrix} \qquad (4)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (5)$$

$$h_t = o_t \odot \tanh(c_t) \qquad (6)$$

where $\odot$ denotes the element-wise product, $x_t$ indicates the sliced feature map that is the input to the LSTM at time step $t$, $g_t$ denotes the modulated input, $\sigma$ is the sigmoid activation function, and $T$ is an affine transform consisting of the parameters of the LSTM, whose output dimension is $4n$, where $n$ is the number of LSTM cell state units. The BiLSTM computes the forward hidden sequence $\overrightarrow{h}$ and the backward hidden sequence $\overleftarrow{h}$ by iterating the forward layer from the first to the last time step and the backward layer in the reverse direction simultaneously. The calculations of the BiLSTM can be formulated as follows:

$$\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}, \overrightarrow{c}_{t-1}) \qquad (7)$$
$$\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1}, \overleftarrow{c}_{t+1}) \qquad (8)$$

After the bidirectional sweep, we concatenate the hidden states $\overrightarrow{h}$ and $\overleftarrow{h}$ to get a composite feature map. In a similar manner, we employ the second BiLSTM to sweep over the feature maps horizontally, taking a column-wise feature slice as input at each time step and updating its hidden state. We then concatenate the forward and backward hidden states of the second BiLSTM to get the feature maps $G$, which are taken as the representation of the global interaction between each spatial position and all other locations. Each response in $G$ is thus an activation at a specific location with respect to the whole image. Afterward, the global interaction information $G$ is fed into the Conv layer to generate the transform parameters $W_g$, which are reshaped into $W_g' \in \mathbb{R}^{HW \times HW}$. We define the function $g(\cdot, \cdot)$ as the matrix product of the tensor $X'$ (reshaped into $\mathbb{R}^{C' \times HW}$) and the tensor $W_g'$. The new feature maps $Y_g'$ are generated by

$$Y_g' = g(X', W_g') \qquad (9)$$

where $Y_g' \in \mathbb{R}^{C' \times HW}$. Next, we reshape $Y_g'$ to the feature maps $Y_g \in \mathbb{R}^{C' \times H \times W}$. In particular, any feature vector $y_i$ in $Y_g$ at position $i$ is generated by

$$y_i = \sum_{j=1}^{HW} w_{i,j} \, x'_j \qquad (10)$$

where $i, j \in \{1, 2, \ldots, HW\}$, $w_{i,j}$ is the transform parameter relating positions $i$ and $j$, and $x'_j$ is the feature at location $j$ on the feature maps $X'$.

Note that if $w_{i,j}$ were learned purely from the relationship between $x'_i$ and $x'_j$, Eqn. (10) would be equivalent to the non-local operation [42]. The non-local unit generates an attention map from features that have a limited receptive field. For the CCT unit, the response between any two points is not simply a matter of modeling the relationship between two features, but also reflects the interaction of the other features with them. Furthermore, the responses in the non-local unit are computed by handcrafted pairwise functions (e.g., Gaussian, embedded Gaussian, dot product), whereas our parameters are dynamically generated by the GN, which is adaptive to each test example.
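To make the contrast concrete, a sketch of the embedded-Gaussian weight computation used by a non-local block [42] is shown below; theta and phi stand for the usual 1x1 embedding convolutions. Unlike the GN-generated parameters of the CCT unit, each weight here depends only on the pair of features it relates.

```python
import torch
import torch.nn.functional as F

def nonlocal_weights(x, theta, phi):
    """Embedded-Gaussian pairwise weights of a non-local block:
    w_ij = softmax_j( theta(x_i)^T phi(x_j) ), computed from the two features alone."""
    b, c, h, w = x.shape
    q = theta(x).view(b, -1, h * w)                               # B x C'' x HW
    k = phi(x).view(b, -1, h * w)                                 # B x C'' x HW
    return F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)     # B x HW x HW
```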

After obtaining the global consensus feature maps $Y_g$, we apply one convolution layer with $C$ filters for dimension expansion. Finally, residual learning is employed to improve gradient back-propagation during training.
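The sketch below follows the CCT description above. The exact sweep scheme of the two BiLSTMs, the hidden size, and the layout of the predicted weight tensor are interpretations of the text rather than the authors' implementation; the final matrix product implements Eqns. (9)-(10), and a fixed feature-map size is required because the weight-predicting convolution outputs H*W channels.

```python
import torch
import torch.nn as nn

class CCTUnit(nn.Module):
    """Sketch of the Category Consensus Transform: a global network (GN) made of a
    vertical and a horizontal BiLSTM sweep summarizes global interactions, a conv
    layer predicts H*W aggregation weights per position, and the weights are applied
    to the reduced features as a matrix product (Eqns. (9)-(10)), with a residual."""
    def __init__(self, in_channels, reduced_channels, hidden, feat_h, feat_w):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced_channels, 1)
        self.vert = nn.LSTM(reduced_channels, hidden, bidirectional=True, batch_first=True)
        self.horz = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.to_weights = nn.Conv2d(2 * hidden, feat_h * feat_w, 1)   # H*W weights per position
        self.expand = nn.Conv2d(reduced_channels, in_channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape                                   # h, w must match feat_h, feat_w
        xr = self.reduce(x)                                    # B x C' x H x W
        # vertical sweep: treat each column as a length-H sequence
        v, _ = self.vert(xr.permute(0, 3, 2, 1).reshape(b * w, h, -1))
        v = v.reshape(b, w, h, -1).permute(0, 2, 1, 3)         # B x H x W x 2*hidden
        # horizontal sweep: treat each row as a length-W sequence
        g, _ = self.horz(v.reshape(b * h, w, -1))
        g = g.reshape(b, h, w, -1).permute(0, 3, 1, 2)         # B x 2*hidden x H x W
        # weights[b, j, i] = w_ij: contribution of source position j to target position i
        weights = self.to_weights(g).view(b, h * w, h * w)
        feats = xr.view(b, -1, h * w)                          # B x C' x HW, columns are x'_j
        y = torch.bmm(feats, weights)                          # Eqn. (10): y_i = sum_j w_ij x'_j
        return x + self.expand(y.view(b, -1, h, w))            # residual connection
```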

4 Experiments

To validate the proposed approach, we conduct comprehensive experiments on multiple datasets, including the Cityscapes [12], PASCAL Context [28], CamVid [6], and COCO Stuff [7] datasets. In the following subsections, we first describe the datasets and the experimental settings. Then the contribution of each component is investigated through ablation experiments on the Cityscapes dataset. Finally, we report our results on the four scene parsing benchmarks, i.e., Cityscapes, PASCAL Context, CamVid, and COCO Stuff, and compare our approach with state-of-the-art approaches.

4.1 Datasets

Cityscapes Dataset The dataset contains 5,000 finely annotated images and 20,000 coarsely annotated images collected in street scenes from 50 different cities, targeted at urban scene segmentation. Only the 5,000 finely annotated images are used in our experiments, divided into three subsets: 2,975 training images, 500 validation images, and 1,525 test images. High-quality pixel-level annotations of 19 semantic classes are provided in this dataset.

PASCAL Context Dataset The dataset involves 4,998 images in the training set and 5,105 images in the test set. It provides detailed semantic labels for the whole scene. Similar to [43, 50], the proposed approach is evaluated on the most frequent 59 categories plus 1 background class.

Method mIoU (%)
Res101 (baseline) 74.9
Res101 + ICT (0,5) 74.3
Res101 + ICT (1,5) 75.3
Res101 + ICT (2,5) 75.7
Res101 + ICT (3,5) 78.8
Res101 + ICT (4,5) 78.6
Table 1: Ablation experiments of the ICT unit on the validation set of Cityscapes. ICT represents the Instance Consensus Transform unit. Without loss of generality, "ICT(3,5)" means the ICT unit with r = 5 is inserted into ResNet-101 after Res3.
Method mIoU (%)
Res101 (baseline) 74.9
Res101 + ICT (r = 3) 77.4
Res101 + ICT (r = 5) 78.8
Res101 + ICT (larger r) 76.8
Table 2: Ablation experiments of the local region size r on the validation set of Cityscapes. "r" indicates the size of the local window in the ICT unit.

Method mIoU (%)
Res101 (baseline) 74.9
Res101 + CCT (1x1 Conv) 75.7
Res101 + CCT (Global Conv) 76.9
Res101 + CCT (Large Kernel) 75.3
Res101 + CCT (Stacked Conv) 76.7
Res101 + CCT (BiLSTM) 77.5
Table 3: Comparison of different instantiations of CCT unit on the validation set of Cityscapes. CCT represents the Category Consensus Transform unit.

CamVid Dataset The CamVid dataset is a road scene dataset captured from the perspective of a driving automobile. It involves 367 training images, 101 validation images, and 233 test images, with a resolution of 960 × 720. Following [21, 2, 1, 5], we consider 11 larger semantic classes (road, building, sky, tree, sidewalk, car, column-pole, fence, pedestrian, bicyclist, and sign-symbol) for evaluation.

COCO Stuff Dataset The dataset contains 10,000 images from the Microsoft COCO dataset [24], of which 9,000 images are used for training and 1,000 images for testing. The unlabeled stuff pixels in the original Microsoft COCO images are further densely annotated with an extra 91 classes. Following [13], we evaluate the proposed method on 171 semantic classes, including 80 object and 91 stuff classes annotated for each pixel.

4.2 Experimentation Details

We take ResNet-101 [17] pre-trained on ImageNet as the backbone. Similar to previous works [55, 50], dilated convolutions are employed in Res3 and Res4 so that the output size is 1/8 of the input. The output predictions are upsampled 8 times using bilinear interpolation. Meanwhile, we replace the standard BatchNorm with InPlace-ABN [32] to synchronize the mean and

Method mIoU (%)
Res101 (baseline) 74.9
Res101 + NL 76.8
Res101 + CCT (BiLSTM) 77.5
Table 4: Comparison of Non-local and CCT unit on the validation set of Cityscapes. “NL” indicates non-local unit [42].

Method mIoU (%)
Res101 (baseline) 74.9
Res101 + ICT 78.8
Res101 + CCT 77.5
Res101 + ICT + CCT 79.9
Table 5: Ablation experiments of ICT and CCT on Cityscapes validation set.

standard-deviation of BatchNorm across multiple GPUs. SGD with mini-batches is used for training. Following prior work [55, 50], we use the "poly" learning rate policy, where the learning rate is multiplied by $(1 - \frac{iter}{iter_{max}})^{power}$ with $power = 0.9$. The base learning rate is set to 0.01 for Cityscapes. The momentum is set to 0.9 and the weight decay is set to 0.0001. For data augmentation, we adopt random scaling in the range of [0.5, 2] and then randomly crop the image into a fixed size, using zero padding if necessary. For the loss function, we employ the cross-entropy loss on both the final output of CFNet and the intermediate output from 'Res3'. Similar to the original setting introduced by Zhao et al. [55], the weights of the main loss and the auxiliary loss are set to 1 and 0.4, respectively. Performance is reported using the commonly used mean Intersection-over-Union (mIoU). We use single-scale evaluation to compute mean IoU in all ablation experiments. For final evaluation, we average the network predictions over multiple scales following [9, 55, 50].
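A minimal sketch of this training schedule is given below, assuming the CFNet sketch above that returns main and auxiliary logits. The ignore label and the optimizer/data handling are placeholder assumptions; the poly exponent and the 1 : 0.4 loss weights follow the description above.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=255)   # ignore label is an assumption

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy: lr = base_lr * (1 - iter/max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

def train_step(model, optimizer, images, labels, cur_iter, max_iter, base_lr=0.01):
    for group in optimizer.param_groups:
        group['lr'] = poly_lr(base_lr, cur_iter, max_iter)
    main_out, aux_out = model(images)                # CFNet main and Res3 auxiliary logits
    loss = criterion(main_out, labels) + 0.4 * criterion(aux_out, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```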

4.3 Experiments on Cityscapes

4.3.1 Which stage to add the ICT unit?

Tab. 1 compares the proposed ICT unit added to different stages of ResNet. The unit is added after the last residual block of a stage. We use "(n, r)" to indicate the insert location and the region size of the Instance Consensus Transform. For example, "Res101 + ICT (1,5)" means the ICT unit with r = 5 is inserted into ResNet-101 after Res1. As shown in Tab. 1, the improvement over the baseline of the ICT unit on Res3 and Res4 is similarly significant, while the improvement on Res1 and Res2 is relatively small. It is also interesting to see that the "Res101 + ICT (0,5)" method achieves slightly lower mIoU than the baseline (74.3 vs. 74.9). Our conjecture is that the consensus transforms require features with semantic-level information, yet the lower stages of the network tend to learn spatial-level information. For subsequent experiments, we fix the ICT unit behind Res3, which


Method Backbone mIoU (%)
DeepLab-v2 [9] ResNet-101 70.4
RefineNet [23] ResNet-101 73.6
FoveaNet [22] ResNet-101 74.1
SAC [52] ResNet-101 78.1
PSPNet [55] ResNet-101 78.4
BiSENet [46] ResNet-101 78.9
AAF [20] ResNet-101 79.1
DFN [47] ResNet-101 79.3
TKCN [43] ResNet-101 79.5
PSANet [56] ResNet-101 80.1
DenseASPP [45] DenseNet-161 80.6
GloRe [11] ResNet-101 80.9
CFNet (ours) ResNet-101 81.3
Table 6: Scene parsing results on the Cityscapes test set. All results are evaluated by the official evaluation server. Our method is trained only on the train-fine and val-fine sets, without using the extra "coarse" training set.

has 3.9% improvement over the baseline.

4.3.2 Different sizes of r in the ICT unit

Tab. 2 compares different sizes of r when the ICT unit is added after Res3. From the experimental results in Tab. 2, we can see that increasing r (from 3 to 5) improves performance; however, performance drops noticeably when r is increased further, which shows that choosing the right r is important for the instance-level consensus transform. For subsequent experiments, we configure the ICT unit with r = 5, which has a 3.9% improvement over the baseline (78.8% vs. 74.9%).

4.3.3 Different instantiations of CCT unit

Tab. 3 compares different types of GN in the CCT unit. (1) 1×1 Conv: we simply instantiate GN with a 1×1 convolution layer, which has a limited receptive field (relative to the input feature maps), for generating the parameters of the CCT. (2) Global Conv: we use one dilated convolution (dilation = 3) that takes the whole input feature map as its receptive field. (3) Large Kernel [29]: a combination of row-wise and column-wise (1×k and k×1) convolutions. (4) Stacked Conv: two stacked convolutions with dilation = 2. (5) Bidirectional LSTM: GN is instantiated with two BiLSTMs and a convolution layer. Interestingly, the CCT (BiLSTM) version leads to a 2.6% improvement, while the improvements of the CCT (1×1 Conv), CCT (Global Conv), CCT (Large Kernel), and CCT (Stacked Conv) versions are smaller, which verifies that modeling the global interaction to generate the parameters of the CCT is reasonable and essential.

Figure 4: Visualization result of the category consensus transform on Cityscapes validation set. From left to right are: input image, the parameter maps of CCT, prediction and ground truth.

4.3.4 CCT unit vs. Non-local unit

Tab. 4 compares our CCT unit with the non-local unit [42] (denoted as "+NL"). The non-local unit generates attention masks for each position by considering pair-wise feature correlations and computes the response at a position as a weighted sum of the features at all positions. The proposed CCT unit achieves better performance than "Res101 + NL". Here we give some possible explanations: (1) For the non-local unit, although the response at any position is a weighted sum of the features at all positions, the weight parameters are computed from features that have a limited receptive field and do not model complex global interactions. (2) In contrast, the proposed CCT unit employs row-wise and column-wise BiLSTMs to scan the whole feature map, which gives it a global receptive field. Therefore, the process of generating the parameters of the CCT for each position models the interaction between that location and all other positions.

4.3.5 Integrating the ICT and CCT units

Now, we conduct experiments with different settings in Tab. 5 to verify the effectiveness of the consensus transforms. As shown in Tab. 5, the consensus transform units improve the performance remarkably. Compared with the baseline method, employing the ICT unit yields 78.8% mean IoU, a 3.9% improvement. Meanwhile, employing the CCT unit individually brings a 2.6% improvement over the baseline. When we integrate the ICT and CCT units, the performance is further improved to 79.9%, a 5.0% improvement over the baseline (79.9 vs. 74.9). These experiments show that integrating both units brings a great benefit to scene parsing.

Method Backbone mIoU (%)
DeepLab-v2 [9] ResNet-101 45.7
RefineNet [23] ResNet-152 47.3
CCL [13] ResNet-101 51.6
EncNet [50] ResNet-101 51.7
TKCN [43] ResNet-101 51.8
FCN (baseline) ResNet-101 44.1
CFNet (ours) ResNet-101 52.4
Table 7: Accuracy comparison of our method against other methods on the PASCAL Context test set. Some competing methods use extra training data from COCO (~100K images), while we only use the PASCAL Context training data.

Method Backbone mIoU (%)
SegNet [2] VGG-16 55.6
CGNet [44] - 65.6
G-FRNet [1] VGG-16 68.0
BiSeNet [46] ResNet-18 68.7
DenseDecoder [5] ResNeXt-101 70.9
FCN (baseline) ResNet-101 67.5
CFNet (ours) ResNet-101 71.6
Table 8: Accuracy comparison of our method against other methods on CamVid test set.

4.3.6 Visualization and Analysis

In this subsection, we give some qualitative evidence for the proposed consensus transforms by visualizing the parameters of the CCT unit. According to Eqn. (10), each position has $H \times W$ parameters, and each transformed feature is a linear combination of all input features. As shown in Fig. 4, we choose one point (marked as +) in each image and visualize its parameters in the second column. We can observe that the category-level consensus transform focuses on aggregating features of the same semantic category. For example, for the point in the first row, its parameter map focuses on the positions that belong to the "car" category, which demonstrates that the proposed CCT unit is very helpful for learning category-level consensus features.
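As an illustration of how such a parameter map can be rendered, the hypothetical helper below assumes the weight layout of the CCTUnit sketch in Sec. 3.3 (weights[j, i] = w_ij for a single image); it simply reshapes the chosen point's H·W weights back onto the image grid.

```python
import matplotlib.pyplot as plt

def show_cct_weights(weights, h, w, point_yx):
    """Visualize the CCT parameters of one chosen point as a spatial heat map.
    `weights` is an (H*W) x (H*W) tensor with weights[j, i] = w_ij."""
    i = point_yx[0] * w + point_yx[1]               # flat index of the chosen point
    wmap = weights[:, i].reshape(h, w)              # weights of all source positions j for point i
    plt.imshow(wmap.detach().cpu(), cmap='jet')
    plt.scatter([point_yx[1]], [point_yx[0]], marker='+', c='white')
    plt.title('CCT parameter map for the selected point')
    plt.show()
```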

4.3.7 Comparison with state-of-the-arts

We further compare the proposed method with existing methods on the Cityscapes test set. Specifically, we only use finely annotated data to train our CFNet and submit our test results to the official evaluation server. As shown in Tab. 6, the proposed CFNet achieves 81.3% mean IoU, which outperforms PSANet and DenseASPP. It is worth noting that DenseASPP employs a more powerful backbone (DenseNet [19]) than ours.

4.4 Experiments on PASCAL Context

We conduct experiments on the PASCAL Context dataset to further verify the effectiveness of our approach. The number of training epochs is set to 30, and the other training and testing settings are the same as those on the Cityscapes dataset. Quantitative results on this dataset are shown in Tab. 7. The proposed CFNet achieves 52.4% mean IoU, bringing a substantial 8.3% improvement over the baseline (52.4% vs. 44.1%). Most existing works use multi-scale feature learning or employ

Method Backbone mIoU (%)
DeepLab-v2 [9] ResNet-101 26.9
DAG-RNN [35] VGG-16 30.4
RefineNet [23] ResNet-101 33.6
CCL [13] ResNet-101 35.7
FCN (baseline) ResNet-101 30.2
CFNet (ours) ResNet-101 36.6
Table 9: Accuracy comparison of our method against other methods on COCO Stuff test set.

context modules to improve performance. In contrast, we introduce the two consensus transform units to learn instance-level and category-level consensus features, and the proposed approach achieves better parsing results.

4.5 Experiments on CamVid

We conduct experiments on the CamVid dataset to further verify the effectiveness of our approach. To make our experimental setting comparable to previous works [21, 2, 1, 5], we downsample the images in the dataset by a factor of 2. The base learning rate is set to 0.025, and the number of training epochs is set to 100. Quantitative results on this dataset are shown in Tab. 8. The baseline method achieves 67.5%, while the proposed CFNet achieves 71.6%, outperforming the previous state-of-the-art method DenseDecoder [5], which introduces dense decoder shortcut connections to fuse semantic feature maps from all previous decoder levels.

4.6 Experiments on COCO Stuff

Finally, we run our method on the COCO Stuff dataset to demonstrate the generality of the proposed CFNet. The number of training epochs is set to 25. The experimental results are shown in Tab. 9. The baseline achieves 30.2% mean IoU, while our method achieves 36.6% mean IoU, outperforming the previous state-of-the-art method CCL [13].

5 Conclusion

In this work, we propose the Instance Consensus Transform and Category Consensus Transform units to learn instance-level and category-level consensus features, which are desired for scene parsing. Based on the proposed two units, we develop a novel framework called the Consensus Feature Network (CFNet). Ablation experiments demonstrate that the proposed approach effectively learns pixel-wise consensus features and obtains consistent parsing results. Furthermore, we show the advantages of CFNet with state-of-the-art performance on four benchmarks, including Cityscapes, PASCAL Context, CamVid, and COCO Stuff.

References