Boundary-Aware Feature Propagation for Scene Segmentation

08/31/2019 · by Henghui Ding, et al. · Nanyang Technological University

In this work, we address the challenging issue of scene segmentation. To increase the feature similarity of the same object while keeping the feature discrimination of different objects, we explore propagating information throughout the image under the control of objects' boundaries. To this end, we first propose to learn the boundary as an additional semantic class, enabling the network to be aware of the boundary layout. Then, we propose unidirectional acyclic graphs (UAGs) to model, in an efficient and effective way, the function of undirected cyclic graphs (UCGs), which structurize the image by building pixel-by-pixel connections. Furthermore, we propose a boundary-aware feature propagation (BFP) module to harvest and propagate the local features within their regions, isolated by the learned boundaries, in the UAG-structured image. The proposed BFP is capable of splitting the feature propagation into a set of semantic groups by building strong connections within the same segment region but weak connections between different segment regions. Without bells and whistles, our approach achieves new state-of-the-art segmentation performance on three challenging semantic segmentation datasets, i.e., PASCAL-Context, CamVid, and Cityscapes.


1 Introduction

Scene segmentation is a challenging and fundamental task that aims to assign semantic categories to every pixel of a scene image. The key to scene segmentation is parsing and segmenting a scene image into a range of semantically coherent regions. Therefore, it is critical to improve the feature similarity of the same object while keeping the feature discrimination of different objects. To this end, on the one hand, we explore propagating features throughout the image to share features and harvest context information, which is beneficial for improving feature similarity. On the other hand, in order to keep the discriminative power of features belonging to different objects, we propose to make use of boundary information to control the information flow during the propagation process. In short, we propose a boundary-aware feature propagation module that builds strong connections within the same segment and weak connections between different segments, as shown in Figure 1. This module requires two components: boundary detection and graph construction.

Figure 1: (Best viewed in color) The boundary-aware feature propagation module builds stronger connections within the same segment and weaker connections between different segments, which helps to enhance the similarity of features belonging to the same segment while keeping discrimination of features belonging to different segments.

First, boundary detection, which is an implicit task in scene segmentation, is important for meticulous dense prediction. However, in existing segmentation methods, boundary detection has not attracted due attention, since boundary pixels account for only a small portion of the entire image and contribute little to the overall performance numbers. In this work, we seek a way to achieve segmentation and boundary detection simultaneously, and further make use of the learned boundary information to enhance segmentation performance. To this end, we propose to generate the boundary labels of semantic objects from the existing object class labels given in segmentation datasets and to define the boundary as an additional class for learning and classification. By doing so, concise boundaries are well learned and inferred as one additional class, because the characteristics of pixels on the boundary differ from those of most pixels off the boundary, and the parsing of pixels in disputed areas (i.e., near the boundary) is enhanced. Moreover, taking the boundary as an additional class requires little change to the network but makes the network aware of the boundary layout, which can be further used to improve segmentation.

Second, a graphical model is needed to define the order rules for feature propagation. Convolutional methods [13, 79] are popular in scene segmentation, but they usually consume large computational resources when aggregating features over a large range of receptive fields. Moreover, convolution kernels cannot vary with the input resolution and thus cannot guarantee a holistic view of the overall image. Recently, DAG-RNN [66] proposed to use four directed acyclic graphs (DAGs) with different directions to model the function of undirected cyclic graphs (UCGs), which structurize the image by building pixel-by-pixel connections throughout the whole image. However, DAGs require many loops to scan the image pixel by pixel. They are therefore very slow even on low-resolution feature maps, which limits their application to the "dilated FCN" [13, 86, 82] and to high-resolution datasets like Cityscapes [15]. To address this issue, we propose a more efficient graphical model that achieves faster feature propagation. We find that each DAG adopted by [66] can be replaced by two Unidirectional Acyclic Graphs (UAGs), in which the pixels of the same row or column are processed in parallel with 1D convolutions. The proposed UAGs greatly speed up the feature propagation process. Moreover, different from the DAGs, which are extremely deep, the proposed UAGs are much shallower and thus alleviate the problem of vanishing propagation [57].

Finally, based on the UAG-structured image and the learned boundary information, we build a boundary-aware feature propagation (BFP) module. In the BFP, local features of the same segment are shared via unimpeded connections to exchange information, which achieves feature assimilation, while features of different segments are exchanged through controlled connections under the guidance of the learned boundaries. The proposed boundary-aware feature propagation (BFP) network has several advantages. First, since the proposed UAGs process pixels of the same row or column in parallel, the propagation is carried out at high speed, and the UAGs contain far fewer parameters than convolutional methods. Second, since we cast boundary detection as the classification of an additional semantic class, we avoid the many parameters and complex modules usually devoted to boundary detection. Third, guided by the boundary confidence, the local features are propagated in a more targeted way, enhancing the similarity of features belonging to the same segment while keeping the discrimination of features belonging to different segments.

The main contributions of this paper can be summarized as follows:

  • We show that the boundary can be learned as one of the semantic categories, which requires little change to the network but yields essential boundary information.

  • We propose unidirectional acyclic graphs (UAGs) to propagate information across high-resolution images at high speed.

  • We propose a boundary-aware feature propagation module to improve the similarity of local features belonging to the same segment region while keeping the discriminative power of features belonging to different segments.

  • We achieve new state-of-the-art performance consistently on PASCAL-Context, CamVid, and Cityscapes.

Figure 2: An overview of the proposed approach. We use ResNet-101 (CNN) with the dilated network strategy [13] as the backbone, and the proposed boundary-aware feature propagation (BFP) module is placed on top of the CNN. Loss 2 is supervised by the new ground truth of N+1 classes, with an additional boundary class generated from the original ground truth of N classes.

2 Related work

2.1 Scene Segmentation

Scene segmentation (or scene parsing, semantic segmentation) is one of the fundamental problems in computer vision and has drawn a lot of attention. Recently, thanks to the great success of Convolutional Neural Networks (CNNs) in computer vision [42, 68, 71, 52, 25, 72, 27, 80, 26], many CNN-based segmentation works have been proposed and have achieved great progress [29, 22, 81, 83, 84, 70, 60]. For example, Long et al. [54] introduce fully convolutional networks (FCN), in which the fully connected layers of standard CNNs are transformed into convolutional layers. Noh et al. [56] propose deconvolution networks to gradually upsample coarse features to high resolution. Chen et al. [13] propose to remove some pooling layers (or convolution strides) in CNNs and adopt dilated convolutions to retain more spatial information. Other works focus on lightweight network architectures [3, 46] and real-time segmentation [85, 58, 77, 59].

Context aggregation is a hot direction in scene segmentation. For example, Chen et al. [13] propose an atrous spatial pyramid pooling (ASPP) module to aggregate multi-scale context information. Yu et al. [79] employ multiple dilated convolution layers after the score maps to perform multi-scale context aggregation. Zhao et al. [86] introduce pyramid spatial pooling (PSP) to exploit context information from regions of different scales. Zhang et al. [82] encode semantic context into the network and stress class-dependent feature maps. He et al. [30] propose an adaptive pyramid context module to capture global-guided local affinity. Fu et al. [21] integrate local and global dependencies with both spatial and channel attention. Ding et al. [17] employ semantic correlation to infer shape-variant context.

Graphical models have a long history in scene segmentation. Early works construct graphical models with hand-crafted features [24, 51, 75, 69]. Markov Random Fields (MRF) [24, 43, 45] and Conditional Random Fields (CRF) [41, 61, 13, 53] build dependencies according to the similarities of neighboring pixels. Liang et al. [47] propose to construct a graph topology based on superpixel nodes and incorporate long-range context with a Graph LSTM. Shuai et al. [66] adopt undirected cyclic graphs (UCGs) to formulate an image and decompose the UCGs into directed acyclic graphs (DAGs). Byeon et al. [9] propose to divide an image into non-overlapping windows and employ a 2D LSTM to construct local and global dependencies. However, most graph-based methods are time-consuming and computationally expensive, as they require candidate pre-segments, superpixels, or many loops.

In this work, we propose unidirectional acyclic graphs (UAGs), based on which local features are quickly propagated in parallel. To construct strong dependencies within the same segment and weak dependencies between different segments, we exploit the learned boundary information to guide the feature propagation within the UAG-structured image.

2.2 Boundary Detection

Boundary detection is a fundamental component of many computer vision tasks and has a long history [1, 19, 40, 35]. For example, Lim et al. [48] propose sketch tokens (ST) and Dollár et al. [20] propose structured edges (SE) based on fast random forests, treating boundary detection as a local classification problem. Recently, the success of CNNs has greatly improved the performance of boundary detection [5, 6, 33, 64, 74]. Xie et al. [74] employ features from the middle layers of CNNs for boundary detection. Shen et al. [63] propose multi-stage fully convolutional networks for the boundary detection of electron microscopy images. These methods target the accuracy of boundary detection rather than generating semantic boundary information for high-level tasks. Boundary information can also be used to improve segmentation performance. For example, Bertasius et al. [6], Hayder et al. [28], Chen et al. [11] and Kokkinos [38] employ binary edges to improve segmentation performance. However, they all employ an additional network branch for edge detection, which requires more resources and treats segmentation and boundary detection as two detached tasks. Different from [6, 28, 11], our goal is not to detect clean binary boundaries but to infer a boundary confidence map that represents the probability distribution of the high-level boundary layout.

3 Approach

Due to the diverse styles and complex layouts of scene images, it is necessary to classify every pixel using holistic context information while protecting its distinctiveness from being overwhelmed by the global scene. In this respect, we propose a boundary-aware feature propagation module to arm the local features with a holistic view of contextual awareness while keeping the discriminative power of features of different objects. The overall architecture is shown in Figure 2. We use the dilated FCN (subsampled by 8) based on ResNet-101 [31] as the backbone. Loss 2 is supervised by the boundary-aware ground truth (N+1 classes), which is generated from the original ground truth (N classes).

3.1 Semantic Boundary Detection

Boundary delineation is favourable for meticulous scene parsing. However, because of the various semantic labels and the complex layout of objects in segmentation datasets, parsing pixels in the boundary area is always difficult and often results in confused predictions. In this work, instead of directly assigning semantic labels to pixels in the boundary area, we explore inferring the boundary layout first and then improving the segmentation performance with the learned boundary information. Many works have contributed to boundary detection [5, 6, 33, 64, 74], but most of them focus on edges that sketch the objects. Different from them, we focus only on the boundaries of the semantic objects that are predefined in segmentation datasets. We observe that boundaries exhibit dramatic changes in RGB and feature information, and that boundary labels are easy to generate from the existing ground truth. Consequently, we assume that the boundary can be viewed as an additional semantic category and learned simultaneously with the other existing object categories. As shown in Figure 2, we obtain a new ground truth (Loss 2, N+1 classes) from the original ground truth (Loss 1, N classes) and utilize the new ground truth to supervise the network to learn and infer the boundary layout.

Different from previous boundary detection works that target boundary delineation only or treat segmentation and boundary detection as two detached tasks, our proposed semantic boundary detection is embedded in semantic object parsing. Our boundary detection targets only the boundaries of the semantic objects predefined in the training data and generates concise boundary information through interaction with segmentation. These two tasks are combined into one, and they benefit each other: training them together lets the scene segmentation classes help suppress edges within objects that are not semantic object boundaries (e.g., the edges of eyes). Scene segmentation helps the boundary detection filter out noise and delineate well-directed boundaries, while boundary detection makes the scene segmentation aware of the important boundary layout information.
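
To make the joint supervision concrete, below is a minimal PyTorch sketch of how the two losses in Figure 2 could be wired together. It is only an illustration under our own assumptions: the module names (backbone, head_boundary, bfp, head_final), the use of the last softmax channel as the boundary confidence, and the ignore label of 255 are not the authors' released implementation.

```python
import torch.nn as nn

# Hedged sketch of the two-loss supervision suggested by Figure 2.
# `backbone`, `head_boundary`, `bfp`, and `head_final` are hypothetical modules.
ce = nn.CrossEntropyLoss(ignore_index=255)  # ignore label assumed (e.g., 255 on Cityscapes)

def training_step(backbone, head_boundary, bfp, head_final, image, gt_n, gt_n1):
    feats = backbone(image)                           # dilated-FCN features (output stride 8)
    logits_b = head_boundary(feats)                   # (B, N+1, h, w): supervised by loss 2
    boundary_conf = logits_b.softmax(dim=1)[:, -1:]   # assume boundary is the last class channel
    feats = bfp(feats, boundary_conf)                 # boundary-aware feature propagation
    logits = head_final(feats)                        # (B, N, h, w): supervised by loss 1
    # gt_n / gt_n1 are assumed to be label maps already resized to (h, w)
    return ce(logits, gt_n) + ce(logits_b, gt_n1)
```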

Figure 3: Each node of the DAGs has three different directed connections. Thus, the DAGs have to scan the image pixel by pixel and consume a lot of time over many loops. We decompose the four DAGs into six Unidirectional Acyclic Graphs (UAGs). Each UAG propagates information in a single direction, processing the pixels of each row (or column) in parallel. For example, UAG_s propagates in the south direction only, and UAG_se propagates in the east direction only (based on UAG_s).


Figure 4: (a) Image; (b) Original ground truth; (c) Newly generated ground truth: a boundary class generated from the original ground truth is added; (d) Boundary confidence map: the probability distribution of the boundary layout; (e) Propagation confidence map: the confidence distribution of the propagation.

3.2 Unidirectional Acyclic Graphs

Context aggregation is designed to gather a wide range of surrounding information, so it desires a holistic view of the overall image regardless of resolution. One popular way is to employ stacked convolutions or dilated convolutions to enlarge the receptive field [13, 79, 18], but this consumes large computational resources. The work of [66] proposes DAG-RNN to capture long-range context based on directed acyclic graphs (DAGs). As shown in Figure 3, pixels are locally connected to form an undirected cyclic graph (UCG) that builds propagation channels across the whole image. To overcome the loopy property of the UCG, it is decomposed into four DAGs with different directions (southeast, southwest, northeast, northwest). However, since each pixel of the DAGs has three different directed connections, feature propagation on DAG-structured images has to scan the image pixel by pixel and requires many loops. It is therefore very slow even on low-resolution feature maps, which limits its application to the "dilated FCN" [13, 86, 82] and to high-resolution datasets like Cityscapes [15] and CamVid [8]. To address this issue, we explore reducing the number of loops and propagating information in parallel. Herein, we propose Unidirectional Acyclic Graphs (UAGs), as shown in Figure 3, which process the pixels of each row in parallel and then the pixels of each column in parallel. Each DAG adopted by [66] can be replaced by two UAGs. For example, the southeast DAG is decomposed into UAG_s and UAG_se, where UAG_s is south-directed and processes the pixels of the same row in parallel, and UAG_se is east-directed (applied after UAG_s) and processes the pixels of the same column in parallel. As a result, the number of loops for each DAG is reduced from H×W to H+W, where H and W are the height and width of the feature maps. The proposed UAGs greatly speed up the feature propagation process, which is economical and desirable in practice, especially for applications that require high resolution and a wide field of view (e.g., self-driving vehicles). Moreover, due to the pixel-by-pixel operation, the recursive layers in DAGs are very deep and can be unrolled into thousands of layers, which causes the problem of vanishing propagation [57]. The proposed UAGs are much shallower than DAGs and thus alleviate this problem.
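
To make the row/column-parallel scanning concrete, below is a small toy sketch (our own illustration, not the authors' code) of a south-then-east pass over a feature map. Scalar mixing weights stand in for the learnable 1D convolutions of the actual UAGs; the point is only that the two passes need H + W sequential steps instead of the H×W steps of a pixel-by-pixel DAG scan.

```python
import torch
import torch.nn.functional as F

def south_then_east_scan(x, w_x=0.5, w_h=0.5):
    """Toy UAG-style scan: x is (B, C, H, W); returns a tensor of the same shape."""
    B, C, H, W = x.shape
    # South-directed pass: H sequential steps, each row's W pixels handled at once.
    h = torch.zeros(B, C, W, device=x.device)
    rows = []
    for i in range(H):
        h = F.relu(w_x * x[:, :, i, :] + w_h * h)
        rows.append(h)
    south = torch.stack(rows, dim=2)             # (B, C, H, W)
    # East-directed pass on the south output: W sequential steps, columns in parallel.
    h = torch.zeros(B, C, H, device=x.device)
    cols = []
    for j in range(W):
        h = F.relu(w_x * south[:, :, :, j] + w_h * h)
        cols.append(h)
    return torch.stack(cols, dim=3)              # H + W loops in total
```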

3.3 Boundary-Aware Feature Propagation

Unselective propagation, however, would assimilate the features, which over-smooths the representation and weakens the features' discrimination. To classify features of different objects and stuff in scene segmentation, it is beneficial to improve the feature similarity of the same object while keeping the feature discrimination of different objects. Therefore, we introduce boundary information into the feature propagation to control the information flow between different segments. As shown in Figure 1, with the learned boundary information, we build strong connections between pixels belonging to the same segment but weak connections between different segments. During propagation, more information is passed via the strong connections within the same segment and less information flows across different segments. In this way, pixels receive more information from other pixels of the same object and less information from pixels of other objects. Consequently, features of different objects keep their discrimination while features of the same object tend to become similar, which is desirable for segmentation. The detailed process of the proposed boundary-aware feature propagation is presented below.

Figure 5: As our UAGs are unidirectional and operate in parallel, we show the propagation process in 1D here for clarity. x_i represents the feature of the pixel at position i, h_i is the output (hidden state), and p_i is the propagation confidence.

As our UAGs are unidirectional and operate in parallel, we formulate the propagation process in 1D here for notational clarity; the extension to 2D/3D is straightforward. We denote the feature of the pixel at position i as x_i, and the corresponding output (hidden state) is denoted as h_i. The standard propagation process on our UAG-structured image is formulated as follows:

h_i = f(U ∗ x_i + W ∗ h_{i-1} + b),    (1)

where ∗ is the 1D convolution operation, U and W are the learnable parameters of the 1D convolutions, b is a learnable bias, and f is an element-wise nonlinear activation function (we use ReLU).

For the boundary-aware propagation, we first extract the boundary confidence map from the segmentation confidence maps, as shown in Figure 2. We denote the boundary confidence of pixel i as b_i, corresponding to x_i. Based on the boundary confidence map, we generate the propagation confidence map:

p_i = 1 - α·σ(γ·(b_i - β)),    (2)

where p_i is the propagation confidence that decides how much of the information of pixel i is passed to the next region, α and β are constants chosen by experience, σ is the sigmoid function used to enhance the boundary, and γ is a learnable parameter. With the propagation confidence, the propagation process can be reformulated as below:

h_i = f(U ∗ x_i + p_{i-1}·(W ∗ h_{i-1}) + b).    (3)

As shown in Figure 5, the propagation is controlled by the boundary and thus models boundary-aware context features for better parsing of different segments.

The UAGs that cover "two directions" (e.g., UAG_se) are also unidirectional and operate in parallel. For example, UAG_se is formulated as:

h_{i,j} = f(U ∗ x_{i,j} + p_{i-1,j}·(W_1 ∗ h_{i-1,j}) + p_{i,j-1}·(W_2 ∗ h_{i,j-1}) + b).    (4)

There are two hidden states, h_{i-1,j} and h_{i,j-1}, input to the current cell, where i and j index the horizontal and vertical axes, respectively. Finally, the hidden states at corresponding positions in the four UAGs (i.e., UAG_se, UAG_sw, UAG_ne, UAG_nw) are fused together to generate the final output, as shown in Figure 3.

One example of a boundary confidence map and a propagation confidence map is shown in Figure 4. We learn the boundary confidence map under the supervision of the newly generated ground truth with the additional boundary class. To control the feature propagation, the propagation confidence map is generated from the boundary confidence map. If pixel i lies in a boundary region, it has a higher boundary probability b_i and hence a smaller propagation probability p_i; the feature propagation is then suppressed and only a weak signal is passed to the next pixel. Otherwise, it propagates strongly and spreads its features to the next pixel.
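
As a concrete reference for Eqs. (1)-(3), the snippet below sketches one south-directed, boundary-gated UAG pass in PyTorch. It is a minimal sketch under our own reading of the equations (parameter names, kernel size, and the zero initial state are assumptions, not the authors' implementation); the full BFP would run such passes in several directions and fuse their hidden states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SouthGatedUAG(nn.Module):
    """Sketch of one south-directed UAG with boundary-aware gating (cf. Eq. (3))."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # 1D convolutions applied along the width of each row
        self.conv_x = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv_h = nn.Conv1d(channels, channels, kernel_size, padding=pad, bias=False)

    def forward(self, x, p):
        # x: (B, C, H, W) features; p: (B, 1, H, W) propagation confidence in [0, 1]
        B, C, H, W = x.shape
        h_prev = torch.zeros(B, C, W, device=x.device)   # assumed zero initial state
        p_prev = torch.ones(B, 1, W, device=x.device)
        outputs = []
        for i in range(H):                    # only H sequential steps, not H*W
            x_i = x[:, :, i, :]               # a whole row is processed in parallel
            h_i = F.relu(self.conv_x(x_i) + p_prev * self.conv_h(h_prev))
            outputs.append(h_i)
            h_prev, p_prev = h_i, p[:, :, i, :]
        return torch.stack(outputs, dim=2)    # (B, C, H, W)
```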

4 Experiments

4.1 Implementation Details

Our network is implemented with the public PyTorch framework. We use ResNet-101 [31] with the dilated network strategy [13] (subsampled by 8) as our backbone. In detail, the convolution strides for downsampling in the last two blocks are reset to 1, and the convolutions of the last two blocks are dilated with dilation rates of 2 and 4, respectively; the original classification layers (global average pooling and the fully connected layer) are discarded. The network is trained with mini-batches; the batch size is set to 12 for PASCAL-Context and 8 for Cityscapes and CamVid. Following DeepLab-v2 [13], we use the "poly" learning rate policy, where the current learning rate depends on the base learning rate and the iteration: lr = base_lr × (1 − iter/max_iter)^power. Momentum and weight decay are fixed to 0.9 and 0.0001, respectively. We adopt random horizontal flipping and random resizing between 0.5 and 2 to augment the training data.
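
For reference, the "poly" schedule can be written as a one-liner; power = 0.9 is the value used in DeepLab-v2, which this paper follows (the exact value is not restated here, so treat it as an assumption).

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    # "poly" policy: decay the base learning rate as training progresses
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```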

Most scene segmentation datasets do not provide boundary annotations, so we use the provided segmentation ground truth to generate the boundary-aware ground truth, as shown in Figure 4 (c). Since we adopt the dilated FCN as our backbone, the spatial size is downsampled by a factor of 8 in the encoding process. Thus, to avoid losing the boundary information in the feature maps of the smallest spatial size, pixels whose distance to a boundary is smaller than 9 pixels (i.e., a trimap of 18 pixels) are defined as boundary pixels, and their ground truth labels are set to the additional boundary class (class N+1, where N is the number of classes in the dataset). In our experiments, an over-wide boundary (e.g., a trimap of 50 pixels) squeezes small objects and weakens the role of the boundary in feature propagation. We evaluate our network with mean Intersection-over-Union (mIoU); for its mathematical definition, please refer to [54].
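
A possible way to generate the boundary-aware ground truth described above is sketched below (our own illustration; the authors' exact procedure, e.g., how the ignore label is handled, is not specified here). It marks every pixel whose Euclidean distance to a label transition is below the chosen radius as the extra boundary class.

```python
import numpy as np
from scipy import ndimage

def add_boundary_class(gt, num_classes, radius=9, ignore_label=255):
    """gt: (H, W) int array with classes 0..num_classes-1 (plus an optional ignore label).
    Returns a copy in which pixels within `radius` of a class transition are relabeled
    as `num_classes`, i.e. the additional (N+1)-th boundary class."""
    # Transition pixels: label differs from the neighbor below or to the right.
    trans = np.zeros(gt.shape, dtype=bool)
    trans[:-1, :] |= gt[:-1, :] != gt[1:, :]
    trans[:, :-1] |= gt[:, :-1] != gt[:, 1:]
    # Widen the transitions into a band via a Euclidean distance transform.
    dist = ndimage.distance_transform_edt(~trans)
    band = dist < radius
    new_gt = gt.copy()
    new_gt[band & (gt != ignore_label)] = num_classes
    return new_gt
```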

4.2 Efficiency Analysis

To evaluate the speed of the proposed UAGs, we report in Table 1 the inference time of UAGs and compare it with that of DAGs [66] for different input resolutions, based on the dilated FCN (subsampled by 8). The number of loops is also recorded. Different from DAGs, which have to scan the image pixel by pixel, the proposed UAGs process the pixels of each row/column in parallel and hence save a lot of loops. As shown in Table 1, the UAGs require far fewer loops than DAGs and are consequently much faster. Especially at high resolution (e.g., 960×720), DAGs are very slow and time-consuming. The speed of DAGs is strongly tied to the input resolution, which determines the number of loops, so DAGs are not suitable for high-resolution datasets (e.g., Cityscapes [15]) or for FCNs with the dilated network strategy [13]. Besides the inference time, training DAGs on the dilated FCN requires roughly one hundred times more GPU hours than the proposed UAGs, which also shows the high efficiency of our approach. To quantitatively compare the segmentation performance of DAGs and the proposed UAGs, we evaluate them on VGG-16 [67] with the encoder-decoder strategy, exactly as in [66]. DAGs and UAGs achieve almost the same results on PASCAL-Context (UAGs 43.0% vs. DAGs 42.6%). This shows that the proposed UAGs realize the same function as DAGs but at a much faster speed.

Methods Input Resolution # Loops Time (s)
FCN none 0.35
DAGs 10800 17.92
UAGs 300 0.47
FCN none 0.42
DAGs 43200 56.97
UAGs 600 0.76
Table 1: Inference speed comparison of FCN (baseline), DAGs, and UAGs on dilated ResNet-101 with different resolution inputs.

4.3 Ablation Studies

We show the detailed results of the ablation studies of the proposed approach in Table 2. The proposed UAGs harvest local features and propagate them throughout the whole image to achieve holistic context aggregation, which greatly improves the segmentation performance over the baseline (dilated FCN). Then, based on the UAGs, we learn the boundary information and inject it into the propagation process to control the information flow between different regions. With the boundary information, the UAGs build stronger connections between pixels within the same segment and weaker connections between different segments. Thus, features of the same segment become more similar while features of different segments remain discriminative.

We also visualize some examples of the inferred boundary confidence map in Figure 6. As shown in Figure 6, the inferred boundary map mainly involves boundaries between the semantic segments predefined in the datasets rather than other edge information. Therefore, it contains the desired boundary layout of semantic objects and can be used to control the feature propagation throughout the image. The results in Table 2 show the effectiveness of the boundary-aware feature propagation (BFP) network. Following [37, 11], we evaluate the performance of BFP near boundaries, as shown in Figure 7, computing the mIoU for the regions within different bands around the boundaries.

DT-EdgeNet [11] is the work most related to BFP. However, BFP learns the boundary as one of the semantic classes and is trained end-to-end, while DT-EdgeNet learns edges with an additional EdgeNet and requires a two-step training process for DeepLab and EdgeNet. As shown in Figure 6, our learned boundaries respond less to object-interior edges than DT-EdgeNet. Moreover, BFP performs feature propagation, which is a kind of contextual feature modeling, while the DT is used to refine the segmentation scores. Using the DT to filter the segmentation scores of BFP brings a 0.7% performance gain, which shows that the DT and BFP are complementary.

Figure 6: Qualitative examples of the inferred boundary map (columns: image, boundary confidence, ground truth).
Figure 7: Segmentation performance within band (trimap) around boundaries.
Methods Backbone UAGs Boundary MS mIoU
FCN ResNet-50 41.2
BFP ResNet-50 49.8
BFP ResNet-101 50.8
BFP ResNet-101 52.8
BFP ResNet-101 53.6
Table 2: Ablation Study of Boundary-aware Feature Propagation (BFP) Network on PASCAL-Context. Baseline is dilated FCN and MS means multi-scale testing.

4.4 Comparison with the State-of-the-Art Works

PASCAL-Context [55] provides pixel-wise segmentation annotations for whole scene images. There are 4998 training images and 5105 testing images in PASCAL-Context. Following [55], we use the most common 59 classes for evaluation. Testing accuracies on PASCAL-Context are reported in Table 3, which shows that the proposed BFP outperforms the state-of-the-art works by a large margin.

Methods mIoU
O2P [10] 18.1
FCN-8s [62] 39.1
BoxSup [16] 40.5
HO-CRF [2] 41.3
PixelNet [4] 41.4
DAG-RNN [66] 43.7
EFCN [65] 45.0
DeepLab-v2+CRF [13] 45.7
RefineNet [50] 47.3
MSCI [49] 50.3
CCL+GMA [18] 51.6
EncNet [82] 51.7
BFP (ours) 53.6
Table 3: Testing accuracies on PASCAL-Context.
Methods mIoU
DeconvNet [56] 48.9
SegNet [3] 50.2
FCN-8s [54] 52.0
DeepLab [12] 54.7
DilatedNet [79] 65.3
Dilation+FSO [44] 66.1
G-FRNet [34] 68.0
Dense-Decoder [7] 70.9
BFP (ours) 74.1
Table 4: Testing accuracies on CamVid.
Methods | road | sidewalk | building | wall | fence | pole | traffic light | traffic sign | vegetation | terrain | sky | person | rider | car | truck | bus | train | motorcycle | bicycle | mIoU
FCN [62] 97.4 78.4 89.2 34.9 44.2 47.4 60.1 65.0 91.4 69.3 93.9 77.1 51.4 92.6 35.3 48.6 46.5 51.6 66.8 65.3
DPN [53] 97.5 78.5 89.5 40.4 45.9 51.1 56.8 65.3 91.5 69.4 94.5 77.5 54.2 92.5 44.5 53.4 49.9 52.1 64.8 66.8
LRR [23] 97.7 79.9 90.7 44.4 48.6 58.6 68.2 72.0 92.5 69.3 94.7 81.6 60.0 94.0 43.6 56.8 47.2 54.8 69.7 69.7
Deeplabv2 [13] 97.9 81.3 90.3 48.8 47.4 49.6 57.9 67.3 91.9 69.4 94.2 79.8 59.8 93.7 56.5 67.5 57.5 57.7 68.8 70.4
RefineNet [50] 98.2 83.3 91.3 47.8 50.4 56.1 66.9 71.3 92.3 70.3 94.8 80.9 63.3 94.5 64.6 76.1 64.3 62.2 70.0 73.6
DepthSet [39] - - - - - - - - - - - - - - - - - - - 78.2
ResNet-38 [73] 98.5 85.7 93.1 55.5 59.1 67.1 74.8 78.7 93.7 72.6 95.5 86.6 69.2 95.7 64.5 78.8 74.1 69.0 76.7 78.4
PSPNet [86] 98.6 86.2 92.9 50.8 58.8 64.0 75.6 79.0 93.4 72.3 95.4 86.5 71.3 95.9 68.2 79.5 73.8 69.5 77.2 78.4
AAF [36] 98.5 85.6 93.0 53.8 59.0 65.9 75.0 78.4 93.7 72.4 95.6 86.4 70.5 95.9 73.9 82.7 76.9 68.7 76.4 79.1
DFN [78] - - - - - - - - - - - - - - - - - - - 79.3
PSANet [87] - - - - - - - - - - - - - - - - - - - 80.1
DenseASPP [76] 98.7 87.1 93.4 60.7 62.7 65.6 74.6 78.5 93.6 72.5 95.4 86.2 71.9 96.0 78.0 90.3 80.7 69.7 76.8 80.6
BFP (ours) 98.7 87.0 93.5 59.8 63.4 68.9 76.8 80.9 93.7 72.8 95.5 87.0 72.1 96.0 77.6 89.0 86.9 69.2 77.6 81.4
Table 5: Category-wise performance comparison on the Cityscapes test set. Note that DenseASPP [76] uses a stronger backbone, DenseNet-161 [32], than the ResNet-101 [31] we adopt.

CamVid [8] is a road scene image segmentation dataset which provides dense pixel-wise annotations for 11 semantic categories. There are 367 training images, 101 validation images and 233 testing images. The testing results are shown in Table 4. It shows again that the proposed BFP outperforms previous state-of-the-arts by a large margin.

Cityscapes [15] is a recent street scene dataset which contains 5000 high-resolution (1024×2048) images with pixel-level fine annotations. There are 2975 training images, 500 validation images and 1525 testing images. 19 classes (e.g., roads, bicycles and cars) are considered for evaluation on the testing server provided by the organizers. The category-wise results are shown in Table 5. Our BFP is trained only on the fine annotations, while [14] also uses coarse annotations for training. Some segmentation examples on Cityscapes are shown in Figure 8.

5 Conclusion

In this work, we address the challenging issue of scene segmentation. In order to improve the feature similarity of the same segment while keeping the feature discrimination of different segments, we explore propagating features throughout the image under the control of inferred boundaries. To this end, we first propose to learn the boundary as an additional semantic class to enable the network to be aware of the boundary layout. Then, in order to structurize the image and define the order rules for feature propagation, we propose unidirectional acyclic graphs (UAGs) to model the function of undirected cyclic graphs (UCGs) in a much more efficient way than DAGs. Based on the proposed UAGs, holistic context is aggregated by harvesting and propagating the local features throughout the whole image efficiently. Finally, we propose a boundary-aware feature propagation (BFP) network to detect and utilize the boundary information for controlling the feature propagation within the UAG-structured image. The proposed BFP is capable of improving the similarity of local features belonging to the same segment region while keeping the discriminative power of features belonging to different segments. We evaluate the proposed boundary-aware feature propagation network on three challenging semantic segmentation datasets, PASCAL-Context, CamVid, and Cityscapes, and the results show that the proposed BFP consistently achieves new state-of-the-art segmentation performance.

Figure 8: Qualitative segmentation examples on Cityscapes (columns: image, our result, ground truth).

Acknowledgement

This research is supported by Singapore Ministry of Education Academic Research Fund AcRF Tier 3 Grant no: MOE2017-T3-1-001, and by BeingTogether Centre, a collaboration between Nanyang Technological University and University of North Carolina at Chapel Hill. BeingTogether Centre is supported by National Research Foundation, Prime Minister Office, Singapore under its International Research Centres in Singapore Funding Initiative.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: §2.2.
  • [2] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr (2016) Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision, Cited by: Table 3.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis & Machine Intelligence. Cited by: §2.1, Table 4.
  • [4] A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan (2016) PixelNet: towards a general pixel-level architecture. arXiv:1609.06694. Cited by: Table 3.
  • [5] G. Bertasius, J. Shi, and L. Torresani (2015) DeepEdge: a multi-scale bifurcated deep network for top-down contour detection. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 4380–4389. Cited by: §2.2, §3.1.
  • [6] G. Bertasius, J. Shi, and L. Torresani (2015) High-for-low and low-for-high: efficient boundary detection from deep object features and its applications to high-level vision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 504–512. Cited by: §2.2, §3.1.
  • [7] P. Bilinski and V. Prisacariu (2018) Dense decoder shortcut connections for single-pass semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 4.
  • [8] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla (2008) Segmentation and recognition using structure from motion point clouds. In European conference on computer vision, Cited by: §3.2, §4.4.
  • [9] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki (2015) Scene labeling with LSTM recurrent neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [10] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu (2012) Semantic segmentation with second-order pooling. Computer Vision–ECCV 2012. Cited by: Table 3.
  • [11] L. Chen, J. T. Barron, G. Papandreou, K. Murphy, and A. L. Yuille (2016) Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 4545–4554. Cited by: §2.2, §4.3, §4.3.
  • [12] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, Cited by: Table 4.
  • [13] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv:1606.00915. Cited by: Figure 2, §1, §2.1, §2.1, §2.1, §3.2, §4.1, §4.2, Table 3, Table 5.
  • [14] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv:1802.02611. Cited by: §4.4.
  • [15] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.2, §4.2, §4.4.
  • [16] J. Dai, K. He, and J. Sun (2015) Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: Table 3.
  • [17] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang (2019-06) Semantic correlation promoted shape-variant context for segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8885–8894. Cited by: §2.1.
  • [18] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang (2018-06) Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2393–2402. Cited by: §3.2, Table 3.
  • [19] P. Dollár and C. L. Zitnick (2013) Structured forests for fast edge detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1841–1848. Cited by: §2.2.
  • [20] P. Dollár and C. L. Zitnick (2015) Fast edge detection using structured forests. IEEE transactions on pattern analysis and machine intelligence 37 (8), pp. 1558–1570. Cited by: §2.2.
  • [21] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 3146–3154. Cited by: §2.1.
  • [22] J. Fu, J. Liu, Y. Wang, Y. Li, Y. Bao, J. Tang, and H. Lu (2019) Adaptive context network for scene parsing. In Proceedings of the IEEE international conference on computer vision, Cited by: §2.1.
  • [23] G. Ghiasi and C. C. Fowlkes (2016) Laplacian pyramid reconstruction and refinement for semantic segmentation. In European Conference on Computer Vision, Cited by: Table 5.
  • [24] S. Gould, R. Fulton, and D. Koller (2009) Decomposing a scene into geometric and semantically consistent regions. In International Conference on Computer Vision, pp. 1–8. Cited by: §2.1.
  • [25] J. Gu, S. Joty, J. Cai, and G. Wang (2018) Unpaired image captioning by language pivoting. In ECCV, Cited by: §2.1.
  • [26] J. Gu, S. Joty, J. Cai, H. Zhao, X. Yang, and G. Wang (2019) Unpaired image captioning via scene graph alignments. In ICCV, Cited by: §2.1.
  • [27] J. Gu, H. Zhao, Z. Lin, S. Li, J. Cai, and M. Ling (2019) Scene graph generation with external knowledge and image reconstruction. In CVPR, Cited by: §2.1.
  • [28] Z. Hayder, X. He, and M. Salzmann (2017) Boundary-aware instance segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 5696–5704. Cited by: §2.2.
  • [29] J. He, Z. Deng, and Y. Qiao (2019) Dynamic multi-scale filters for semantic segmentation. In Proceedings of the International Conference on Computer Vision, Cited by: §2.1.
  • [30] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao (2019-06) Adaptive pyramid context network for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3, §4.1, Table 5.
  • [32] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5.
  • [33] J. Hwang and T. Liu (2015) Pixel-wise deep learning for contour detection. arXiv:1504.01989. Cited by: §2.2, §3.1.
  • [34] M. A. Islam, M. Rochan, N. D. Bruce, and Y. Wang (2017) Gated feedback refinement network for dense image labeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 4.
  • [35] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson (2014) Crisp boundary detection using pointwise mutual information. In European Conference on Computer Vision, pp. 799–814. Cited by: §2.2.
  • [36] T. Ke, J. Hwang, Z. Liu, and S. X. Yu (2018-09) Adaptive affinity fields for semantic segmentation. In The European Conference on Computer Vision (ECCV), Cited by: Table 5.
  • [37] P. Kohli, P. H. Torr, et al. (2009) Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision 82 (3), pp. 302–324. Cited by: §4.3.
  • [38] I. Kokkinos (2015) Pushing the boundaries of boundary detection using deep learning. arXiv:1511.07386. Cited by: §2.2.
  • [39] S. Kong and C. C. Fowlkes (2018) Recurrent scene parsing with perspective understanding in the loop. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 5.
  • [40] S. Konishi, A. L. Yuille, J. M. Coughlan, and S. C. Zhu (2003) Statistical edge detection: learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (1), pp. 57–74. Cited by: §2.2.
  • [41] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, Cited by: §2.1.
  • [42] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, Cited by: §2.1.
  • [43] M. P. Kumar and D. Koller (2010) Efficiently selecting regions for scene understanding. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3217–3224. Cited by: §2.1.
  • [44] A. Kundu, V. Vineet, and V. Koltun (2016) Feature space optimization for semantic video segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 4.
  • [45] D. Larlus and F. Jurie (2008) Combining appearance models and markov random fields for category level object segmentation. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–7. Cited by: §2.1.
  • [46] X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 3193–3202. Cited by: §2.1.
  • [47] X. Liang, X. Shen, D. Xiang, J. Feng, L. Lin, and S. Yan (2016) Semantic object parsing with local-global long short-term memory. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [48] J. J. Lim, C. L. Zitnick, and P. Dollár (2013) Sketch tokens: a learned mid-level representation for contour and object detection. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 3158–3165. Cited by: §2.2.
  • [49] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang (2018) Multi-scale context intertwining for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 3.
  • [50] G. Lin, A. Milan, C. Shen, and I. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 3, Table 5.
  • [51] C. Liu, J. Yuen, and A. Torralba (2011) Sift flow: dense correspondence across scenes and its applications. IEEE transactions on pattern analysis and machine intelligence 33 (5). Cited by: §2.1.
  • [52] J. Liu, H. Ding, A. Shahroudy, L. Duan, X. Jiang, G. Wang, and A. K. Chichung (2019) Feature boosting network for 3D pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.1.
  • [53] Z. Liu, X. Li, P. Luo, C. Loy, and X. Tang (2015) Semantic image segmentation via deep parsing network. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1, Table 5.
  • [54] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §4.1, Table 4.
  • [55] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.4.
  • [56] H. Noh, S. Hong, and B. Han (2015) Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §2.1, Table 4.
  • [57] R. Pascanu, T. Mikolov, and Y. Bengio (2013) On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. Cited by: §1, §3.2.
  • [58] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv:1606.02147. Cited by: §2.1.
  • [59] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo (2018) Erfnet: efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems 19 (1), pp. 263–272. Cited by: §2.1.
  • [60] S. Rota Bulò, L. Porzi, and P. Kontschieder (2018) In-place activated batchnorm for memory-optimized training of dnns. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 5639–5647. Cited by: §2.1.
  • [61] C. Russell, P. Kohli, P. H. Torr, et al. (2009) Associative hierarchical crfs for object class image segmentation. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 739–746. Cited by: §2.1.
  • [62] E. Shelhamer, J. Long, and T. Darrell (2016) Fully convolutional networks for semantic segmentation. IEEE transactions on pattern analysis and machine intelligence. Cited by: Table 3, Table 5.
  • [63] W. Shen, B. Wang, Y. Jiang, Y. Wang, and A. Yuille (2017) Multi-stage multi-recursive-input fully convolutional networks for neuronal boundary detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2391–2400. Cited by: §2.2.
  • [64] W. Shen, X. Wang, Y. Wang, X. Bai, and Z. Zhang (2015) Deepcontour: a deep convolutional feature learned by positive-sharing loss for contour detection. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 3982–3991. Cited by: §2.2, §3.1.
  • [65] B. Shuai, H. Ding, T. Liu, G. Wang, and X. Jiang (2019) Toward achieving robust low-level and high-level scene parsing. IEEE Transactions on Image Processing 28 (3), pp. 1378–1390. Cited by: Table 3.
  • [66] B. Shuai, Z. Zuo, B. Wang, and G. Wang (2018) Scene segmentation with dag-recurrent neural networks. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1480–1493. Cited by: §1, §2.1, §3.2, §4.2, Table 3.
  • [67] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Cited by: §4.2.
  • [68] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [69] J. Tighe and S. Lazebnik (2013) Finding things: image parsing with regions and per-exemplar detectors. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [70] T. Wang, Y. Piao, X. Li, L. Zhang, and H. Lu (2019) Deep learning for light field saliency detection. In Proceedings of the International Conference on Computer Vision, Cited by: §2.1.
  • [71] T. Wang, L. Zhang, H. Lu, C. Sun, and J. Qi (2016) Kernelized subspace ranking for saliency detection. In ECCV, pp. 450–466. Cited by: §2.1.
  • [72] T. Wang, L. Zhang, S. Wang, H. Lu, G. Yang, X. Ruan, and A. Borji (2018) Detect globally, refine locally: a novel approach to saliency detection. In The IEEE Conference on Computer Vision and Pattern Recognition, pp. 3127–3135. Cited by: §2.1.
  • [73] Z. Wu, C. Shen, and A. Van Den Hengel (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recognition. Cited by: Table 5.
  • [74] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pp. 1395–1403. Cited by: §2.2, §3.1.
  • [75] J. Yang, B. Price, S. Cohen, and M. Yang (2014) Context driven scene parsing with attention to rare classes. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [76] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang (2018) DenseASPP for semantic segmentation in street scenes. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 5.
  • [77] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In The European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [78] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. arXiv:1804.09337. Cited by: Table 5.
  • [79] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122. Cited by: §1, §2.1, §3.2, Table 4.
  • [80] Y. Zeng, H. Lu, L. Zhang, M. Feng, and A. Borji (2018) Learning to promote saliency detectors. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [81] Y. Zeng, Y. Zhuge, H. Lu, and L. Zhang (2019) Joint learning of saliency detection and weakly supervised semantic segmentation. In Proceedings of the International Conference on Computer Vision, Cited by: §2.1.
  • [82] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, §3.2, Table 3.
  • [83] L. Zhang, J. Dai, H. Lu, Y. He, and G. Wang (2018) A bi-directional message passing model for salient object detection. In The IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
  • [84] L. Zhang, Z. Lin, J. Zhang, H. Lu, and Y. He (2019) Fast video object segmentation via dynamic targeting network. In Proceedings of the International Conference on Computer Vision, Cited by: §2.1.
  • [85] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018-09) ICNet for real-time semantic segmentation on high-resolution images. In The European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • [86] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, §3.2, Table 5.
  • [87] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia (2018) PSANet: point-wise spatial attention network for scene parsing. In ECCV, Cited by: Table 5.