Bi-direction Context Propagation Network for Real-time Semantic Segmentation

05/22/2020 · Shijie Hao et al. · Hefei University of Technology

Spatial details and context correlations are two types of critical information for semantic segmentation. Generally, spatial details mostly exist in the shallow layers, whereas context correlations mostly exist in the deep layers. Aiming to use both, most current methods choose to forward-transmit the spatial details to the deep layers. We find that this spatial-details transmission is computationally expensive and substantially lowers the model's execution speed. To address this problem, we propose a new Bi-direction Contexts Propagation Network (BCPNet), which performs semantic segmentation in real time. Different from previous methods, our BCPNet effectively back-propagates the context information to the shallow layers, which is computationally much more modest. Extensive experiments validate that our BCPNet achieves a good balance between accuracy and speed. For accuracy, our BCPNet achieves 68.4 % mIoU on the Cityscapes test set and 67.8 % mIoU on the CamVid test set. For speed, our BCPNet reaches up to 585.9 FPS, i.e., 1.7 ms runtime per image.


1. Introduction

Semantic segmentation is one of the most challenging tasks in computer vision, which aims to partition an image into several non-overlapping regions according to the category of each pixel. Due to its unique role in visual processing, many real-world applications rely on this technology, such as self-driving vehicles (Siam et al., 2017; Treml et al., 2016), medical image analysis (Guo et al., 2015, 2016), and image editing (Tsai et al., 2016). Some of these applications require not only high segmentation accuracy but also fast execution speed, which makes the task of semantic segmentation even more challenging. However, the balance between segmentation accuracy and speed achieved in recent years is still far from satisfactory.

For a semantic segmentation algorithm, there are two key points: 1) maintaining spatial details and 2) aggregating context information. In general, spatial details mostly exist in the shallow layers, whereas context information mostly exists in the deep layers. Most current methods realize the two key points by 1) keeping high-resolution feature maps in the network pipeline to maintain spatial details and 2) using the dilated convolution (Yu and Koltun, 2015) to aggregate context information, e.g., (Liu et al., 2015; Zhang et al., 2018; He et al., 2019; Chen et al., 2017; Zhao et al., 2017). These methods can be seen as performing an information-transmission task: spatial details are transmitted from the shallow layers to the deep layers, as shown in Fig.1 (a). A large number of works have validated that appropriately combining low-level spatial details with high-level context information makes semantic segmentation more accurate. However, keeping relatively high-resolution feature maps in deep neural networks tends to bring about high computational costs and substantially lowers the model's execution speed. To quantify the influence of the feature resolution maintained in the network pipeline, we conduct a simple experiment on the well-known ResNet (He et al., 2016) framework. For a fair comparison, the last fully connected layer of ResNet is removed. As shown in Fig.2, when the maintained feature resolution grows, the model's FLOPs increase substantially and, correspondingly, its execution speed drops sharply.
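
The kind of measurement behind Fig.2 can be sketched as follows: a torchvision ResNet-50 with its fully connected layer removed is timed at two different maintained feature resolutions, using dilated convolutions to keep the larger one. The input size, warm-up and run counts, and the helper name avg_forward_time are illustrative assumptions rather than the exact protocol of our experiment.

```python
import time
import torch
import torchvision

def avg_forward_time(replace_stride_with_dilation, input_size=(1, 3, 256, 512), runs=10):
    """Average forward-pass time of a ResNet-50 backbone (fc layer removed)."""
    model = torchvision.models.resnet50(
        replace_stride_with_dilation=replace_stride_with_dilation)
    model.fc = torch.nn.Identity()                 # drop the fully connected layer
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(5):                         # warm-up iterations
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.time() - start) / runs

# 1/32 maintaining rate (plain ResNet-50) vs. 1/8 (last two stages dilated).
print("1/32 resolution:", avg_forward_time([False, False, False]), "s per image")
print("1/8  resolution:", avg_forward_time([False, True, True]), "s per image")
```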

Figure 1. Illustrations of the previous spatial-details transmission and our context backpropagation mechanism.

Figure 2. Influence of the feature resolution maintained in the network pipeline on FLOPs and speed. “maintaining rate” denotes the resolution ratio between the final maintained feature map and the input image. “res-50” denotes the ResNet-50 network, and “res-101” denotes the ResNet-101 network.

Considering the poor balance between accuracy and speed achieved by maintaining spatial details, we propose a new Bi-direction Contexts Propagation Network (BCPNet). Different from previous methods, which transmit spatial details, our BCPNet is designed to effectively back-propagate the context information within the network pipeline, as shown in Fig.1 (b). By back-propagating the context information, which is aggregated from the deep layers, to the shallow layers, the shallow-layer features become sufficiently context-aware, as shown in Fig.4. Therefore, the shallow-layer features can be directly used for the final prediction. Changing both the type of transmitted information and the transmission direction eliminates the need for keeping high-resolution feature maps in the network pipeline. This allows us to design a network with fewer FLOPs and a faster execution speed, while retaining relatively high segmentation accuracy. The overall architecture of BCPNet, built on this context backpropagation mechanism, is shown in Fig.3. Its key component is the Bi-direction Context Propagation (BCP) module, shown in Fig.3 (b), which uses top-down and bottom-up paths to back-propagate the context information to the shallow layers. Our BCPNet is light-weight, containing only about 0.61 M parameters in total. Extensive experiments validate that our BCPNet achieves a good balance between segmentation accuracy and execution speed. For example, as for execution speed, our BCPNet achieves up to 585.9 FPS, and even for larger input images it still achieves 55 FPS. As for segmentation accuracy, our BCPNet achieves 68.4 % mIoU on the Cityscapes (Cordts et al., 2016) test set and 67.8 % mIoU on the CamVid (Brostow et al., 2009) test set.

The contributions of this paper can be summarized as follows:

  • First, compared to previous methods, which forward-transmit spatial details within the network pipeline, we introduce a new context backpropagation mechanism. This helps improve the balance between accuracy and speed.

  • Second, based on the proposed context backpropagation mechanism, we propose a new BCPNet that realizes context backpropagation effectively by using top-down and bottom-up paths.

  • Third, our BCPNet achieves a state-of-the-art balance between accuracy and speed.

Figure 3. Overview of our BCPNet. “w-sum” represents the weighted sum. “s-conv” represents the separable convolution.

The remainder of this paper is organized as follows. First, we introduce the related work in Section 2. Then, we provide the details of our method in Section 3, followed by our experiments in Section 4. Finally, in Section 5, we conclude the paper.

2. Related work

In this section, we review the methods related to ours. First, we review the methods based on transmitting spatial details within the network pipeline. Then, we review the methods focusing on improving execution speed.

2.1. Spatial details transmission

DilatedNet (Yu and Koltun, 2015) is a pioneering work based on spatial-details transmission. Aiming to forward-transmit spatial details in the network pipeline (Fig.1 (a)), (Yu and Koltun, 2015) introduced the dilated convolution. The dilated convolution enlarges the receptive field of convolutional operations by inserting holes into the convolution kernels. Therefore, some downsampling operations, such as pooling, can be removed from the network, which avoids the decrease of feature resolutions and the loss of spatial details. Following (Yu and Koltun, 2015), a large number of variants have been proposed, such as (Zhang et al., 2018; He et al., 2019; Chen et al., 2017; Zhao et al., 2017). In particular, most of these methods focus on further improving the context representations of the last convolution layer. For example, (Zhao et al., 2017) introduced the Pyramid Pooling Module (PPM), in which multiple parallel average pooling branches are applied to the last convolution layer to aggregate more context correlations. (Chen et al., 2017) extended PPM to the Atrous Spatial Pyramid Pooling (ASPP) module by further introducing the dilated convolution (Yu and Koltun, 2015). By using dictionary learning to learn a global context embedding, (Zhang et al., 2018) introduced EncNet. Recently, He et al. (He et al., 2019) proposed the Adaptive Pyramid Context Network (APCNet), in which the Adaptive Context Module (ACM) is used to aggregate pyramid context information.
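
As a small illustration of this idea, the snippet below compares a standard 3x3 convolution with its dilated counterpart in PyTorch; the channel count and dilation rate are arbitrary examples, not the settings of the cited methods.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)

# A standard 3x3 convolution covers a 3x3 window.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# The dilated ("atrous") variant inserts holes between kernel taps: with
# dilation=2 the same 9 weights cover a 5x5 window, so the receptive field
# grows while the parameter count and output resolution stay unchanged.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape, dilated(x).shape)                 # both: (1, 64, 128, 128)
print(sum(p.numel() for p in conv.parameters()),
      sum(p.numel() for p in dilated.parameters()))    # identical parameter counts
```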

Discussion. However, aiming to use both spatial details and context correlations, the above methods all choose to transmit spatial details from the shallow layers to the deep layers, as shown in Fig.1 (a). This spatial-details transmission within the network pipeline is computationally expensive and substantially lowers the model's execution speed (as shown in Fig.2). Although these methods improve segmentation accuracy, the balance between accuracy and speed is still far from satisfactory. Take DeepLab (Chen et al., 2017) as an example: for a single input image, it involves 457.8 G FLOPs in total, and its speed only reaches 0.25 FPS, i.e., 4000 ms runtime per image (Table 1 and Table 2).

2.2. Execution speed improvement

Some real-world applications, such as self-driving vehicles, require not only accurate segmentation but also fast execution speed. Therefore, it is critical to achieve a better balance between accuracy and speed. For example, (Badrinarayanan et al., 2017) proposed SegNet by scaling the network down to a small one. SegNet contains 29.5 M parameters and achieves 14.7 FPS on input images. Moreover, it achieves 56.1 % mIoU on the Cityscapes test set. (Paszke et al., 2016) proposed ENet by employing a tight framework. ENet contains 0.4 M parameters and achieves 135.4 FPS. Moreover, ENet achieves 58.3 % mIoU on the Cityscapes test set. (Zhao et al., 2018a) proposed ICNet, which uses the strategy of multi-scale inputs and cascaded frameworks to construct a light-weight network. ICNet contains 26.5 M parameters and achieves 30.3 FPS on input images. Moreover, ICNet achieves 67.1 % mIoU on the CamVid (Brostow et al., 2009) test set. (Yu et al., 2018) proposed BiSeNet by independently constructing a spatial path and a context path. The context path is used to extract high-level context information, and the spatial path is used to maintain low-level spatial details. BiSeNet achieves 68.4 % mIoU at 72.3 FPS on images of the Cityscapes test set. Recently, (Li et al., 2019) proposed DFANet based on the strategy of feature reuse. DFANet achieves 71.3 % mIoU at 100 FPS on input images of the Cityscapes test set. By employing an asymmetric convolution structure and dense connections, (Lo et al., 2019) proposed EDANet, which only includes 0.68 M parameters. EDANet achieves 81.3 FPS on input images and still reaches 67.3 % mIoU on the Cityscapes test set.

Discussion. The current methods for improving execution speed usually ignore the significant role of context information. Most of them are devoted to scaling the network down to a small one, such as (Badrinarayanan et al., 2017; Paszke et al., 2016; Zhao et al., 2018a; Li et al., 2019). Although BiSeNet (Yu et al., 2018) constructs a spatial path and a context path to learn spatial details and context correlations respectively, it is still computationally expensive. For example, BiSeNet-1 contains 5.8 M parameters and BiSeNet-2 contains 49 M parameters. For a given input resolution, BiSeNet-1 involves 14.8 G FLOPs and BiSeNet-2 involves 55.3 G FLOPs. Different from these methods, in this paper, we use the proposed context propagation mechanism to construct a real-time semantic segmentation framework.

3. Proposed method

In this section, we introduce our method. First, we provide an overview of our BCPNet. Then, we introduce its key component, the BCP module, in detail.

3.1. Architecture of BCPNet

As shown in Fig.3, our BCPNet is a variant of the encoder-decoder framework. For the encoder, we choose MobileNet (Sandler et al., 2018) as our backbone network. In the backbone network, we do not use any dilated convolutions to keep the resolution of the feature maps. Instead, the feature maps are quickly downsampled to a small fraction of the input resolution, which makes the network computationally modest. Considering that a light-weight network cannot learn a complex representation of context, we choose a simple context aggregation approach similar to (Zhao et al., 2017). Differently, first, we formulate the process of context aggregation in a serial manner instead of a parallel manner. Second, we choose max pooling instead of average pooling to aggregate context information. For our network, we find that using max pooling to aggregate context yields a better performance (Table 5). As Fig.3 describes, the output of the backbone network is fed successively into two max pooling operations with fixed kernel size and stride, so that the context map is further condensed into a smaller and more global representation. Then, the proposed BCP module (Fig.3 (b)) is employed to back-propagate the aggregated context information to the shallow layers, which effectively makes the shallow-layer features context-aware. In our experiments, we set Layer-3 as the last shallow layer in the transmission. Finally, we directly use the features from Layer-3 to output the final prediction.
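
To make the data flow described above concrete, the sketch below wires a MobileNetV2 feature extractor, serial max-pooling context aggregation, a simple context-fusion step, and prediction from a shallow feature map. It is only a rough sketch under assumed channel widths, pooling sizes, layer split points, and upsampling choices (the class name BCPNetSketch and its stand-in fusion block are ours); it is not the released BCPNet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class BCPNetSketch(nn.Module):
    """Rough sketch of the data flow in Fig. 3: a MobileNetV2 encoder, serial
    max-pooling context aggregation, context back-propagation to a shallow
    stage, and prediction from that shallow (higher-resolution) feature map.
    Channel widths, pooling sizes, and the fusion block are placeholders."""

    def __init__(self, num_classes=19):
        super().__init__()
        features = torchvision.models.mobilenet_v2().features
        self.shallow = features[:7]    # "Layer-3"-like shallow stage (assumed split, 32 ch, 1/8)
        self.deep = features[7:]       # remaining encoder stages (1280 ch, 1/32)
        # Serial context aggregation: two successive max poolings (sizes assumed).
        self.context_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
        self.reduce = nn.Conv2d(1280, 32, 1)          # match the shallow channel width
        self.fuse = nn.Sequential(                    # stand-in for the BCP module
            nn.Conv2d(32, 32, 3, padding=1, groups=32),   # depthwise (separable) conv
            nn.Conv2d(32, 32, 1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        shallow = self.shallow(x)                             # spatial details
        context = self.context_pool(self.deep(shallow))       # condensed, more global context
        context = self.reduce(context)
        # Back-propagate the context to the shallow layer and fuse.
        context = F.interpolate(context, size=shallow.shape[2:],
                                mode="bilinear", align_corners=False)
        out = self.classifier(self.fuse(shallow + context))
        return F.interpolate(out, size=x.shape[2:], mode="bilinear", align_corners=False)

print(BCPNetSketch()(torch.randn(1, 3, 256, 512)).shape)   # torch.Size([1, 19, 256, 512])
```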

Our BCPNet only contains 0.61 M parameters, which means it is suitable even for mobile devices, such as mobile phones. Based on this light-weight architecture, our BCPNet achieves up to 585.9 FPS, i.e., only 1.7 ms runtime per image. Although our BCPNet is light-weight, it still achieves promising accuracy. For example, it achieves 68.4 % mIoU on the Cityscapes test set and 67.8 % mIoU on the CamVid test set.

3.2. Details of BCP module

The key component of our BCPNet is the BCP module, which is shown in Fig.3 (b). Our BCP module is composed of two top-down paths and one bottom-up path. The top-down path is designed to back-propagate the context information to the shallow layers. The bottom-up path is designed to boost the fusion process between spatial details and context information. In particular, as shown in Fig.3 (b), we arrange the top-down and bottom-up paths in an alternating manner. Each layer of the top-down and bottom-up paths contains two components, i.e., the weighted sum and the separable convolution (Chollet, 2017). As Eq.1 describes, the weighted sum summarizes the information of neighboring layers by using learnable scaling factors.

(1)   F_out = α · F_low + β · F_high

in which F_low represents the features from the lower layer, which contain relatively more spatial details, and F_high represents the features from the higher layer, which contain relatively more context correlations. α and β are the learnable scaling factors for F_low and F_high, respectively.
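
A minimal sketch of one such fusion step is given below: a learnable weighted sum of a lower-layer and a higher-layer feature map, followed by a depthwise-separable convolution. The class name WeightedSumFusion, the channel count, the initialization of the scaling factors, and the bilinear resizing are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedSumFusion(nn.Module):
    """One fusion step along the top-down/bottom-up paths: a learnable weighted
    sum of a low-level feature (more spatial detail) and a high-level feature
    (more context), followed by a depthwise-separable convolution (Eq. 1)."""

    def __init__(self, channels=32):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # weight for the low-level features
        self.beta = nn.Parameter(torch.ones(1))    # weight for the high-level features
        self.separable = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depthwise
            nn.Conv2d(channels, channels, 1),                              # pointwise
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, f_low, f_high):
        # Resize the context features to the spatial size of the detailed ones.
        if f_high.shape[2:] != f_low.shape[2:]:
            f_high = F.interpolate(f_high, size=f_low.shape[2:],
                                   mode="bilinear", align_corners=False)
        fused = self.alpha * f_low + self.beta * f_high   # Eq. (1)
        return self.separable(fused)

fusion = WeightedSumFusion(32)
out = fusion(torch.randn(1, 32, 64, 128), torch.randn(1, 32, 16, 32))
print(out.shape)   # torch.Size([1, 32, 64, 128])
```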

Our BCP module is also light-weight, containing only 0.18 M parameters, but it is highly effective. We visualize the feature maps before and after being processed by the BCP module, as shown in Fig.4. We find that the feature map before the BCP module's processing (Fig.4 (b)) contains a large number of details, such as contours and textures, but it cannot be directly used for the final prediction. The reasons can be summarized in two aspects. First, it includes too much noise, which seriously interferes with the inference of the final classifier. Second, the lack of context information weakens the classifier's perception of semantic regions. After being processed by the BCP module, however, the feature maps (Fig.4 (c)) become more semantic-aware. For example, they pay more attention to salient semantic regions, such as persons, cars, and trees. Moreover, they obviously contain less noise. The ablation study in Section 4.4 also validates the effectiveness of our BCP module.

Figure 4. Visualization of the feature maps of Layer-3 before and after being processed by our BCP module. In particular, for (b) and (c), red (blue) indicates a higher (lower) response at a pixel.
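
Heat maps like those in Fig.4 can be produced by hooking an intermediate layer and rendering its channel-averaged response. The sketch below shows one common way to do this in PyTorch; the helper name, the chosen layer, and the colormap are illustrative, and the commented usage refers to the BCPNetSketch from Section 3.1 rather than the exact procedure used for the figure.

```python
import torch
import matplotlib.pyplot as plt

def visualize_response(model, layer, image_tensor, out_path="response.png"):
    """Capture a layer's output with a forward hook and save the channel-mean
    activation as a heat map (red = higher response, blue = lower)."""
    captured = {}

    def hook(module, inputs, output):
        captured["feat"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(image_tensor)
    handle.remove()

    response = captured["feat"].mean(dim=1)[0]           # average over channels
    plt.imshow(response.cpu().numpy(), cmap="jet")       # red high, blue low
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")

# Example with the sketch model defined earlier (hypothetical layer choice):
# model = BCPNetSketch().eval()
# visualize_response(model, model.shallow, torch.randn(1, 3, 256, 512))
```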

4. Experiments

In this section, we conduct extensive experiments to validate the effectiveness of our method. First, we provide analyses of parameters and FLOPs, speed, and accuracy. Then, we present an ablation study to investigate the influence of each component.

4.1. Parameters and FLOPs analysis

The numbers of parameters and FLOPs are important evaluation criteria for real-time semantic segmentation. Therefore, in this section, we provide an extensive parameters and FLOPs analysis for our method, which is summarized in Table 1. For a better comparison, we classify the current related methods into four categories, i.e., large model, middle model, small model, and tiny model. 1) A large model means the network's parameters and FLOPs are more than 200 M and 300 G, respectively. 2) A middle model means the network's FLOPs are between 100 G and 300 G. 3) A small model means the network's parameters are between 1 M and 100 M, or its FLOPs are between 10 G and 100 G. 4) A tiny model means the network's parameters are less than 1 M and its FLOPs are less than 10 G. Our BCPNet contains 0.61 M parameters in total. Even for the largest input resolution considered, it only involves 4.5 G FLOPs. Therefore, as shown in Table 1, we classify our BCPNet into the category of tiny model.
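
The parameter counts used for this categorization can be read directly off a PyTorch model, while FLOPs are typically obtained with an external profiler (omitted here). The helper below is a small illustrative sketch of the four-way categorization; the function names are ours.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def size_category(params_m: float, flops_g: float) -> str:
    """Map (params in M, FLOPs in G) to the four categories used in Table 1."""
    if params_m > 200 and flops_g > 300:
        return "large"
    if 100 <= flops_g <= 300:
        return "middle"
    if 1 <= params_m <= 100 or 10 <= flops_g <= 100:
        return "small"
    return "tiny"

print(count_parameters(nn.Conv2d(3, 19, kernel_size=3)))   # toy module, in millions
print(size_category(0.61, 4.5))                            # -> "tiny" (our BCPNet)
```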

Compared with small models, our BCPNet shows consistently higher efficiency. For example, TwoColumn (Wu et al., 2017) has about 50 times our FLOPs. As for BiSeNet (Yu et al., 2018), our BCPNet has only about 1.2 % of the parameters and 4.5 % of the FLOPs of BiSeNet-2, and about 10 % of the parameters and 17 % of the FLOPs of BiSeNet-1. As for ICNet (Zhao et al., 2018a), our BCPNet reduces the parameters by about 98 % and the FLOPs by about 84 %. As for the two versions of DFANet (Li et al., 2019), our FLOPs are comparable, but we have much fewer parameters, e.g., our BCPNet has only about 13 % of the parameters of DFANet-B and 8 % of the parameters of DFANet-A.

Compared with the other tiny models, our BCPNet still performs better. As for ENet (Paszke et al., 2016), although our BCPNet has more parameters (approximately 0.2 M more), it has only about 13 % of the FLOPs of ENet. As for EDANet (Lo et al., 2019), our parameter count is comparable, but our BCPNet has only about 13 % of the FLOPs of EDANet.

Large model (Params > 200 M and FLOPs > 300 G):
  DeepLab (Chen et al., 2017): 262.1 M params, 457.8 G FLOPs
  PSPNet (Zhao et al., 2017): 250.8 M params, 412.2 G FLOPs
Middle model (FLOPs between 100 G and 300 G):
  SQ (Treml et al., 2016): 270 G FLOPs
  FRRN (Pohlen et al., 2017): 235 G FLOPs
  FCN-8S (Long et al., 2015): 136.2 G FLOPs
Small model (Params between 1 M and 100 M, or FLOPs between 10 G and 100 G):
  TwoColumn (Wu et al., 2017): 57.2 G FLOPs
  BiSeNet-2 (Yu et al., 2018): 49 M params, 55.3 G FLOPs
  SegNet (Badrinarayanan et al., 2017): 29.5 M params, 286 G FLOPs
  ICNet (Zhao et al., 2018a): 26.5 M params, 28.3 G FLOPs
  DFANet-A (Li et al., 2019): 7.8 M params, 3.4 G FLOPs
  BiSeNet-1 (Yu et al., 2018): 5.8 M params, 14.8 G FLOPs
  DFANet-B (Li et al., 2019): 4.8 M params, 2.1 G FLOPs
Tiny model (Params < 1 M and FLOPs < 10 G):
  EDANet (Lo et al., 2019): 0.68 M params, 8.97 G FLOPs
  ENet (Paszke et al., 2016): 0.4 M params, 3.8 G FLOPs
  Our BCPNet: 0.61 M params, 0.51 G / 1.12 G / 1.13 G / 2.53 G / 2.25 G / 4.50 G FLOPs
Table 1. Parameters and FLOPs analysis for our method. FLOPs values correspond to different input resolutions; for our BCPNet, FLOPs at six input resolutions are listed.

4.2. Speed analysis

In this section, we provide an extensive speed analysis for our method, which is summarized in Table 2. All speed results in Table 2 are measured on a single GTX TITAN X GPU card.
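
Runtime/FPS numbers of this kind are usually obtained with a warm-up phase and explicit GPU synchronization around a timed loop. The sketch below shows one typical way to measure them; the helper name measure_speed, the warm-up and run counts, and the input size are placeholders rather than the exact protocol used for Table 2.

```python
import time
import torch

@torch.no_grad()
def measure_speed(model, input_size=(1, 3, 512, 1024), warmup=10, runs=100):
    """Return (ms per image, FPS) for a single forward pass."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):                 # warm up cuDNN autotuning, caches, etc.
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    ms = (time.time() - start) / runs * 1000
    return ms, 1000.0 / ms

# e.g. ms, fps = measure_speed(BCPNetSketch())   # using the sketch model from Section 3.1
```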

Compared with small models, our BCPNet presents a consistently faster execution speed. For example, compared with TwoColumn (Wu et al., 2017), our BCPNet is about 235.7 FPS faster at the same input resolution. Compared with the two versions of BiSeNet (Yu et al., 2018), our BCPNet is faster at all input resolutions: at one resolution it is about 456.5 FPS faster than BiSeNet-2 and 382.4 FPS faster than BiSeNet-1, and at another it is about 86.9 FPS faster than BiSeNet-2 and 52.5 FPS faster than BiSeNet-1. Compared with ICNet (Zhao et al., 2018a), which achieves 30.3 FPS, our BCPNet is faster by about 25 FPS. Compared with the two versions of DFANet (Li et al., 2019), our BCPNet is about 60 FPS faster than DFANet-A and 20 FPS faster than DFANet-B at one input resolution; at another, BCPNet and DFANet-B are comparable, but BCPNet is faster than DFANet-A by about 16 FPS.

Compared with tiny models, our method still performs better. For example, compared with ENet, our BCPNet is about 450 FPS faster at one shared input resolution and about 88 FPS faster at another. Compared with EDANet, our BCPNet is faster by about 169 FPS.

Large model:
  DeepLab (Chen et al., 2017): 262.1 M params; 4000 ms / 0.25 FPS
Middle model:
  SQ (Treml et al., 2016): 60 ms / 16.7 FPS
  FCN-8S (Long et al., 2015): 500 ms / 2 FPS
  FRRN (Pohlen et al., 2017): 469 ms / 2.1 FPS
Small model:
  TwoColumn (Wu et al., 2017): 68 ms / 14.7 FPS
  BiSeNet-2 (Yu et al., 2018): 49 M params; 8 ms / 129.4 FPS, 21 ms / 47.9 FPS, 21 ms / 45.7 FPS, 43 ms / 23 FPS
  SegNet (Badrinarayanan et al., 2017): 29.5 M params; 69 ms / 14.6 FPS, 289 ms / 3.5 FPS, 637 ms / 1.6 FPS
  ICNet (Zhao et al., 2018a): 26.5 M params; 36 ms / 27.8 FPS, 33 ms / 30.3 FPS
  DFANet-A (Li et al., 2019): 7.8 M params; 8 ms / 120 FPS, 10 ms / 100 FPS
  BiSeNet-1 (Yu et al., 2018): 5.8 M params; 5 ms / 203.5 FPS, 12 ms / 82.3 FPS, 13 ms / 72.3 FPS, 24 ms / 41.4 FPS
  DFANet-B (Li et al., 2019): 4.8 M params; 6 ms / 160 FPS, 8 ms / 120 FPS
Tiny model:
  EDANet (Lo et al., 2019): 0.68 M params; 12.3 ms / 81.3 FPS
  ENet (Paszke et al., 2016): 0.4 M params; 7 ms / 135.4 FPS, 21 ms / 46.8 FPS, 46 ms / 21.6 FPS
  Our BCPNet: 0.61 M params; 1.7 ms / 585.9 FPS, 4 ms / 250.4 FPS, 5.5 ms / 181 FPS, 7.4 ms / 134.8 FPS, 9.8 ms / 102.6 FPS, 18.2 ms / 55 FPS, 8.6 ms / 116.2 FPS, 18.2 ms / 55 FPS
Table 2. Speed analysis for our BCPNet. Each runtime (ms) / speed (FPS) pair corresponds to a different input resolution. In particular, all execution speeds are measured on a single TITAN X GPU card.

4.3. Accuracy analysis

In this section, we provide an accuracy analysis for our method. First, we introduce the implementation details of our experiments. Then, we compare our method's accuracy with the current state-of-the-art methods on the Cityscapes (Cordts et al., 2016) and CamVid (Brostow et al., 2009) datasets.

4.3.1. Implementation details

We implement our code on the PyTorch platform (https://pytorch.org). Following (He et al., 2019; Chen et al., 2017), we use the "poly" learning rate strategy, i.e., lr = lr_init × (1 − iter / iter_max)^power, with a fixed initial learning rate and power for all experiments. To reduce the risk of over-fitting, we adopt data augmentation: the input images are randomly flipped and scaled by factors from 0.5 to 2. We choose Stochastic Gradient Descent (SGD) as our training optimizer, with fixed momentum and weight decay, and fix the training crop size for each dataset. For the CamVid dataset, we set the mini-batch size to 48; for the Cityscapes dataset, due to limited GPU resources, we set the mini-batch size to 36. For all experiments, the training process ends after 200 epochs.
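
In PyTorch, the "poly" schedule can be realized with a LambdaLR scheduler on top of the SGD optimizer, as sketched below. The base learning rate, power, momentum, weight decay, and iteration count are placeholder values only, since the exact settings are not reproduced here.

```python
import torch

# Placeholder hyper-parameters: the paper fixes the initial learning rate,
# power, momentum, and weight decay, but the exact numbers are not given here.
base_lr, power, max_iter = 0.01, 0.9, 1000

model = torch.nn.Conv2d(3, 19, 1)     # stand-in for the segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)

# "poly" strategy: lr = base_lr * (1 - iter / max_iter) ** power
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iter) ** power)

for it in range(max_iter):
    # ... forward pass, loss.backward(), optimizer.step() would go here ...
    optimizer.zero_grad()
    scheduler.step()                  # decay the learning rate every iteration
```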

4.3.2. Cityscapes

The Cityscapes (Cordts et al., 2016) dataset is composed of 5000 finely annotated images and 20000 coarsely annotated images. In our experiments, only the finely annotated subset is used. The dataset includes 30 semantic classes in total; following (Zhao et al., 2018b, 2017), we only use 19 of them. The finely annotated subset contains 2975 images for training, 500 images for validation, and 1525 images for testing.

Performance. Compared with middle models, the accuracy of our method is slightly worse but still competitive. For example, compared with SQ (Treml et al., 2016), which has about 60 times our FLOPs, the accuracy of our BCPNet is still higher by about 8.6 %. FRRN (Pohlen et al., 2017) achieves 3.4 % higher accuracy than our BCPNet, but our BCPNet has only about 0.5 % of the FLOPs of FRRN and is faster than FRRN by about 248 FPS. Compared with FCN-8s (Long et al., 2015), which has about 120 times our FLOPs, the accuracy of BCPNet is still higher by about 5.3 %.

Compared with small models, the BCPNet achieves accuracy comparable to ICNet (Zhao et al., 2018a), DFANet-B (Li et al., 2019), and BiSeNet-1 (Yu et al., 2018), but with much fewer parameters: ICNet has about 43 times as many parameters as ours, DFANet-B about 7.8 times, and BiSeNet-1 about 9.5 times. Compared with TwoColumn (Wu et al., 2017), whose FLOPs are about 50 times ours, our BCPNet is about 4.5 % lower in accuracy. Compared with BiSeNet-2 (Yu et al., 2018), whose parameters are about 80 times ours, our BCPNet is about 6.3 % lower in accuracy. However, our BCPNet is much faster than TwoColumn and BiSeNet-2: it is about 235 FPS faster than TwoColumn and about 456 FPS faster than BiSeNet-2 at the respective input resolutions. Moreover, our method's accuracy is still much higher than that of SegNet (Badrinarayanan et al., 2017) (about 12.3 % higher), whose parameters and FLOPs are about 48 and 560 times ours.

Compared with the other tiny models, the BCPNet presents a consistently better performance. For example, our method's accuracy is higher than that of ENet (Paszke et al., 2016) by about 10.1 %, and higher than that of EDANet (Lo et al., 2019) by about 1.1 %.

We visualize some segmentation results of our method in Fig.5.

Figure 5. Visualized segmentation results of our method on the Cityscapes dataset.
Middle model:
  SQ (Treml et al., 2016): 59.8 mIoU
  FRRN (Pohlen et al., 2017): 71.8 mIoU
  FCN-8S (Long et al., 2015): 63.1 mIoU
Small model:
  TwoColumn (Wu et al., 2017): 72.9 mIoU
  BiSeNet-2 (Yu et al., 2018): 49 M params, 74.7 mIoU
  SegNet (Badrinarayanan et al., 2017): 29.5 M params, 56.1 mIoU
  ICNet (Zhao et al., 2018a): 26.5 M params, 69.5 mIoU
  DFANet-A (Li et al., 2019): 7.8 M params, 71.3 mIoU
  BiSeNet-1 (Yu et al., 2018): 5.8 M params, 68.4 mIoU
  DFANet-B (Li et al., 2019): 4.8 M params, 67.1 mIoU
Tiny model:
  EDANet (Lo et al., 2019): 0.68 M params, 67.3 mIoU
  ENet (Paszke et al., 2016): 0.4 M params, 58.3 mIoU
  Our BCPNet: 0.61 M params, 68.4 mIoU
Table 3. Results (mIoU, %) of our method on the Cityscapes test set.

4.3.3. CamVid

The CamVid dataset (Brostow et al., 2009) is collected from high-resolution video sequences of road scenes. It contains 367 images for training, 101 images for validation, and 233 images for testing. The dataset includes 32 semantic classes in total, but following (Zhao et al., 2018a; Li et al., 2019), only 11 of them are used in our experiments.

Performance. Although our BCPNet is light-weight in both parameters and FLOPs, its accuracy is higher than that of most current state-of-the-art methods. Compared with DeepLab (Chen et al., 2017), whose parameters and FLOPs are about 429 and 408 times ours, the accuracy of the BCPNet is still higher by about 6.2 %. Compared with SegNet (Badrinarayanan et al., 2017), whose parameters and FLOPs are about 48 and 560 times ours, our BCPNet's accuracy is higher by about 21.4 %. Compared with ICNet (Zhao et al., 2018a), whose parameters and FLOPs are about 43 and 6.2 times ours, our BCPNet's accuracy is higher by about 0.7 %. Compared with the two versions of DFANet (Li et al., 2019), our BCPNet presents a consistently better performance: its accuracy is higher than that of DFANet-A, whose parameters are about 12.7 times ours, by about 3.1 %, and higher than that of DFANet-B, whose parameters are about 7.8 times ours, by about 8.5 %. Compared with the two tiny models, our method achieves about 1.4 % higher accuracy than EDANet (Lo et al., 2019) and about 16.5 % higher than ENet (Paszke et al., 2016). Although our method's accuracy is about 0.9 % lower than that of BiSeNet-2 (Yu et al., 2018), our BCPNet has only about 1.2 % of its parameters. Moreover, compared with BiSeNet-1, whose parameters are about 9.5 times ours, the accuracy of our BCPNet is higher by about 2.2 %.

Large model:
  DeepLab (Chen et al., 2017): 262.1 M params, 61.6 mIoU
Small model:
  BiSeNet-2 (Yu et al., 2018): 49 M params, 68.7 mIoU
  SegNet (Badrinarayanan et al., 2017): 29.5 M params, 46.4 mIoU
  ICNet (Zhao et al., 2018a): 26.5 M params, 67.1 mIoU
  DFANet-A (Li et al., 2019): 7.8 M params, 64.7 mIoU
  BiSeNet-1 (Yu et al., 2018): 5.8 M params, 65.6 mIoU
  DFANet-B (Li et al., 2019): 4.8 M params, 59.3 mIoU
Tiny model:
  EDANet (Lo et al., 2019): 0.68 M params, 66.4 mIoU
  ENet (Paszke et al., 2016): 0.4 M params, 51.3 mIoU
  Our BCPNet: 0.61 M params, 67.8 mIoU
Table 4. Results (mIoU, %) of our method on the CamVid test set.

4.4. Ablation study

In this section, we conduct an ablation study to investigate the influence of each component of our method on the final accuracy. As shown in Table 5, without our context backpropagation mechanism, the backbone network, which contains 0.43 M parameters, can only achieve 58.891 % mIoU on the Cityscapes validation set. By applying our BCP module, the accuracy is improved to 67.842 % mIoU (about 9 % higher) with an increase of only 0.18 M parameters. This demonstrates the effectiveness of our BCP module. We further investigate the influence of the pooling operation used for context aggregation and find that max pooling yields a better performance. When we replace the max pooling with average pooling, the performance decreases to 67.311 % mIoU (about 0.5 % lower). When we replace it with a max pooling of larger kernel size, the performance decreases to 65.763 % mIoU (about 2 % lower); we attribute this to the mismatch between the (relatively) large kernel size and the (relatively) small feature resolution. Crop size also plays an important role in the final accuracy, as mentioned by (He et al., 2019), and we find that a larger crop size yields a better performance: when we adopt a larger crop size during training, the performance improves to 68.626 % mIoU.

Backbone only: 0.43 M params, 58.891 mIoU
Backbone + BCP, max pooling: 0.61 M params, 67.842 mIoU
Backbone + BCP, average pooling: 0.61 M params, 67.311 mIoU
Backbone + BCP, max pooling with larger kernel: 0.61 M params, 65.763 mIoU
Backbone + BCP, max pooling, larger crop size: 0.61 M params, 68.626 mIoU
Table 5. An ablation study of our method on the Cityscapes validation set.

5. Conclusion

In this paper, we propose a new Bi-direction Contexts Propagation Network (BCPNet) based on the proposed context backpropagation mechanism. We find that back-propagating context information to the shallow layers is computationally more modest than the traditional spatial-details transmission. This enables our BCPNet to perform semantic segmentation in real time. Extensive experiments validate that our BCPNet achieves a new state-of-the-art balance between accuracy and speed.

References

  • V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §2.2, §2.2, §4.3.2, §4.3.3, Table 1, Table 2, Table 3, Table 4.
  • G. J. Brostow, J. Fauqueur, and R. Cipolla (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §1, §2.2, §4.3.3, §4.3.
  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §1, §2.1, §2.1, §4.3.1, §4.3.3, Table 1, Table 2, Table 4.
  • F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1251–1258. Cited by: §3.2.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §1, §4.3.2, §4.3.
  • Y. Guo, P. Dong, S. Hao, L. Wang, G. Wu, and D. Shen (2016) Automatic segmentation of hippocampus for longitudinal infant brain mr image sequence by spatial-temporal hypergraph learning. In International Workshop on Patch-based Techniques in Medical Imaging, pp. 1–8. Cited by: §1.
  • Y. Guo, Y. Gao, and D. Shen (2015) Deformable MR prostate segmentation via deep feature learning and sparse patch matching. IEEE Transactions on Medical Imaging 35 (4), pp. 1077–1089. Cited by: §1.
  • J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao (2019) Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7519–7528. Cited by: §1, §2.1, §4.3.1, §4.4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1.
  • H. Li, P. Xiong, H. Fan, and J. Sun (2019) Dfanet: deep feature aggregation for real-time semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9522–9531. Cited by: §2.2, §2.2, §4.1, §4.2, §4.3.2, §4.3.3, §4.3.3, Table 1, Table 2, Table 3, Table 4.
  • W. Liu, A. Rabinovich, and A. C. Berg (2015) Parsenet: looking wider to see better. arXiv preprint arXiv:1506.04579. Cited by: §1.
  • S. Lo, H. Hang, S. Chan, and J. Lin (2019) Efficient dense modules of asymmetric convolution for real-time semantic segmentation. In Proceedings of the ACM Multimedia Asia, pp. 1–6. Cited by: §2.2, §4.1, §4.3.2, §4.3.3, Table 1, Table 2, Table 3, Table 4.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §4.3.2, Table 1, Table 2, Table 3.
  • A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §2.2, §2.2, §4.1, §4.3.2, §4.3.3, Table 1, Table 2, Table 3, Table 4.
  • T. Pohlen, A. Hermans, M. Mathias, and B. Leibe (2017) Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4151–4160. Cited by: §4.3.2, Table 1, Table 2, Table 3.
  • M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §3.1.
  • M. Siam, S. Elkerdawy, M. Jagersand, and S. Yogamani (2017) Deep semantic segmentation for automated driving: taxonomy, roadmap and challenges. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pp. 1–8. Cited by: §1.
  • M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich, et al. (2016) Speeding up semantic segmentation for autonomous driving. In MLITS, NIPS Workshop, Vol. 2, pp. 7. Cited by: §1, §4.3.2, Table 1, Table 2, Table 3.
  • Y. Tsai, X. Shen, Z. Lin, K. Sunkavalli, and M. Yang (2016) Sky is not the limit: semantic-aware sky replacement.. ACM Trans. Graph. 35 (4), pp. 149–1. Cited by: §1.
  • Z. Wu, C. Shen, and A. v. d. Hengel (2017) Real-time semantic image segmentation via spatial sparsity. arXiv preprint arXiv:1712.00213. Cited by: §4.1, §4.2, §4.3.2, Table 1, Table 2, Table 3.
  • C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Bisenet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 325–341. Cited by: §2.2, §2.2, §4.1, §4.2, §4.3.2, §4.3.3, Table 1, Table 2, Table 3, Table 4.
  • F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §1, §2.1.
  • H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7151–7160. Cited by: §1, §2.1.
  • H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018a) Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 405–420. Cited by: §2.2, §2.2, §4.1, §4.2, §4.3.2, §4.3.3, §4.3.3, Table 1, Table 2, Table 3, Table 4.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §1, §2.1, §3.1, §4.3.2, Table 1.
  • H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia (2018b) Psanet: point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 267–283. Cited by: §4.3.2.