Real-time Semantic Segmentation with Fast Attention

07/07/2020 · by Ping Hu, et al.

In deep CNN based models for semantic segmentation, high accuracy relies on rich spatial context (large receptive fields) and fine spatial details (high resolution), both of which incur high computational costs. In this paper, we propose a novel architecture that addresses both challenges and achieves state-of-the-art performance for semantic segmentation of high-resolution images and videos in real-time. The proposed architecture relies on our fast spatial attention, which is a simple yet efficient modification of the popular self-attention mechanism and captures the same rich spatial context at a small fraction of the computational cost, by changing the order of operations. Moreover, to efficiently process high-resolution input, we apply an additional spatial reduction to intermediate feature stages of the network with minimal loss in accuracy, thanks to the use of the fast attention module to fuse features. We validate our method with a series of experiments on multiple datasets, demonstrating superior accuracy and speed compared to existing approaches for real-time semantic segmentation. On Cityscapes, our network achieves 74.4% mIoU at 72 FPS and 75.5% mIoU at 58 FPS on a single Titan X GPU, which is ∼50% faster than the state-of-the-art while retaining the same accuracy.







I Introduction

Semantic segmentation is a fundamental task in robotic sensing and computer vision, aiming to predict dense semantic labels for given images [32, 13, 3, 31, 44, 33]. With the ability to extract scene context such as the category, location, and shape of objects and stuff (everything else), semantic segmentation can be widely applied to many important applications like robotics [21, 43, 46] and autonomous driving [9, 56, 30]. For many of these applications, efficiency is critical, especially in real-time (30 FPS) scenarios. To achieve high-accuracy semantic segmentation, previous methods rely on features enhanced with rich contextual cues [55, 4, 52, 11, 19] and high-resolution spatial details [54, 35]. However, rich contextual cues are typically captured via very deep networks with sizable receptive fields [55, 4, 52, 11] that require high computational costs, while detailed spatial information demands high-resolution inputs [54, 35], which incur high FLOPs during inference.

Recent efforts have been made to accelerate models for real-time applications [29, 54, 35, 49, 28, 33]. These efforts can be roughly grouped into two types. The first strategy is to adopt compact and shallow model architectures [54, 35, 36, 49]. However, this approach may decrease the model capacity and limit the receptive field of the features, thereby decreasing the model's discriminative ability. The other technique is to restrict the input to low resolution [36, 49, 28]. Though this greatly decreases computational complexity, low-resolution images may lose important details like object boundaries or small objects. As a result, both types of methods sacrifice effectiveness for speed, limiting their practical applicability.

In this work, we address these challenges by proposing the Fast Attention Network (FANet) for real-time semantic segmentation. To capture rich spatial contextual information, we introduce an efficient fast attention module. The original self-attention mechanism has been shown to be beneficial for various vision tasks [47, 45] due to its ability to capture non-local context from the input feature maps. However, given c channels and a spatial size of n = h×w, the original self-attention [47, 45] has a computational complexity of O(n²c), which is quadratic with respect to the feature's spatial size n. In the task of semantic segmentation, where high-resolution feature maps are required, this is costly and limits the model's efficiency and its application to real-time scenarios. Instead, in our fast attention module, we replace the Softmax normalization used in self-attention with cosine similarity, thus converting the computation into a series of matrix multiplications upon which matrix-multiplication associativity can be applied to reduce the computational complexity to a linear O(nc²), without loss of spatial information. The proposed fast attention is n/c times more efficient than the standard self-attention, given n ≫ c in semantic segmentation (e.g. n = 128×256 and c = 512).
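As a rough sanity check of these complexity figures, the small sketch below counts only the dominant matrix-multiplication FLOPs for the two orders of computation; the sizes are the illustrative ones from the text, and the counting is a simplification (it ignores projections, Softmax, and normalization):

```python
# Compare the dominant matrix-multiply cost of standard self-attention,
# (Q K^T) V, against fast attention, Q (K^T V), for a feature map with
# n = h*w spatial positions and c channels.

def attention_flops(n, c):
    # standard: (n x c) @ (c x n), then (n x n) @ (n x c) -> quadratic in n
    standard = n * n * c + n * n * c
    # fast: (c x n) @ (n x c), then (n x c) @ (c x c) -> linear in n
    fast = c * n * c + n * c * c
    return standard, fast

n, c = 128 * 256, 512          # e.g. a 128x256 feature map with 512 channels
standard, fast = attention_flops(n, c)
print(f"standard: {standard / 1e9:.1f} GFLOPs, fast: {fast / 1e9:.1f} GFLOPs")
print(f"speedup = n/c = {standard // fast}x")
```

With these sizes the ratio comes out to exactly n/c = 64, matching the claimed efficiency gain.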

FANet works by first extracting different stages of feature maps, which are then enhanced by fast attention modules and finally merged from deep to shallow stages in a cascaded way for class label prediction. Moreover, to process high-resolution inputs at real-time speed, we apply additional spatial reduction in FANet. Rather than directly down-scaling the input images, which loses spatial details, we opt for down-sampling intermediate feature maps. This strategy not only reduces computation but also enables the lower layers to learn to extract features from high-resolution spatial details, enhancing FANet's effectiveness. As a result, with very low computational cost, FANet makes use of both rich contextual information and full-resolution spatial details. We conduct extensive experiments to validate our proposed approach, and the results on multiple datasets demonstrate that FANet achieves the fastest speed with state-of-the-art accuracy when compared to previous approaches for real-time semantic segmentation. Furthermore, in pursuit of better performance on video streams, we generalize the fast attention module to spatial-temporal contexts, and show (in Sec. IV) that this has the same computational cost as the single-frame model and does not increase with the length of the temporal range. This allows us to add rich spatial-temporal context to video semantic segmentation while avoiding an increase in computation.

In summary, we contribute the following: (1) We introduce the fast attention module for non-local context aggregation for efficient semantic segmentation, and further generalize it to a spatial-temporal version for video semantic segmentation. (2) We empirically show that applying extra spatial reduction to intermediate feature stages of the network effectively decreases computational costs while providing the model with rich spatial details. (3) We present a Fast Attention Network for real-time semantic segmentation of images and videos with state-of-the-art accuracy and much higher efficiency than previous approaches.

II Related Work

Extracting rich context information is key for high-quality semantic segmentation [27, 10, 39, 7]. To this end, dilated convolutions [5, 50] were proposed as an effective tool to enlarge the receptive field without shrinking spatial resolution [24, 10]. DeepLab [4] and PSPNet [55] capture multi-scale spatial context. The encoder-decoder architecture is another effective way of extracting spatial context. Early works like SegNet [1] and U-Net [41] adopt symmetric structures for the encoder and decoder. RefineNet [26] designs a multi-path refinement module to enhance the feature maps from deep to shallow. GCN [37, 53] explicitly refines predictions with large-kernel filters at different stages. Recently, DeepLab-v3+ [6] integrated dilated convolution and spatial pyramid pooling into an encoder-decoder network to further boost effectiveness. The self-attention mechanism [47, 45] has been applied to semantic segmentation [23, 14] with a superior ability to capture long-range dependencies, which, however, may incur intensive computation. To achieve better efficiency, Zhu et al. [58] propose to sample sparse anchor pixel locations to save computation. Huang et al. [17] only consider the pixels on the same column and row. Although these methods reduce computation, they all approximate the self-attention model and only partially collect the spatial information. In contrast, our fast attention not only greatly saves computation, but also captures full information from the feature map without loss of spatial information.

We also notice that there are several works on bilinear feature pooling [51, 8] that are related to our fast attention. Yet, our work differs from them in three aspects. (1) [51, 8] approximate the affinity between pixels, while our fast attention is derived in a strictly equivalent form that builds the exact affinity. (2) Unlike [51, 8], which focus on recognition tasks, our fast attention effectively tackles the dense semantic segmentation task. (3) As we will show later, in contrast to [51, 8], our fast attention allows for very efficient feature reuse in the video scenario, which can benefit video semantic segmentation with extra temporal context without increasing computation.

Existing methods for video semantic segmentation can be grouped into two types. The first [16, 25, 18, 42, 57, 38, 48, 22] takes advantage of the redundant information in video frames, and reduces computation by reusing the high-level features computed at keyframes. These methods run very efficiently, but often struggle with spatial misalignment between frames, which decreases accuracy. The second type ignores the redundancy and instead focuses on capturing temporal context from neighboring frames for better effectiveness [12, 20, 34], which, however, incurs extra computation that sharply decreases efficiency. In contrast to these methods, our FANet can easily be extended to also aggregate temporal context while allowing for efficient feature reuse, achieving both high effectiveness and efficiency.

III Fast Attention Network

In this section, we describe the Fast Attention Network (FANet) for real-time image semantic segmentation. We start by presenting the fast attention module and analyzing its computational advantages over original self-attention. Then we introduce the architecture of FANet. Last, we show that extra spatial reduction at intermediate feature stages of the model enables us to extract rich spatial details from high-resolution inputs while keeping a low computational cost.

Fig. 1: (a) Architecture of the Fast Attention Network (FANet). (b) Structure of Fast Attention (FA). (c) Structure of the "FuseUp" module. Distinct from channel attention (CA), which only aggregates features along the channel dimension for each pixel independently, our fast attention aggregates contextual information over the spatial domain, thus achieving better effectiveness.

III-A Fast Attention Module

The self-attention module [47, 45] aims to capture non-local contextual information for each pixel location as a weighted sum of features at all positions in the feature map. Given a flattened input feature map X ∈ ℝ^(n×c), where c is the channel size and n = h×w is the spatial size, the self-attention model [47, 45] applies 1×1 convolutions to encode the feature maps into a Value map V ∈ ℝ^(n×c) that contains the semantic information of each pixel, and a Query map Q ∈ ℝ^(n×c′) together with a Key map K ∈ ℝ^(n×c′) that are used to build correlations between pixel positions. The self-attention is then calculated as Y = f(Q, K)V, where f(·,·) is the affinity operation modeling the pairwise relations between all spatial locations. The Softmax function is typically used to model the affinity f, resulting in the popular self-attention response [23, 58, 17],

    Y = Softmax(QKᵀ)V.    (1)

Due to the normalization term in the Softmax function, computing Eq. (1) requires first evaluating the inner matrix product QKᵀ, and only then the outer one. This results in a computational complexity of O(n²c). In semantic segmentation, feature maps have high spatial resolution; as the complexity is quadratic with respect to the spatial size n, this incurs high computational and memory costs, limiting applications to scenarios requiring real-time speed.

We tackle this challenge by first removing the Softmax affinity. As indicated in [47], a number of other affinity functions can be used instead. For example, the dot-product affinity can be computed simply as f(Q, K) = QKᵀ. However, directly adopting the dot product may lead to an affinity with unbounded values, since the entries of QKᵀ can be arbitrarily large. To avoid this, we instead use normalized cosine similarity for the affinity computation,

    Y = (1/n) Q̂ K̂ᵀ V,    (2)

where Q̂ and K̂ are the results of Q and K after L2-normalization along the channel dimension. Unlike Eq. (1), we observe that Eq. (2) is a pure series of matrix multiplications, which means that we can apply standard matrix-multiplication associativity to change the order of computation and achieve our fast attention as follows,

    Y = (1/n) Q̂ (K̂ᵀ V),    (3)

where n is the spatial size, and K̂ᵀV is computed first.
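The associativity step can be verified numerically. The minimal numpy sketch below (random features; the shapes are arbitrary stand-ins, not the paper's) checks that the two groupings of the products produce the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, cp = 64, 16, 8          # spatial size, value channels, query/key channels

Q = rng.standard_normal((n, cp))
K = rng.standard_normal((n, cp))
V = rng.standard_normal((n, c))

# L2-normalize Q and K along the channel dimension (cosine-similarity affinity)
Qh = Q / np.linalg.norm(Q, axis=1, keepdims=True)
Kh = K / np.linalg.norm(K, axis=1, keepdims=True)

# Eq. (2)-style: build the n x n affinity explicitly, then apply it to V
slow = (Qh @ Kh.T) @ V / n
# Eq. (3)-style: regroup via associativity; only c' x c intermediates, never n x n
fast = Qh @ (Kh.T @ V) / n

assert np.allclose(slow, fast)   # identical up to floating-point error
print("max abs diff:", np.abs(slow - fast).max())
```

The two results agree to floating-point precision; only the cost differs, since the regrouped form never materializes the n×n affinity matrix.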

Without loss of generality, this fast attention module can be computed with a computational complexity of O(ncc′), which is only about c/n of the computational requirement of Eq. (1) (note that n is typically much larger than c in semantic segmentation). An illustration of the fast attention module is shown in Fig. 1 (b). We note that channel attention (CA) [11] has a similar computation to our FA; yet CA only aggregates features along the channel dimension for each pixel, while our fast attention aggregates contextual information over the spatial domain, making it more effective.

III-B Network Architecture

We describe our architecture for image semantic segmentation in Fig. 1 (a). The network is an encoder-decoder architecture with three components: encoder (left), context aggregation (middle), and decoder (right). We use a light-weight backbone (ResNet-18 [15] without the last fully connected layers) as the encoder to extract features from the input at different semantic levels. Given an input of resolution h×w, the first res-block ("Res-1") in the encoder produces feature maps at 1/4 of the input resolution, and the other blocks sequentially output feature maps downsampled by a further factor of 2. Our network applies a fast attention module at each stage. As shown in Fig. 1 (b), the fast attention module is composed of three 1×1 convolutional layers that embed the input features into the Query, Key, and Value maps respectively. When generating the Query and Key maps, we remove the ReLU layer to allow for a wider range of correlations between pixels; the L2-normalization along the channel dimension then ensures the affinity lies between −1 and +1. After the feature pyramid is processed by the fast attention modules, the decoder gradually merges and upsamples the features sequentially from deep feature maps to shallow ones. To enhance the decoded features with high-level context, we further connect the middle features via a skip connection. The final segmentation is predicted from the enhanced feature output by the decoder and resized to the input resolution.
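A schematic forward pass of one fast attention module can be sketched in a few lines of numpy. This is a simplification of Fig. 1 (b): the 1×1 convolutions are reduced to per-pixel linear maps with random weights, and any residual path or output convolution of the actual module is omitted:

```python
import numpy as np

def fast_attention(x, Wq, Wk, Wv, eps=1e-12):
    """One fast attention forward pass on a flattened feature map.

    x      : (n, c) features, n = h*w spatial positions
    Wq, Wk : (c, c') Query/Key projections (no ReLU, so correlations
             between pixels may be negative)
    Wv     : (c, c) Value projection
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # cosine-similarity affinity: L2-normalize Q and K along channels
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + eps)
    n = x.shape[0]
    return q @ (k.T @ v) / n   # associativity keeps the cost linear in n

rng = np.random.default_rng(0)
h, w, c, cp = 8, 16, 32, 8
x = rng.standard_normal((h * w, c))
y = fast_attention(x,
                   rng.standard_normal((c, cp)),
                   rng.standard_normal((c, cp)),
                   rng.standard_normal((c, c)))
print(y.shape)   # same spatial extent and channel count as the Value map
```

The output keeps the same spatial extent as the input and the channel count of the Value map, so it can be fused stage-by-stage in the decoder as described above.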

III-C Extra Spatial Reduction for Real-time Speed

Generating semantic segmentation for high-resolution inputs efficiently is challenging. Typically, high-resolution inputs provide rich spatial details that help achieve better accuracy, but dramatically reduce efficiency [55, 4, 52, 37, 6]. On the other hand, using a smaller input resolution saves computational costs, but generates worse results due to the loss of spatial details [36, 49, 28].

To alleviate this, we adopt a simple yet effective strategy: we apply additional down-sampling operations to intermediate feature stages of the network rather than directly down-sampling the input images. We conduct an additional experiment in which we use different types of spatial reduction operations, such as pooling and strided convolution, at different feature stages, and evaluate how this impacts the resulting quality-speed trade-off. When applying an extra spatial reduction operator to our model, a matching up-sampling operation is added at the same stage of the decoder to keep the output resolution. The best of these options, as we show in Section IV-C, not only reduces computation for the upper layers, but also allows the lower layers to learn to extract rich spatial details from high-resolution inputs and enhance performance, thus allowing for both real-time efficiency and effectiveness with full-resolution input.
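The shape bookkeeping of this strategy can be illustrated with a toy sketch (pure numpy; the stride-2 slicing stands in for doubling a convolution's stride, and nearest-neighbor repetition stands in for the decoder-side up-sampling):

```python
import numpy as np

def downsample(x):
    """Stride-2 spatial reduction, a stand-in for doubling a conv stride."""
    return x[:, ::2, ::2]

def upsample(x):
    """Nearest-neighbor 2x upsampling, the matching decoder-side operation."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

feat = np.random.default_rng(0).standard_normal((16, 64, 128))  # (c, h, w)
reduced = downsample(feat)      # intermediate stages now run at half size
restored = upsample(reduced)    # the decoder restores the original resolution
print(feat.shape, reduced.shape, restored.shape)
```

All layers between the reduction and the matching up-sample operate on a quarter of the spatial positions, while the network's input and output resolutions are unchanged.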

III-D Extending to Video Semantic Segmentation

In many real-world applications of semantic segmentation, such as self-driving and robotics, video streams are the natural input for vision systems to understand the physical world. Nevertheless, most existing approaches for semantic segmentation focus on processing static images, and pay less attention to video data. In addition to spatial context from individual frames, video sequences also contain important temporal context derived from dynamics in the camera and scene. To take advantage of such temporal context for better accuracy, in this section we extend our fast attention module to spatial-temporal contexts, and show that it improves video semantic segmentation without increasing computational costs.

Given Q_t, K_t, V_t extracted from the target frame t, and K_{t−i}, V_{t−i} with i = 1, …, τ from the previous τ frames respectively, the spatial-temporal context within such a (τ+1)-frame window can be aggregated via the traditional self-attention [47] as,

    Y_t = Σ_{i=0}^{τ} Softmax(Q_t K_{t−i}ᵀ) V_{t−i}.    (4)

This has a computational complexity of O((τ+1)n²c), (τ+1) times higher than the single-frame spatial attention in Eq. (1).

By replacing the original self-attention with our fast attention, the spatial-temporal context for the target frame can be computed as

    Y_t = (1/((τ+1)n)) Σ_{i=0}^{τ} Q̂_t K̂_{t−i}ᵀ V_{t−i}    (5)
        = (1/((τ+1)n)) Q̂_t Σ_{i=0}^{τ} (K̂_{t−i}ᵀ V_{t−i}),    (6)

where n is the spatial size, and Q̂_t and K̂_{t−i} indicate the L2-normalized Q_t and K_{t−i} respectively. At time step t, the terms K̂_{t−i}ᵀV_{t−i} for i ≥ 1 have already been computed and can simply be reused. We can see in Eq. (6) that we only need to compute and store the new term K̂_tᵀV_t, add it to those of the previous frames (this matrix addition's cost is negligible), and multiply the sum by Q̂_t. Therefore, given a (τ+1)-frame window, our spatial-temporal fast attention has a computational complexity of O(nc²), which is as efficient as the single-frame fast attention and independent of τ. Thus, our fast attention is able to aggregate spatial-temporal context without increasing the computational cost. An illustration of the spatial-temporal FA is shown in Fig. 2. By replacing the fast attention modules with this spatial-temporal version, FANet sequentially segments video frames with features enhanced by spatial-temporal context.
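The feature-reuse argument can be checked directly: each frame contributes one c′×c term K̂ᵀV, and keeping a running sum over the window gives the same result as rebuilding the whole sum per frame. The sketch below uses random features with illustrative shapes and τ=2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, cp, tau = 64, 16, 8, 2     # spatial size, channels, q/k channels, window

def l2norm(a):
    return a / np.linalg.norm(a, axis=1, keepdims=True)

# (K-hat, V) pairs for the tau+1 frames in the window
frames = [(l2norm(rng.standard_normal((n, cp))),
           rng.standard_normal((n, c)))
          for _ in range(tau + 1)]
Qt = l2norm(rng.standard_normal((n, cp)))   # Query of the target frame

# Naive: rebuild the whole windowed sum for the target frame
naive = Qt @ sum(Kh.T @ V for Kh, V in frames) / ((tau + 1) * n)

# Streaming: keep a cached sum of K^T V terms; each new frame only adds
# one small c' x c matrix instead of recomputing the window
cache = np.zeros((cp, c))
for Kh, V in frames:
    cache += Kh.T @ V
streamed = Qt @ cache / ((tau + 1) * n)

assert np.allclose(naive, streamed)
```

Because the cache has fixed size c′×c, the per-frame cost of the streaming form does not grow with the temporal window τ.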

Fig. 2: Visualization of our fast attention for spatial-temporal context aggregation (τ=2). The red arrows indicate the features stored and reused by future frames.

IV Experiments

IV-A Datasets and Evaluation


Cityscapes [9] is a large benchmark containing 19 semantic classes for urban scene understanding, with 2975/500/1525 scenes for train/validation/test respectively.

CamVid [2] is another street-view dataset with 11 classes. The annotated frames are divided into 367/101/233 for training/validation/testing. COCO-Stuff [3] contains diverse indoor and outdoor scenes for semantic segmentation; it has 9,000 densely annotated images for training and 1,000 for testing. Following previous work [54], we adopt a resolution of 640×640 and evaluate on 182 classes, including 91 thing and 91 stuff classes. We evaluate our method on image semantic segmentation for all three datasets, and additionally evaluate on Cityscapes for video semantic segmentation. The mIoU (mean Intersection over Union) is reported for evaluation.

IV-B Implementation Details

We use ResNet-18/34 [15] pretrained on ImageNet as the encoder in FANet, and randomly initialize the parameters in the fast attention modules as well as the decoder network. We train using mini-batch stochastic gradient descent (SGD) with batch size 16, weight decay 5e−4, and momentum 0.9. The learning rate is initialized as 1e−2 and multiplied by (1 − iter/max_iter)^0.9 after each iteration. We apply data augmentation including random horizontal flipping, random scaling (from 0.75 to 2), random cropping, and color jittering during training. During testing, we input images at full resolution and resize the output to the original size for computing accuracy. All evaluation experiments are conducted with batch size 1 on a single Titan X GPU.
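The learning-rate schedule above can be sketched as a small function. Note that the exact decay exponent was garbled in extraction; the sketch assumes the power-0.9 polynomial ("poly") schedule that is standard in the segmentation literature:

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Polynomial learning-rate decay commonly used for segmentation training.

    The power=0.9 exponent is an assumption; the paper's exact value was
    lost in text extraction.
    """
    return base_lr * (1 - iteration / max_iter) ** power

# The rate decays smoothly from base_lr at iteration 0 to 0 at max_iter
lrs = [poly_lr(1e-2, it, 1000) for it in (0, 500, 1000)]
print(lrs)
```

Multiplying the base rate by this factor after each iteration reproduces the "multiplied by (1 − iter/max_iter)^0.9" rule in the text.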

C                  32    64   128   256   512  1024
Self-Att. [45]     68   103   173   313   602  1203
Ours              0.2   0.6   1.7     5    19    73
TABLE I: GFLOPs for the non-local module [47] and our fast attention module with C×128×256 features as input.

IV-C Method Analysis

Fast Attention. We first show the efficiency advantage of our fast attention. In Table I, we compare GFLOPs between a single original self-attention module and our fast attention module. Our fast attention runs significantly more efficiently for input features of various sizes, with more than 94% less computation.

We also compare our fast attention to the original self-attention module [45] within FANet. As shown in Table III, compared to the model without attention (denoted "w/o Att."), applying the original self-attention module to the network increases mIoU by 2.4% while decreasing the speed from 83 fps to 8 fps. In contrast to the original self-attention module, our fast attention (denoted "FA with L2-norm") achieves only slightly lower accuracy while greatly reducing the computational cost. To further analyze our cosine-similarity based fast attention, we also train without the L2-normalization of the Query and Key features (denoted "FA w/o L2-norm") and achieve 74.1% mIoU on the Cityscapes val set, which is lower than the 75.0% mIoU of our full model. This validates the necessity of cosine similarity to ensure bounded values in the affinity computation.
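The role of the L2-normalization in this ablation can be seen numerically: without it, the dot-product affinity scales with the feature magnitudes, while the cosine-similarity affinity stays bounded in [−1, 1]. A small sketch with random, deliberately large-magnitude features:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 10 * rng.standard_normal((64, 8))   # large-magnitude Query features
K = 10 * rng.standard_normal((64, 8))   # large-magnitude Key features

dot_affinity = Q @ K.T                  # unbounded: grows with feature norms
Qh = Q / np.linalg.norm(Q, axis=1, keepdims=True)
Kh = K / np.linalg.norm(K, axis=1, keepdims=True)
cos_affinity = Qh @ Kh.T                # bounded in [-1, 1]

print("max |dot affinity|:", np.abs(dot_affinity).max())
print("max |cos affinity|:", np.abs(cos_affinity).max())
```

The bounded affinity keeps the attention output on a stable scale regardless of feature magnitudes, consistent with the accuracy gap between the two variants reported above.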

Channels (c′)     8    16    32    64   128
mIoU (%)       73.5  74.6  75.0  75.0  75.0
Speed (fps)      74    74    72    69    65
TABLE II: Performance on Cityscapes val for different channel numbers (c′) of the fast attention in FANet-18.

In Table II, we analyze the influence of the number of channels c′ for the Key and Query maps in our fast attention module. As we can see, using too few channels (e.g. c′=8 or c′=16) saves computation but limits the representational capacity of the features, leading to lower accuracy. On the other hand, when increasing the channel number from 32 to 128, the accuracy remains stable while the speed drops. As a result, we adopt c′=32 in our experiments.

                    mIoU (%)  Speed (fps)  GFLOPs
w/o Att.              72.7        83         48
Self-Att. [45]        75.1         8        121
Channel-Att. [11]     74.6        70         51
FA w/o L2-norm        74.1        72         49
FA with L2-norm       75.0        72         49
TABLE III: Performance on Cityscapes for different attention mechanisms in FANet-18. "FA" denotes our fast attention.

Spatial Reduction. Next, we analyze the effect of applying extra spatial reduction at different feature stages of FANet. The effects of additionally down-sampling different blocks are presented in Fig. 3. As we can see, down-sampling before "Conv-0" (i.e., down-scaling the input image) reduces the computation of all subsequent layers, but loses critical spatial details, which reduces result quality. "Res-1" indicates that we reduce the spatial size at the stage of the first Res-block in FANet. Extra spatial reduction at higher stages like "Res-2", "Res-3", and "Res-4" does not increase speed significantly. Interestingly, we observe that applying down-sampling to "Res-4" actually performs better than "None" (no additional down-sampling). We hypothesize that this is because the block "Res-4" processes high-level features, and adding extra down-sampling helps enlarge the receptive field, thus providing richer contextual information. Based on these observations, and aiming at real-time semantic segmentation, we choose to apply extra down-sampling to "Res-1", and denote the resulting models FANet-18/34 based on the ResNet encoder used.

Fig. 3: Accuracy and speed analysis on Cityscapes val for adding an additional down-sampling operation (rate=2) to different stages of the encoder in FANet. "Conv-0" means directly down-sampling the input image. "Res-i" indicates doubling the stride of the first Conv layer in the i-th Res-block. "None" means no additional down-sampling operation is applied.

In addition to doubling the stride of convolutional layers, which achieves 75.0% mIoU, we also experiment with other forms of down-sampling, including average pooling (72.9% mIoU) and max pooling (74.2% mIoU). Enlarging the stride of the Conv layers performs best. This may be because strided convolution helps capture more spatial details while keeping sizable receptive fields.

IV-D Image Semantic Segmentation

We compare our final method to recent state-of-the-art approaches for real-time semantic segmentation. For fair comparison, we evaluate the speed of the different methods with PyTorch on the same Titan X GPU; please see our supplementary material for details. On benchmarks including Cityscapes [9], CamVid [2], and COCO-Stuff [3], our FANet achieves accuracy comparable to the state-of-the-art with the highest efficiency.

Methods          mIoU (%)       Speed   GFLOPs  GFLOPs@1Mpx  Input
                 val    test    (fps)                        Resolution
SegNet [1]        –     56.1      36      143       650       360×640
ICNet [54]      67.7    69.5      38       30        15      1024×2048
ERFNet [40]     71.5    69.7      48      103       206       512×1024
BiseNet [49]    74.8    74.7      47       67      59.5       768×1536
ShelfNet [59]     –     74.8      39       95      47.5      1024×2048
SwiftNet [35]   75.4    75.5      40      106        53      1024×2048
FANet-34        76.3    75.5      58       65      32.5      1024×2048
FANet-18        75.0    74.4      72       49      24.5      1024×2048
TABLE IV: Image semantic segmentation performance comparison with recent state-of-the-art real-time methods on the Cityscapes dataset. "GFLOPs@1Mpx" shows the GFLOPs for input with a resolution of 1M pixels.

Cityscapes. In Table IV, we present the speed-accuracy comparison. FANet-34 achieves 76.3% mIoU on validation and 75.5% on testing at a speed of 58 fps with full-resolution (1024×2048) inputs. To the best of our knowledge, FANet-34 outperforms existing approaches for real-time semantic segmentation with better speed and state-of-the-art accuracy. By adopting the lighter-weight ResNet-18 encoder, our FANet-18 further accelerates the speed to 72 fps, nearly twice as fast as recent methods like ShelfNet [59] and SwiftNet [35]. Although the accuracy drops to 75.0% mIoU on validation and 74.4% on testing, it is still much better than many previous methods like SegNet [1] and ICNet [54], and comparable to the most recent methods like BiseNet [49] and ShelfNet [59]. The performance achieved by our models demonstrates a superior ability to balance accuracy and speed for real-time semantic segmentation. Some visual results of our method are shown in Fig. 4.

CamVid. Results for this dataset are reported in Table V. As we can see, our FANet outperforms previous methods with better accuracy and much faster speed. Compared to BiseNet [49], our FANet-18 runs about twice as fast, and our FANet-34 outperforms it by 1.4% mIoU at a faster speed.

COCO-Stuff. To be consistent with previous methods [54], we evaluate at a resolution of 640×640 for segmenting the 182 categories. As shown in Table V, on this general scene-understanding task our FANet also achieves satisfying accuracy at a much faster speed than previous methods. Compared to the state-of-the-art real-time model ICNet [54], our FANet-34 achieves both better accuracy and speed, and FANet-18 further accelerates the speed with comparable mIoU.

Fig. 4: Image semantic segmentation results on Cityscapes.
CamVid:
Method         mIoU (%)  Speed (fps)
SegNet [1]       55.6        12
ENet [36]        51.3        46
ICNet [54]       67.1        82
BiseNet [49]     68.7        75
FANet-34         70.1       121
FANet-18         69.0       154

COCO-Stuff:
Method         mIoU (%)  Speed (fps)
FCN [27]         22.7         9
DeepLab [4]      26.9        14
ICNet [54]       29.1       110
BiseNet [49]     25.6       113
FANet-34         29.5       142
FANet-18         27.8       191

TABLE V: Image semantic segmentation performance on CamVid (top) and COCO-Stuff (bottom).
Method              mIoU (%)  Speed (fps)  Avg RT (ms)  MaxLatency (ms)
DVSNet-fast [48]      63.2       30.4          33             –
Clockwork [42]        64.4        5.6         177            221
DFF [57]              69.2        5.7         175            644
Accel [18]            72.1        2.9         340            575
Low-Latency [25]      75.9        7.5         133            133
Netwarp [12]          80.6       0.33        3004           3004
FANet-34              76.3        58           17             17
FANet-34+Temp         76.7        58           17             17
FANet-18              75.0        72           14             14
FANet-18+Temp         75.5        72           14             14
TABLE VI: Video semantic segmentation on Cityscapes. "+Temp" indicates FANet with spatial-temporal attention (τ=2). Avg RT is the average per-frame running time, and MaxLatency is the maximum per-frame running time.

IV-E Video Semantic Segmentation

In this part, we evaluate our method for video semantic segmentation on the challenging Cityscapes dataset [9]. Without significantly increasing the computational cost, our method effectively captures both spatial and temporal contextual information to achieve better accuracy, and outperforms previous methods with much lower latency. In Table VI, we compare our method with recent state-of-the-art approaches for video semantic segmentation. Compared to the image segmentation baselines FANet-18 and FANet-34, our spatial-temporal versions FANet-18+Temp and FANet-34+Temp improve accuracy at the same computational cost. We also see that most existing methods fail to achieve real-time speed (≥30 fps), apart from DVSNet, which has much lower accuracy than ours. Methods like Clockwork [42] and DFF [57] save overall computation but suffer from high latency due to the heavy computation at keyframes. PEARL [20] and Netwarp [12] achieve state-of-the-art accuracy at the cost of very low speed and high latency. In contrast, FANet-18+Temp and FANet-34+Temp achieve state-of-the-art accuracy at much faster speeds: FANet-18+Temp achieves more than 200× better efficiency than Netwarp [12], and FANet-34+Temp outperforms PEARL [20] at a 40× faster speed.

V Conclusion

We have proposed a novel Fast Attention Network for real-time semantic segmentation. In the network, we introduce fast attention to efficiently capture contextual information from feature maps. We further extend the fast attention to spatial-temporal context, and apply our models to achieve low-latency video semantic segmentation. To ensure high-resolution input with high efficiency, we also propose to apply spatial reduction to the intermediate feature stages. As a result, our model is enhanced with both rich contextual information and high-resolution details, while keeping a real-time speed. Extensive experiments on multiple datasets demonstrate the efficiency and effectiveness of our method.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. on PAMI. Cited by: §II, §IV-D, TABLE IV, TABLE V.
  • [2] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla (2008) Segmentation and recognition using structure from motion point clouds. In ECCV, Cited by: §IV-A, §IV-D.
  • [3] H. Caesar, J. Uijlings, and V. Ferrari (2018) Coco-stuff: thing and stuff classes in context. In CVPR, Cited by: §I, §IV-A, §IV-D.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. on PAMI. Cited by: §I, §II, §III-C, TABLE V.
  • [5] L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille (2016) Attention to scale: scale-aware semantic image segmentation. In CVPR, Cited by: §II.
  • [6] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §II, §III-C.
  • [7] W. Chen, X. Gong, X. Liu, Q. Zhang, Y. Li, and Z. Wang (2019) FasterSeg: searching for faster real-time semantic segmentation. In ICLR, Cited by: §II.
  • [8] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018) A^2-Nets: double attention networks. In NIPS, Cited by: §II.
  • [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR, Cited by: §I, §IV-A, §IV-D, §IV-E.
  • [10] H. Ding, X. Jiang, B. Shuai, A. Qun Liu, and G. Wang (2018) Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, Cited by: §II.
  • [11] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu (2019) Dual attention network for scene segmentation. In CVPR, Cited by: §I, §III-A, TABLE III.
  • [12] R. Gadde, V. Jampani, and P. V. Gehler (2017) Semantic video cnns through representation warping. In CVPR, Cited by: §II, §IV-E, TABLE VI.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research. Cited by: §I.
  • [14] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao (2019) Adaptive pyramid context network for semantic segmentation. In CVPR, Cited by: §II.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §III-B, §IV-B.
  • [16] P. Hu, F. Caba, O. Wang, Z. Lin, S. Sclaroff, and F. Perazzi (2020) Temporally distributed networks for fast video semantic segmentation. In CVPR, Cited by: §II.
  • [17] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV, Cited by: §II, §III-A.
  • [18] S. Jain, X. Wang, and J. E. Gonzalez (2019) Accel: a corrective fusion network for efficient semantic segmentation on video. In CVPR, Cited by: §II, TABLE VI.
  • [19] W. Jiang, Y. Wu, L. Guan, and J. Zhao (2019) DFNet: semantic segmentation on panoramic images with dynamic loss weights and residual fusion block. In ICRA, Cited by: §I.
  • [20] X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, et al. (2017) Video scene parsing with predictive feature learning. In ICCV, Cited by: §II, §IV-E.
  • [21] I. Kostavelis and A. Gasteratos (2015) Semantic mapping for mobile robotics tasks: a survey. Robotics and Autonomous Systems. Cited by: §I.
  • [22] I. Krešo, J. Krapac, and S. Šegvić (2020) Efficient ladder-style densenets for semantic segmentation of large images. IEEE Trans. on ITS. Cited by: §II.
  • [23] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu (2019) Expectation-maximization attention networks for semantic segmentation. In ICCV, Cited by: §II, §III-A.
  • [24] X. Li, Z. Liu, P. Luo, C. Change Loy, and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In CVPR, Cited by: §II.
  • [25] Y. Li, J. Shi, and D. Lin (2018) Low-latency video semantic segmentation. In CVPR, Cited by: §II, TABLE VI.
  • [26] G. Lin, A. Milan, C. Shen, and I. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In CVPR, Cited by: §II.
  • [27] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §II, TABLE V.
  • [28] D. Marin, Z. He, P. Vajda, P. Chatterjee, S. Tsai, F. Yang, and Y. Boykov (2019) Efficient segmentation: learning downsampling near semantic boundaries. In ICCV, Cited by: §I, §III-C.
  • [29] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi (2018) Espnet: efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, Cited by: §I.
  • [30] A. Meyer, N. O. Salscheider, P. F. Orzechowski, and C. Stiller (2018) Deep semantic lane segmentation for mapless driving. In IROS, Cited by: §I.
  • [31] A. Milan, T. Pham, K. Vijay, D. Morrison, A. W. Tow, L. Liu, J. Erskine, R. Grinover, A. Gurman, T. Hunn, et al. (2018) Semantic segmentation from limited training data. In ICRA, Cited by: §I.
  • [32] R. Mottaghi, X. Chen, X. Liu, N. Cho, S. Lee, S. Fidler, R. Urtasun, and A. Yuille (2014) The role of context for object detection and semantic segmentation in the wild. In CVPR, Cited by: §I.
  • [33] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen, and I. Reid (2019) Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In ICRA, Cited by: §I, §I.
  • [34] D. Nilsson and C. Sminchisescu (2018) Semantic video segmentation by gated recurrent flow propagation. In CVPR, Cited by: §II.
  • [35] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic (2019) In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. In CVPR, Cited by: §I, §I, §IV-D, TABLE IV.
  • [36] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147. Cited by: §I, §III-C, TABLE V.
  • [37] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, Cited by: §II, §III-C.
  • [38] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe (2017) Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, Cited by: §II.
  • [39] P. Purkait, C. Zach, and I. Reid (2019) Seeing behind things: extending semantic segmentation to occluded regions. In IROS, Cited by: §II.
  • [40] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo (2017) Efficient convnet for real-time semantic segmentation. In IEEE Intelligent Vehicles Symposium, Cited by: TABLE IV.
  • [41] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §II.
  • [42] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell (2016) Clockwork convnets for video semantic segmentation. In ECCV, Cited by: §II, §IV-E, TABLE VI.
  • [43] E. Stenborg, C. Toft, and L. Hammarstrand (2018) Long-term visual localization using semantically segmented images. In ICRA, Cited by: §I.
  • [44] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers (2018) Normalized cut loss for weakly-supervised CNN segmentation. In CVPR, Cited by: §I.
  • [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §I, §II, §III-A, §IV-C, TABLE I, TABLE III.
  • [46] K. Wada, K. Okada, and M. Inaba (2019) Joint learning of instance and semantic segmentation for robotic pick-and-place with heavy occlusions in clutter. In ICRA, Cited by: §I.
  • [47] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §I, §II, §III-A, §III-A, §III-D, TABLE I.
  • [48] Y. Xu, T. Fu, H. Yang, and C. Lee (2018) Dynamic video segmentation network. In CVPR, Cited by: §II, TABLE VI.
  • [49] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Bisenet: bilateral segmentation network for real-time semantic segmentation. In ECCV, Cited by: §I, §III-C, §IV-D, §IV-D, TABLE IV, TABLE V, TABLE V.
  • [50] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. ICLR. Cited by: §II.
  • [51] K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu (2018) Compact generalized non-local network. In NIPS, Cited by: §II.
  • [52] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal (2018) Context encoding for semantic segmentation. In CVPR, Cited by: §I, §III-C.
  • [53] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun (2018) ExFuse: enhancing feature fusion for semantic segmentation. In ECCV, Cited by: §II.
  • [54] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018) Icnet for real-time semantic segmentation on high-resolution images. In ECCV, Cited by: §I, §I, §IV-A, §IV-D, §IV-D, TABLE IV, TABLE V, TABLE V.
  • [55] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: §I, §II, §III-C.
  • [56] W. Zhou, S. Worrall, A. Zyner, and E. Nebot (2018) Automated process for incorporating drivable path into real-time semantic segmentation. In ICRA, Cited by: §I.
  • [57] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei (2017) Deep feature flow for video recognition. In CVPR, Cited by: §II, §IV-E, TABLE VI.
  • [58] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai (2019) Asymmetric non-local neural networks for semantic segmentation. In ICCV, Cited by: §II, §III-A.
  • [59] J. Zhuang, J. Yang, L. Gu, and N. Dvornek (2019) ShelfNet for fast semantic segmentation. In ICCV Workshops, Cited by: §IV-D, TABLE IV.