1 Introduction
Semantic segmentation can be viewed as pixel-wise classification: it assigns a pre-defined category to each pixel in an image. The task has many potential applications, such as autonomous driving and image editing. Although many works [11, 6, 9] have made great progress in segmentation accuracy, their network size, inference speed, computation and memory cost limit their practical use. Therefore, it is essential to develop light-weighted, efficient and real-time methods for semantic segmentation.
Figure 1 (caption fragment): ... and Caffe, including ICNet [21], CGNet [18], ERFNet [15], SegNet [1], ENet [12] and LEDNet [17].
Among these properties, light weight could be the most essential one, because a smaller network can lead to faster speed and makes it easier to be efficient in computation and memory cost. Fewer learnable parameters mean that the network has less redundancy and that its structure makes its parameters more effective. Practically, a smaller model size is more favorable for cellphone apps. Designing blocks with proper factorized convolution to construct an effective light-weighted network could be a more scalable approach, and could make it easier to balance accuracy, network size, speed and efficiency. In this way, many works [12, 19, 15, 17, 10] achieve promising results and show their potential for many applications. However, some of these works [15, 17] do not balance factorized convolution and long-range features in a proper way.
The recent study [16] shows the powerful potential of the attention mechanism in computer vision. Non-local methods are employed to model long-range dependencies in semantic segmentation. However, modeling relationships between every pair of positions can be rather heavy in computation and memory cost. Some works try to develop factorized [9] or reduced [22] non-local methods to make it more efficient. Since efficient non-local or positional attention is not yet well developed for light-weighted and efficient semantic segmentation, our approach develops a powerful reduced non-local method to model long-range dependencies and global feature selection efficiently.
In our work, we develop a light-weighted factorized convolution block (FCB) (Fig. 4) to build a feature extraction network (encoder), which handles long-range and short-range features with a suitable form of factorized convolution for each. We also propose a powerful reduced non-local module with regional singular vectors, which models long-range dependencies and global feature selection on the encoder features to enhance the segmentation results. Our contributions are summarized as follows:
- We propose a factorized convolution block (FCB) to build a very light-weighted, powerful and efficient feature extraction network by dealing with long-range and short-range features in a more proper way.
- The proposed efficient reduced non-local module (SVN) utilizes regional singular vectors to produce more reduced and representative features to model long-range dependencies and global feature selection.
2 Related Work
Light-weighted and Real-time Segmentation. Real-time semantic segmentation approaches aim to generate high-quality predictions in limited time, usually under resource constraints or in mobile applications. Light-weight models save storage space and potentially have lower computation and faster speed. Therefore, developing light-weight segmentation is a promising way to get a good trade-off for real-time semantic segmentation [18, 17, 15]. Our model follows the light-weight style to achieve real-time segmentation.
Factorized Convolution. Standard convolution adopts a 2D convolution kernel with full connections between input and output channels, so it learns both local relations and channel interactions. However, its large parameter size and redundancy can be a problem for real-time tasks under resource constraints. Xception [4] and MobileNet [7] adopt depthwise separable convolution, which consists of a depthwise convolution followed by a pointwise convolution. The depthwise convolution learns local relations within each channel and the pointwise convolution learns the interactions between channels, which reduces parameters and computation. ShuffleNet [20] adopts a split-shuffle strategy to reduce parameters and computation: the standard convolution is split into groups of channels, and a channel shuffle operation helps information flow between groups. Factorizing the 2D convolution kernel into a combination of two 1D convolution kernels is another way to reduce parameter size and computation cost. Many light-weight approaches [17, 15] take this route and achieve promising performance. In this paper, our factorized convolution block (FCB) utilizes these strategies to build a light-weighted, efficient and powerful structure.
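To make the savings of these factorizations concrete, the following minimal sketch counts the weights of each variant for one layer. The helper names, the choice of 64 channels and a 3x3 kernel, and the omission of biases are illustrative assumptions, not settings taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out


def factorized_1d_params(c_in, c_out, k):
    """k x 1 convolution (c_in -> c_out) followed by a 1 x k convolution (c_out -> c_out)."""
    return c_in * c_out * k + c_out * c_out * k


def grouped_conv_params(c_in, c_out, k, groups):
    """k x k convolution split into `groups` independent channel groups."""
    return groups * (c_in // groups) * (c_out // groups) * k * k


# Hypothetical layer: 64 -> 64 channels with a 3 x 3 kernel.
channels, kernel = 64, 3
print("standard:           ", standard_conv_params(channels, channels, kernel))        # 36864
print("depthwise separable:", depthwise_separable_params(channels, channels, kernel))  # 4672
print("1D factorized:      ", factorized_1d_params(channels, channels, kernel))        # 24576
print("grouped (2 groups): ", grouped_conv_params(channels, channels, kernel, 2))      # 18432
```

Under these assumptions the depthwise separable layer is by far the smallest, while the 1D factorization keeps full channel connectivity and saves one third of the weights.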
Attention Models. Attention modules model long-range dependencies and have been applied in many computer vision tasks. Position attention and channel attention are two important mechanisms. Channel attention modules are widely applied in semantic segmentation, including some light-weighted approaches [18, 10]. Position attention and non-local methods have a higher computational complexity. Although some works [9, 8, 22] try to develop more efficient non-local methods, position attention and non-local methods are rarely explored in light-weighted semantic segmentation.
3 Method
We introduce the preliminaries related to our SVN module in Section 3.1, the network architecture in Section 3.2, the proposed FCB unit in Section 3.3 and the SVN module in Section 3.4.
3.1 Preliminaries
Before introducing the proposed method, we first introduce the singular value decomposition and the non-local method, which are related to our SVN module in Section 3.4.
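Singular Value Decomposition and Approximation. Given a real matrix $A \in \mathbb{R}^{m \times n}$ of rank $r$, there exist two orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ satisfying Equation 1,
$$A = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i u_i v_i^{\top}, \qquad (1)$$
where $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ on its diagonal, and $u_i$ and $v_i$ are the $i$-th columns of $U$ and $V$. If we choose $k < r$, we can get
$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{\top}, \qquad (2)$$
where $A_k$ approximates the original $A$, because the larger singular values and their singular vectors keep most of the information of $A$. The singular vectors corresponding to larger singular values contain more information about the matrix, especially the dominant singular vectors $u_1$ and $v_1$. We can calculate the dominant singular vectors efficiently by power iteration (Algorithm 1). Based on Equation 1, rotating (reordering) the columns of $A$ does not change $U$ and $\Sigma$, that is, the left singular vectors and their singular values.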
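Algorithm 1 itself is not reproduced in this text, so the following is only a minimal NumPy sketch of power iteration for the dominant left singular vector, under stated assumptions: the function name, the random initialization, the seed and the test matrix are all hypothetical. It also checks the column-reordering property mentioned above.

```python
import numpy as np


def dominant_left_singular_vector(a, num_iters=2, seed=0):
    """Approximate the dominant left singular vector of `a` by power iteration on a @ a.T."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(a.shape[0])      # random start (an assumption of this sketch)
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        u = a @ (a.T @ u)                    # one step on a a^T without forming it explicitly
        u /= np.linalg.norm(u)
    return u


# Hypothetical test matrix with one clearly dominant singular direction.
rng = np.random.default_rng(42)
u_true = rng.standard_normal(8)
u_true /= np.linalg.norm(u_true)
v_true = rng.standard_normal(20)
v_true /= np.linalg.norm(v_true)
a = 10.0 * np.outer(u_true, v_true) + 0.1 * rng.standard_normal((8, 20))

u_power = dominant_left_singular_vector(a, num_iters=2)
u_svd = np.linalg.svd(a)[0][:, 0]
alignment = abs(u_power @ u_svd)             # singular vectors are defined up to sign

# Reordering the columns of `a` leaves the left singular vectors unchanged.
perm = rng.permutation(a.shape[1])
u_perm = dominant_left_singular_vector(a[:, perm], num_iters=2)

print("alignment with full SVD:", alignment)
print("same vector after column reordering:", np.allclose(u_power, u_perm))
```

Each iteration costs two matrix-vector products, which is why a small number of iterations is cheap compared with a full decomposition.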
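Non-local Module. The non-local module [16] models global feature relationships. We illustrate it in the form of Query-Key-Value. It can be formulated as:
$$\tilde{q} = \frac{1}{C(q, K)} \sum_{i=1}^{S} f(q, k_i)\, v_i, \qquad (3)$$
where $q \in Q$, $q$ is a Query, $\tilde{q}$ is the corresponding output of $q$, $k_i \in K$, $k_i$ is a Key, $v_i \in V$, $v_i$ is a Value, $f$ is the measure of similarity between $q$ and $k_i$, $C$ is a normalization function, and $Q$, $K$ and $V$ are the collections of the Queries, the Keys and the Values, respectively. $S$ is the number of Keys and Values, and a smaller $S$ means less computation.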
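The Query-Key-Value form can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation: the function name is hypothetical, the similarity is taken to be a dot product, and the normalization is assumed to be the number of Keys, which is one common choice.

```python
import numpy as np


def non_local(queries, keys, values):
    """Query-Key-Value aggregation.

    queries: (N, C), keys: (S, C), values: (S, C) -> output: (N, C).
    """
    similarity = queries @ keys.T            # dot-product similarity, shape (N, S)
    normalizer = keys.shape[0]               # assumed normalization: the number of Keys S
    return (similarity / normalizer) @ values


# Tiny hand-checkable example: one Query, two Keys and Values, two channels.
q = np.array([[1.0, 0.0]])
k = np.array([[1.0, 0.0], [0.0, 1.0]])
v = np.array([[2.0, 0.0], [0.0, 4.0]])
print(non_local(q, k, v))                    # [[1. 0.]]
```

Both matrix products cost on the order of N * S * C multiply-adds, so shrinking S from the number of positions to a handful of representative vectors reduces the cost proportionally.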
3.2 Overview of the Network Architecture
In this section, we introduce our network architecture. Our LRNNet consists of a feature extraction network constructed by our proposed factorized convolution block (Fig. 4(c)) and a pixel classifier enhanced by our SVN module (Fig. 2).
We form our encoder in a three-stage ResNet style (Fig. 2(a)). We adopt the same transition between stages as ENet [12], using a downsampling unit. The core components are our FCB units, which provide light-weighted and efficient feature extraction. For a better comparison with other light-weight factorized convolution blocks, we adopt the same dilation series for the FCBs in the encoder as LEDNet [17] after the last downsampling unit (details in the supplemental material). Our decoder (Fig. 2(b)) is a pixel-wise classifier enhanced by our SVN module.
Figure 3: Visual examples. Columns from left to right: image, ground truth, SS-nbt [17], Model A, Model B, Model C.
3.3 Factorized Convolution Block
Designing a factorized convolution block is a popular way to achieve light-weighted segmentation. Techniques such as dilated convolution for enlarging the receptive field are also important for semantic segmentation models. Our factorized convolution block is inspired by the observation that a 1D factorized kernel could be more suitable for spatially less informative features than for spatially informative ones. Consider the situation where a $k \times k$ convolution kernel (for example $k = 3$) is replaced by a $k \times 1$ convolution kernel followed by a $1 \times k$ convolution kernel, which has the same receptive field and fewer parameters. Neglecting the information lost in passing through the activation function between the two 1D convolution kernels, this is a rank-1 approximation of the $k \times k$ convolution kernel. Assume that different spatial semantic regions have different features. If the dilation of the convolution kernel is one or small, the kernel is unlikely to lie across multiple different spatial semantic regions, so the received features are simple and less informative and the rank-1 approximation is more likely to be effective, and vice versa.
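The rank-1 claim can be checked numerically. The sketch below, which assumes a single channel, no activation between the two 1D kernels and a 3x3 kernel size, shows that applying a 3x1 kernel followed by a 1x3 kernel equals applying their 3x3 outer product, and measures how much a generic 3x3 kernel loses under its best rank-1 approximation. All names are illustrative.

```python
import numpy as np


def correlate2d_valid(x, kernel):
    """Plain 'valid' 2D cross-correlation of a single-channel map with a kernel."""
    kh, kw = kernel.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
col = rng.standard_normal((3, 1))            # 3 x 1 kernel
row = rng.standard_normal((1, 3))            # 1 x 3 kernel

# Two 1D kernels in sequence (no activation in between) ...
two_step = correlate2d_valid(correlate2d_valid(x, col), row)
# ... equal a single 3 x 3 kernel given by their outer product, which has rank 1.
outer_kernel = col @ row
one_step = correlate2d_valid(x, outer_kernel)
print("equivalent:", np.allclose(two_step, one_step))
print("rank of outer-product kernel:", np.linalg.matrix_rank(outer_kernel))

# A generic 3 x 3 kernel has rank 3; its best rank-1 approximation drops the
# two smaller singular values, which is exactly the approximation error.
full_kernel = rng.standard_normal((3, 3))
u, s, vt = np.linalg.svd(full_kernel)
rank1_kernel = s[0] * np.outer(u[:, 0], vt[0])
error = np.linalg.norm(full_kernel - rank1_kernel)
print("rank-1 error:", error, "expected:", np.sqrt(s[1] ** 2 + s[2] ** 2))
```

The factorized pair uses 6 spatial weights instead of 9, at the price of being restricted to rank-1 spatial patterns.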
Therefore, a convolution kernel with large dilation receives complex, spatially informative long-range features (features separated by a large dilation), and it needs more parameters in space. Meanwhile, a convolution kernel with small dilation receives simple, less informative short-range features, and fewer parameters in space are enough. Our FCB (Fig. 4(c)) first deals with short-range, spatially less informative features using 1D factorized convolution in two split groups. Since this convolution is fully connected in channel, factorizing it reduces parameters and computation considerably. To enlarge the receptive field, our FCB then uses a 2D kernel with larger dilation, implemented as a depthwise separable convolution to reduce parameters and computation. A channel shuffle operation is placed last because there is a residual connection after the point-wise convolution. In total, FCB uses a low-rank approximation (1D kernels) in space for short-range features and a depth-wise spatial 2D dilated kernel for long-range features, which leads to more light-weight, efficient and powerful feature extraction.
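The channel shuffle step mentioned above can be sketched as follows. This is the usual reshape-transpose formulation known from ShuffleNet, written in NumPy for a single (C, H, W) feature map; the function name and the toy input are assumptions for illustration, and the rest of the FCB is not reproduced here.

```python
import numpy as np


def channel_shuffle(x, groups):
    """Interleave the channels of a (C, H, W) feature map across `groups` groups."""
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    x = x.reshape(groups, c // groups, h, w)   # split channels into groups
    x = x.transpose(1, 0, 2, 3)                # swap group axis and within-group axis
    return x.reshape(c, h, w)                  # flatten back: channels are now interleaved


# Toy input whose channel values equal their original channel index.
feature = np.arange(6, dtype=float).reshape(6, 1, 1)
shuffled = channel_shuffle(feature, groups=2)
print(shuffled[:, 0, 0])                       # [0. 3. 1. 4. 2. 5.]
```

After the shuffle, each group of consecutive channels contains channels from both original groups, so a following grouped operation mixes information across the split branches.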
Compared with other factorized convolution blocks (Fig. 4), our FCB has a more elaborate design, fewer parameters, less computation and faster speed, as shown in the experiments.
3.4 SVN Module
A light-weighted model can hardly achieve feature extraction as powerful as that of a big network. Therefore, producing reduced, robust and representative features and combining them into non-local modules is an essential way to explore an efficient non-local mechanism for light-weighted semantic segmentation. We revisit the non-local mechanism in the form of Query-Key-Value and claim that using reduced and representative features as the Keys and the Values can reduce computation and memory while maintaining effectiveness.
Our SVN module is presented in Fig. 2(b). We reduce the cost in two ways: forming a bottleneck with Conv1 and Conv2 to reduce the channels for the non-local operation, and replacing the Keys and Values with their regional dominant singular vectors. The proposed SVN consists of two branches. The lower branch is a residual connection from the input. The upper branch is the bottleneck of our reduced non-local operation. In the bottleneck, we divide the feature maps into $S$ spatial sub-regions (for example, an $8 \times 8$ grid gives $S = 64$). For each sub-region, we flatten it into a matrix whose columns are its spatial positions, then use power iteration (Algorithm 1) to calculate its left dominant singular vector (of size $C \times 1$, where $C$ is the number of channels) efficiently. As mentioned in Section 3.1, rotating columns does not affect the left orthogonal matrix, so the left dominant singular vector is agnostic to the way of flattening; this property is similar to pooling. The regional dominant singular vectors are then used as the Keys $K$ and Values $V$ for the non-local operation, where a smaller $S$ means less computation, and the Queries $Q$ are positional vectors (of size $C \times 1$) from the feature maps before dominant singular vector extraction. To enhance the reduced non-local module, we also perform multi-scale region extraction and gather dominant singular vectors from different scales as the Keys and Values (see Fig. 2(b) and Equation 4).
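$$\tilde{q} = \frac{1}{C(q, D)} \sum_{d_i \in D} f(q, d_i)\, d_i, \qquad D = \bigcup_{s} D_s, \qquad (4)$$
where $\tilde{q}$ is the output of SVN, the $D_s$ are collections of regional dominant singular vectors from their related scales $s$, the regional dominant vectors $d_i$ are used as both the Keys ($k_i = d_i$) and Values ($v_i = d_i$), $q$ is a Query from the feature maps before dominant singular vector extraction, and our SVN uses the dot product as the similarity $f$.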
As illustrated above, the SVN module forms a reduced and effective non-local operation through its bottleneck structure and its reduced, representative regional dominant singular vectors. The regional dominant singular vector could be the most representative vector for a region of the feature maps. Since some works [22] utilize pooling to form the Keys and Values, we compare pooling, single-scale and multi-scale region extraction in our structure in the ablation experiments.
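The reduced non-local operation described in this section can be sketched in NumPy as below. This is a simplified illustration under several assumptions: the Conv1/Conv2 bottleneck and the residual branch are omitted, the normalization is taken to be the number of Keys, the region grids divide the feature map evenly, and all function names, the random start vector and the toy feature size are hypothetical.

```python
import numpy as np


def dominant_left_singular_vector(a, num_iters=2, seed=0):
    """Power iteration on a @ a.T; returns a unit vector of length a.shape[0]."""
    u = np.random.default_rng(seed).standard_normal(a.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        u = a @ (a.T @ u)
        u /= np.linalg.norm(u) + 1e-12        # guard against an all-zero region
    return u


def regional_singular_vectors(feat, grid, num_iters=2):
    """feat: (C, H, W). Returns one dominant vector per region, shape (grid * grid, C)."""
    c, h, w = feat.shape
    rh, rw = h // grid, w // grid             # assumes grid divides H and W
    vectors = []
    for i in range(grid):
        for j in range(grid):
            region = feat[:, i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            vectors.append(dominant_left_singular_vector(region.reshape(c, -1), num_iters))
    return np.stack(vectors)


def reduced_non_local(feat, grids=(8, 4)):
    """Every position is a Query; regional dominant vectors serve as Keys and Values."""
    c, h, w = feat.shape
    queries = feat.reshape(c, -1).T                                              # (H * W, C)
    kv = np.concatenate([regional_singular_vectors(feat, g) for g in grids])    # (S, C)
    similarity = queries @ kv.T                                                  # dot product
    out = (similarity / kv.shape[0]) @ kv                                        # assumed normalization by S
    return out.T.reshape(c, h, w), kv


feat = np.random.default_rng(1).standard_normal((32, 16, 32))
out, kv = reduced_non_local(feat, grids=(8, 4))
print(out.shape, kv.shape)                    # (32, 16, 32) (80, 32)
```

With grids of 8x8 and 4x4 the sketch gathers 64 + 16 = 80 Keys and Values, matching the multi-scale setting described in the text. Note that a singular vector is only defined up to sign, so a practical implementation would need a consistent sign convention; this sketch does not address that.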
4 Experiments
We conduct experiments to demonstrate the performance of our FCB block and SVN module, and the state-of-the-art trade-off among light weight, accuracy and efficiency of our proposed segmentation architecture. For ablation, we denote our LRNNet without SVN, with single-scale SVN and with multi-scale SVN as Model A, Model B and Model C, respectively.
4.1 Datasets and Settings
The Cityscapes dataset [5] consists of 5000 street-scene images of resolution $2048 \times 1024$ with high-quality pixel-level annotations, with 2975, 500 and 1525 images in the training, validation and test sets, respectively. Following the light-weighted approaches [17, 15], we adopt $512 \times 1024$ subsampled images for testing. The CamVid dataset [2] contains 367 training, 101 validation and 233 testing images with a resolution of $960 \times 720$, but we follow the setting of [12, 1] and use $480 \times 360$ images for training and testing.
We implement all our methods using PyTorch [13] on a single GTX 1080Ti. Following [3], we employ a poly learning rate policy with a base learning rate of 0.01. The batch size is set to 8 for all training. For CamVid testing and for evaluation on the Cityscapes validation set, we train for 250K iterations to study our network quickly. For the Cityscapes test set, we train our model only on the fine annotations, for 520K iterations.
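The poly learning rate policy is commonly written as the base rate multiplied by (1 - iteration / max_iterations) raised to a power. The sketch below uses the base rate and iteration count stated above; the exponent 0.9 is a frequently used value and is an assumption here, since the text does not state it, and the function name is illustrative.

```python
def poly_lr(iteration, max_iterations, base_lr=0.01, power=0.9):
    """Polynomial ("poly") learning-rate decay; `power` = 0.9 is an assumed value."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


max_iterations = 250_000
for iteration in (0, 62_500, 125_000, 187_500, 250_000):
    print(iteration, poly_lr(iteration, max_iterations))
```

The rate starts at the base value and decays smoothly to zero at the final iteration.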
4.2 Ablation Study for FCB
Among the other factorized convolution blocks shown in Figure 4, ERFNet [15] and LEDNet [17] simply use 1D factorized kernels to deal with both short-range and long-range (dilated) features. As analyzed in Section 3.3, our FCB deals with short-range features using 1D factorized kernels and with long-range features using a 2D depth-wise kernel. We compare our FCB (Model A) with SS-nbt from LEDNet [17] in the same architecture. As shown in Tables 1 and 4, our FCB (Model A) achieves better accuracy with lower parameter size, computation and inference time than SS-nbt, which uses 1D factorized kernels for both short-range and long-range features. Visual examples are in Fig. 3.
4.3 Ablation Study for SVN
Table 2 shows the performance of different sub-region choices for our SVN module and of the standard non-local module. Balancing accuracy, speed and computation cost, we choose 64 ($8 \times 8$) sub-regions for single-scale SVN (Model B) and $8 \times 8 + 4 \times 4$ sub-regions for multi-scale SVN (Model C).
We analyze the efficiency of our SVN. Since Algorithm 1 converges efficiently, we set the number of iterations $T$ to 2, which gives a computation complexity of $O(T \cdot C \cdot H \cdot W)$. The features in our bottleneck on Cityscapes have size $C \times H \times W$; the standard non-local operation, neglecting the convolutions, costs 4.0 GFLOPS and its complexity is $O(C H^2 W^2)$. For our reduced non-local operation, the complexity is $O(C H W S)$, where $S$ (with $S \ll HW$) is the number of the Keys and Values. For single-scale SVN (Model B) and pooling, we divide the feature maps into 64 sub-regions and the computation is 32 MFLOPS. For multi-scale SVN, we divide the feature maps into 64 and 16 sub-regions, and the computation is 40 MFLOPS. The cost of power iteration in single-scale and multi-scale SVN is 1 MFLOPS and 2 MFLOPS, respectively.
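The reported costs can be reproduced with simple arithmetic. The sketch below rests on assumptions that are not stated explicitly in this text: a bottleneck feature map of 32 x 64 x 128 (consistent with a 512 x 1024 input downsampled by 8), one operation per multiply-add, two matrix products per non-local operation, two matrix-vector products per power-iteration step, and binary prefixes (2^30 and 2^20). These values were chosen because they match the figures quoted above.

```python
C, H, W = 32, 64, 128       # assumed bottleneck feature size (channels, height, width)
T = 2                       # power-iteration steps, as stated in the text
N = H * W                   # number of Queries (spatial positions)


def non_local_ops(num_queries, num_keys, channels):
    """Similarity product plus aggregation product, each N * S * C multiply-adds."""
    return 2 * num_queries * num_keys * channels


def power_iteration_ops(channels, positions, steps):
    """Two matrix-vector products per step, summed over all regions of one scale."""
    return 2 * steps * channels * positions


GI, MI = 2 ** 30, 2 ** 20   # assumed binary prefixes

standard = non_local_ops(N, N, C)
single_scale = non_local_ops(N, 64, C)
multi_scale = non_local_ops(N, 64 + 16, C)
power_single = power_iteration_ops(C, N, T)
power_multi = 2 * power_single              # two scales, each covering the whole map once

print("standard non-local:", standard / GI, "G")        # 4.0
print("single-scale SVN:  ", single_scale / MI, "M")    # 32.0
print("multi-scale SVN:   ", multi_scale / MI, "M")     # 40.0
print("power iteration:   ", power_single / MI, "M and", power_multi / MI, "M")  # 1.0 and 2.0
```

Under these assumptions, replacing 8192 Keys and Values with 64 or 80 regional vectors cuts the cost of the non-local operation by roughly two orders of magnitude, and power iteration adds only a few percent on top.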
We compare regional dominant singular vectors (Model B) with pooled features at a single scale ($8 \times 8$) to show the effectiveness of dominant singular vectors. Results on the Cityscapes validation set are shown in Table 1. The comparison of single-scale SVN (Model B) with pooling (max or average) shows that regional singular vectors are effective for our network with a light-weighted encoder, with a 0.5 mIoU improvement and an additional 0.09 GFLOPS, while pooling cannot provide representative features for a light-weighted network. The multi-scale SVN (Model C) further improves the result to 71.4% mIoU with a small cost in inference time and computation.
4.4 Comparison with Other Methods
We compare our LRNNet with other light-weighted methods on the Cityscapes and CamVid test sets in terms of parameter size, accuracy, speed and computation. We only report speeds measured on open-source deep-learning frameworks such as PyTorch, TensorFlow and Caffe, because these have comparable implementation performance, whereas there is a large gap to the non-open-source frameworks used by some works [10, 19] (details are in the supplemental material). Results are shown in Tables 3 and 4. "-" indicates that the speed was not obtained with an open-source deep-learning framework or was not provided. Our network constructed from FCBs (Model A) achieves 70.6% mIoU and 76.5 FPS on the Cityscapes test set with only 0.67M parameters and 8.48 GFLOPS, which is more light-weighted and efficient than ERFNet [15] and LEDNet [17], with better accuracy. With single-scale (Model B) and multi-scale SVN (Model C), LRNNet achieves 71.6% and 72.2% mIoU on the Cityscapes test set, respectively, with a small cost in speed and efficiency. Our LRNNet with multi-scale SVN achieves 69.2% mIoU with only 0.68M parameters on the CamVid test set. All results show the state-of-the-art trade-off among parameter size, speed, computation and accuracy of our LRNNet. A visual comparison is given in Fig. 1.
5 Conclusion
We have proposed LRNNet for real-time semantic segmentation. The proposed FCB unit explores a proper form of factorized convolution block to deal with short-range and long-range features, which provides light-weighted, efficient and powerful feature extraction for the encoder of our LRNNet. Our SVN module utilizes regional dominant singular vectors to construct an efficient reduced non-local operation, which enhances the decoder at a very low cost. Extensive experimental results have validated our state-of-the-art trade-off in terms of parameter size, speed, computation and accuracy.
This paper is supported by NSFC (No.61772330, 61533012, 61876109), the pre-research project (No.61403120201), Shanghai Key Laboratory of Crime Scene Evidence (2017XCWZK01) and the Interdisciplinary Program of Shanghai Jiao Tong University (YG2019QNA09).
References
- [1] (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI 39 (12), pp. 2481–2495.
- [2] (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97.
- [3] (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587.
- [4] (2017) Xception: deep learning with depthwise separable convolutions. In CVPR.
- [5] The cityscapes dataset for semantic urban scene understanding. In CVPR.
- [6] (2019) Dual attention network for scene segmentation. In CVPR.
- [7] MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
- [8] (2019) Interlaced sparse self-attention for semantic segmentation. CoRR abs/1907.12273.
- [9] (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV.
- [10] Dfanet: deep feature aggregation for real-time semantic segmentation. In CVPR.
- [11] (2015) Fully convolutional networks for semantic segmentation. In CVPR.
- [12] (2016) ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147.
- [13] (2017) Automatic differentiation in pytorch.
- [14] (2017) Full-resolution residual networks for semantic segmentation in street scenes. In CVPR.
- [15] (2017) ERFNet: efficient residual factorized convnet for real-time semantic segmentation. IEEE TITS PP (99), pp. 1–10.
- [16] (2018) Non-local neural networks. In CVPR.
- [17] (2019) LEDNet: a lightweight encoder-decoder network for real-time semantic segmentation. arXiv preprint arXiv:1905.02423.
- [18] (2018) CGNet: A light-weight context guided network for semantic segmentation. CoRR abs/1811.08201.
- [19] (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In ECCV.
- [20] (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR.
- [21] (2018) ICNet for real-time semantic segmentation on high-resolution images. In ECCV.
- [22] (2019) Asymmetric non-local neural networks for semantic segmentation. In ICCV.