1 Introduction
Semantic segmentation can be viewed as pixel-wise classification: it assigns a predefined category to each pixel in an image. The task has many potential applications, such as autonomous driving and image editing. Although many works [11, 6, 9] have made great progress in the accuracy of image semantic segmentation, their network size, inference speed, computation and memory cost limit their practical applications. Therefore, it is essential to develop light-weighted, efficient and real-time methods for semantic segmentation.
Figure 1: [...] test set. A smaller bubble means fewer parameters. We compare LRNNet with methods implemented in open-source deep-learning frameworks, such as PyTorch [13] and Caffe, including ICNet [21], CGNet [18], ERFNet [15], SegNet [1], ENet [12] and LEDNet [17].
Among these properties, being lightweight could be the most essential one, because a smaller network can lead to faster speed and lower computation and memory cost. Fewer learnable parameters mean that the network has less redundancy and that its structure makes its parameters more effective. Practically, a smaller model size is more favorable for cellphone apps. Designing blocks with proper factorized convolution to construct an effective light-weighted network could be a more scalable approach and could make it easier to balance accuracy, network size, speed, and efficiency. In this way, many works [12, 19, 15, 17, 10] achieve promising results and show their potential for many applications. However, some of those works [15, 17] do not balance factorized convolution and long-range features in a proper way.
The recent study [16] shows the powerful potential of the attention mechanism in computer vision. Non-local methods are employed to model long-range dependencies in semantic segmentation [6]. However, modeling relationships between every pair of positions can be rather heavy in computation and memory cost. Some works develop factorized [9] or reduced [22] non-local methods to make the operation more efficient. Since efficient non-local or positional attention is not yet well developed for light-weighted and efficient semantic segmentation, our approach develops a powerful reduced non-local method to model long-range dependencies and global feature selection efficiently.
In our work, we develop a light-weighted factorized convolution block (FCB) (Fig. 4) to build a feature extraction network (encoder), which deals with long-range and short-range features with proper factorized convolution respectively, and we propose a powerful reduced non-local module with regional singular vectors to model long-range dependencies and global feature selection for the features from the encoder to enhance the segmentation results. Contributions are summarized as follows:

We propose a factorized convolution block (FCB) to build a very light-weighted, powerful and efficient feature extraction network by dealing with long-range and short-range features in a more proper way.

The proposed efficient reduced non-local module (SVN) utilizes regional singular vectors to produce more reduced and representative features to model long-range dependencies and global feature selection.
2 Related Work
Light-weighted and Real-time Segmentation. Real-time semantic segmentation approaches aim to generate high-quality predictions in limited time, usually under resource constraints or in mobile applications. Lightweight models save storage space and potentially have lower computation and faster speed. Therefore, developing lightweight segmentation is a promising way to get a good trade-off for real-time semantic segmentation [18, 17, 15]. Our model follows the lightweight style to achieve real-time segmentation.
Factorized Convolution. Standard convolution adopts a 2D convolution kernel to form a full connection between input and output channels, which learns local relations and channel interaction. However, it may suffer from large parameter size and redundancy for real-time tasks under resource constraints. Xception [4] and MobileNet [7] adopt depthwise separable convolution, which consists of a depthwise convolution followed by a pointwise convolution. The depthwise convolution learns local relations in every channel and the pointwise convolution learns the interaction between channels, which reduces parameters and computation. ShuffleNet [20] adopts a split-shuffle strategy to reduce parameters and computation. In this strategy, standard convolution is split into several groups of channels and a channel shuffle operation helps information flow between groups. Factorizing the 2D convolution kernel into a combination of two 1D convolution kernels is another way to reduce parameter size and computation cost. Many lightweight approaches [17, 15] take this route and achieve promising performance. In this paper, our factorized convolution block (FCB) utilizes these strategies to build a light-weighted, efficient and powerful structure.
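To make the parameter savings above concrete, the following minimal sketch counts the weights of the three convolution styles. The channel count of 64 and the kernel size of 3 are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    # Full k x k kernel connecting every input channel to every output channel.
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    # One k x k kernel per input channel, then a 1 x 1 pointwise convolution.
    return c_in * k * k + c_in * c_out


def factorized_1d_params(c_in, c_out, k):
    # A k x 1 convolution (c_in -> c_out) followed by a 1 x k one (c_out -> c_out).
    return c_in * c_out * k + c_out * c_out * k


# Hypothetical layer: 64 input channels, 64 output channels, 3 x 3 kernel.
c, k = 64, 3
print(standard_conv_params(c, c, k))        # 36864
print(depthwise_separable_params(c, c, k))  # 4672
print(factorized_1d_params(c, c, k))        # 24576
```

Under these assumptions the depthwise separable layer is the cheapest, while the 1D factorization keeps full channel connectivity at two thirds of the standard cost.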
Attention Models. Attention modules model long-range dependencies and have been applied in many computer vision tasks. Position attention and channel attention are two important mechanisms. Channel attention modules are widely applied in semantic segmentation [10], including some light-weighted approaches [18, 10]. Position attention or non-local methods have a higher computational complexity. Although some works [9, 8, 22] develop more efficient non-local methods, position attention or non-local methods are rarely explored in light-weighted semantic segmentation.
3 Methodology
We introduce the preliminary related to our SVN module in section 3.1, network architecture in section 3.2, the proposed FCB unit in section 3.3 and SVN module in section 3.4.
3.1 Preliminary
Before introducing the proposed method, we first introduce the singular value decomposition and the non-local method, which are related to our SVN module in Section 3.4.
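Singular Value Decomposition and Approximation. Given a real matrix $A \in \mathbb{R}^{m \times n}$, there exist two orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ satisfying Equation 1,
$$A = U \Sigma V^{T} \tag{1}$$
where $U = [u_1, \dots, u_m]$, $V = [v_1, \dots, v_n]$, and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix carrying the singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ with $r = \mathrm{rank}(A)$. If we choose $k < r$, we can get
$$A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^{T} \tag{2}$$
where $A_k$ approximates the original $A$, because the larger singular values and their singular vectors keep most of the information of $A$. The singular vectors corresponding to the larger singular values contain more information of the matrix, especially the dominant singular vectors. We can calculate the dominant singular vectors efficiently by power iteration (Algorithm 1). Based on Equation 1, rotating the columns of $A$ does not change $U$ or the singular values.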
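As an illustration of the power iteration mentioned above, here is a minimal pure-Python sketch that estimates the left dominant singular vector of a small matrix. It is a generic textbook power iteration on $A A^{T}$, not necessarily identical to the paper's Algorithm 1; the function names, the all-ones starting vector and the example matrices are assumptions made for this sketch.

```python
import math


def matvec(A, x):
    # Multiply matrix A (list of rows) by vector x.
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def normalize(x):
    # Assumes x is non-zero; a zero vector would need special handling.
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def dominant_left_singular_vector(A, T=2):
    """Estimate (u_1, sigma_1) of A with T power-iteration steps on A A^T."""
    At = transpose(A)
    u = normalize([1.0] * len(A))  # assumed starting vector
    for _ in range(T):
        v = matvec(At, u)          # v = A^T u
        u = normalize(matvec(A, v))  # u = A v / ||A v||
    sigma = math.sqrt(sum(x * x for x in matvec(At, u)))
    return u, sigma


# Rotating (cyclically shifting) the columns leaves the left vector unchanged.
A = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
A_rotated = [[0.0, 3.0, 1.0], [1.0, 1.0, 2.0]]
print(dominant_left_singular_vector(A, T=20))
print(dominant_left_singular_vector(A_rotated, T=20))
```

Both calls print the same vector up to floating-point rounding, which matches the statement that rotating the columns of $A$ does not change $U$ or the singular values.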
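Non-local Module. The non-local module [16] models global feature relationships. We illustrate it in the Query-Key-Value form. It can be formulated as:
$$y_i = \frac{1}{Z(q_i)} \sum_{j=1}^{S} f(q_i, k_j)\, v_j \tag{3}$$
where $i \in \{1, \dots, N\}$, $q_i$ is a Query, $y_i$ is the corresponding output of $q_i$, $j \in \{1, \dots, S\}$, $k_j$ is a Key, $v_j$ is a Value, $f(q_i, k_j)$ is the measure of similarity between $q_i$ and $k_j$, $Z$ is a normalization function, and $\mathcal{Q}$, $\mathcal{K}$ and $\mathcal{V}$ are the collections of the Queries, the Keys and the Values, respectively. A smaller $S$ means less computation.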
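The Query-Key-Value form can be sketched in a few lines of pure Python. This is only an illustration: the similarity is taken to be the dot product and the normalization is assumed to be division by the number of Keys $S$; the text does not fix the normalization function, so that choice and the toy vectors are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nonlocal_qkv(queries, keys, values):
    """For each query q_i, return y_i = (1/S) * sum_j dot(q_i, k_j) * v_j."""
    S = len(keys)  # assumed normalization: number of Keys
    dim = len(values[0])
    outputs = []
    for q in queries:
        weights = [dot(q, k) / S for k in keys]
        y = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(y)
    return outputs


# Toy example: N = 3 queries, S = 2 keys and values.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 2.0]]
values = [[1.0, 1.0], [0.0, 4.0]]
print(nonlocal_qkv(queries, keys, values))
# [[0.5, 0.5], [0.0, 4.0], [0.5, 4.5]]
```

The inner loops run once per Query-Key pair, so the work grows with $N \cdot S$; this is why shrinking $S$ reduces the cost.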
3.2 Overview of the Network Architecture
In this section, we introduce our network architecture. Our LRNNet consists of a feature extraction network constructed from our proposed factorized convolution block (Fig. 4(c)) and a pixel classifier enhanced by our SVN module (Fig. 2).
We form our encoder in a three-stage ResNet style (Fig. 2(a)). We adopt the same transition between stages as ENet [12], using a downsampling unit. The core components are our FCB units, which provide light-weighted and efficient feature extraction. For a fair comparison with other lightweight factorized convolution blocks, we adopt the same dilation series for the FCB units in the encoder as LEDNet [17] after the last downsampling unit (details in the supplemental material). Our decoder (Fig. 2(b)) is a pixel-wise classifier enhanced by our SVN module.
Figure 3: Visual examples. Columns: image, ground truth, SS-nbt [17], Model A, Model B, Model C.
3.3 Factorized Convolution Block
Designing a factorized convolution block is a popular way to achieve light-weighted segmentation. Techniques such as dilated convolution for enlarging the receptive field are also important for semantic segmentation models. Our factorized convolution block is inspired by the observation that a 1D factorized kernel could be more suitable for spatially less informative features than for spatially informative ones. Consider the case where a 3×3 convolution kernel is replaced by a 3×1 convolution kernel followed by a 1×3 convolution kernel, which has the same receptive field and fewer parameters. Neglecting the information lost when crossing the activation function between the two 1D convolution kernels, this replacement can be regarded as a rank-1 approximation of the 3×3 convolution kernel. Assume that different spatial semantic regions have different features. If the dilation of the convolution kernel is one or small, the kernel is unlikely to lie across multiple different spatial semantic regions, so the received features are simple and less informative and the rank-1 approximation is more likely to be effective, and vice versa. Therefore, a convolution kernel with large dilation receives complex or spatially informative long-range features (features separated by a large dilation) and needs more parameters in space. Meanwhile, a convolution kernel with small dilation receives simple or less informative short-range features, for which fewer parameters in space are enough.
Our FCB (Fig. 4(c)) first deals with short-range and spatially less informative features with 1D factorized convolution in two split groups. Because this convolution is fully connected in channel, the factorization reduces parameters and computation considerably. To enlarge the receptive field, our FCB then uses a 2D kernel with larger dilation, implemented as a depthwise separable convolution to reduce parameters and computation. A channel shuffle operation is placed last because there is a residual connection after the pointwise convolution. In total, FCB uses a low-rank approximation (1D kernel) in space for short-range features and a depthwise spatial 2D dilated kernel for long-range features, which leads to more lightweight, efficient and powerful feature extraction.
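The rank-1 argument can be checked numerically. The sketch below is a single-channel illustration with no activation between the two 1D kernels (the same simplification made in the text): applying a 3×1 kernel and then a 1×3 kernel gives the same result as one 3×3 kernel equal to their outer product, which has rank 1. The kernel values and the toy image are arbitrary assumptions.

```python
def conv2d_valid(img, ker):
    # Plain 2D cross-correlation without padding (the deep-learning "convolution").
    kh, kw = len(ker), len(ker[0])
    H, W = len(img), len(img[0])
    return [[sum(img[i + a][j + b] * ker[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(W - kw + 1)]
            for i in range(H - kh + 1)]


def outer(col, row):
    return [[c * r for r in row] for c in col]


def is_rank_one(ker):
    # A non-zero matrix has rank 1 exactly when all of its 2 x 2 minors vanish.
    n, m = len(ker), len(ker[0])
    return all(ker[a][b] * ker[c][d] == ker[a][d] * ker[c][b]
               for a in range(n) for c in range(n)
               for b in range(m) for d in range(m))


col = [1, 2, 1]    # 3 x 1 kernel (3 parameters)
row = [-1, 0, 1]   # 1 x 3 kernel (3 parameters)
img = [[(3 * i + 5 * j) % 7 for j in range(5)] for i in range(5)]

two_step = conv2d_valid(conv2d_valid(img, [[c] for c in col]), [row])
one_step = conv2d_valid(img, outer(col, row))  # equivalent 3 x 3 kernel

print(two_step == one_step)            # True
print(is_rank_one(outer(col, row)))    # True
```

A general 3×3 kernel has 9 free parameters and can have rank up to 3, so the 6-parameter factorized pair can only represent the rank-1 subset of such kernels.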
Compared with other factorized convolution blocks (Fig. 4), our FCB has a more elaborate design, fewer parameters, less computation, and faster speed, which will be shown in the experiment part further.
3.4 SVN Module
A light-weighted model can hardly achieve feature extraction as powerful as that of a big network. Therefore, producing reduced, robust and representative features and combining them into non-local modules is an essential way to explore an efficient non-local mechanism for light-weighted semantic segmentation. We revisit the non-local mechanism in the Query-Key-Value form and claim that using reduced and representative features as the Keys and the Values can reduce computation and memory while maintaining effectiveness.
Our SVN module is presented in Fig. 2(b). We reduce the cost in two ways: forming a bottleneck with Conv1 and Conv2 to reduce the channels for the non-local operation, and replacing the Keys and Values with their regional dominant singular vectors. The proposed SVN consists of two branches. The lower branch is a residual connection from the input. The upper branch is the bottleneck of our reduced non-local operation. In the bottleneck, we divide the feature maps into $S$ spatial subregions at a given scale. For each subregion, we flatten it into a matrix and then use power iteration (Algorithm 1) to calculate its left dominant singular vector (of size $C \times 1$, where $C$ is the number of bottleneck channels) efficiently. As mentioned in Section 3.1, rotating columns does not affect the left orthogonal matrix, so the left dominant singular vector is agnostic to the way of flattening; this property is similar to pooling. The regional dominant singular vectors are then used as the Keys and Values for the non-local operation, where a smaller $S$ means less computation, and the Queries are positional vectors (of size $C \times 1$) from the feature maps before dominant singular vector extraction. To enhance the reduced non-local module, we also perform multi-scale region extraction and gather dominant singular vectors from different scales as the Keys and Values (see Fig. 2(b) and Equation 4).
$$y_i = \frac{1}{Z(q_i)} \sum_{s} \sum_{u_j \in \Omega_s} f(q_i, u_j)\, u_j \tag{4}$$
where $y_i$ is the output of SVN, the $\Omega_s$ are collections of regional dominant singular vectors from their related scales, the regional dominant vectors $u_j$ are used as both the Keys ($k_j = u_j$) and the Values ($v_j = u_j$), $q_i$ is a Query from the feature maps before dominant singular vector extraction, $Z$ is the normalization function of Equation 3, and our SVN uses the dot product as the similarity $f$.
As illustrated above, the SVN module forms a reduced and effective non-local operation through its bottleneck structure and its reduced, representative regional dominant singular vectors. The regional dominant singular vector could be the most representative vector for a region of the feature maps. Since some works [22] use pooled features as the Keys and Values, we compare pooling, single-scale and multi-scale region extraction in our structure in the ablation experiments.
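The reduced non-local operation described in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: it omits Conv1, Conv2 and the residual branch, assumes the feature map height and width are divisible by the grid size, and assumes normalization by the number of Keys. The function names and the nested-list layout `feat[c][y][x]` are hypothetical.

```python
import math


def dominant_left_vector(A, T=2):
    # Power iteration on A A^T for a C x (h*w) matrix A; assumes A is non-zero.
    C, n = len(A), len(A[0])
    u = [1.0 / math.sqrt(C)] * C
    for _ in range(T):
        v = [sum(A[c][k] * u[c] for c in range(C)) for k in range(n)]
        w = [sum(a * b for a, b in zip(row, v)) for row in A]
        norm = math.sqrt(sum(x * x for x in w))
        u = [x / norm for x in w]
    return u


def regional_vectors(feat, grid):
    # Split the H x W plane into grid x grid subregions and return one
    # dominant left singular vector (length C) per subregion.
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    h, w = H // grid, W // grid
    vectors = []
    for gy in range(grid):
        for gx in range(grid):
            A = [[feat[c][y][x]
                  for y in range(gy * h, (gy + 1) * h)
                  for x in range(gx * w, (gx + 1) * w)]
                 for c in range(C)]
            vectors.append(dominant_left_vector(A))
    return vectors


def svn_attention(feat, grids=(2, 1)):
    # Keys and Values are the regional vectors gathered from every scale;
    # each positional vector is a Query; the similarity is the dot product.
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    keys = [u for g in grids for u in regional_vectors(feat, g)]
    S = len(keys)
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            q = [feat[c][y][x] for c in range(C)]
            for u in keys:
                weight = sum(a * b for a, b in zip(q, u)) / S  # assumed normalization
                for c in range(C):
                    out[c][y][x] += weight * u[c]
    return out, keys


# Toy feature map: C = 3 channels, 4 x 4 positions, strictly positive values.
feat = [[[1.0 + c + 0.1 * (4 * y + x) for x in range(4)] for y in range(4)]
        for c in range(3)]
out, keys = svn_attention(feat, grids=(2, 1))
print(len(keys))  # 5 regional vectors: 4 from the 2 x 2 grid, 1 from the 1 x 1 grid
```

Because the same vector serves as Key and Value and the similarity is a dot product, flipping the sign of a singular vector leaves its contribution unchanged in this sketch, so the sign ambiguity of the decomposition does not matter here.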
4 Experiments
We conduct experiments to demonstrate the performance of our FCB block and SVN module and the state-of-the-art trade-off among model size, accuracy and efficiency of our proposed segmentation architecture. For ablation, we denote our LRNNet without SVN, with single-scale SVN and with multi-scale SVN as Model A, Model B and Model C, respectively.
4.1 Datasets and Settings
The Cityscapes dataset [5] consists of 5000 street-scene images of 2048×1024 pixels with high-quality pixel-level annotations; there are 2975, 500 and 1525 images in the training, validation and test sets, respectively. Following the light-weighted approaches [17, 15], we adopt 512×1024 subsampled images for testing. The CamVid dataset [2] contains 367 training, 101 validation and 233 testing images with a resolution of 960×720, but we follow the setting of [12, 1] and use 480×360 images for training and testing.
We implement all our methods using PyTorch [13] on a single GTX 1080Ti. Following [3], we employ a poly learning rate policy with a base learning rate of 0.01. The batch size is set to 8 for all training. For CamVid testing and evaluation on the Cityscapes validation set, we train for 250K iterations to study our network quickly. For the Cityscapes test set, we train our model only on fine annotations, for 520K iterations.
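For reference, a poly learning rate policy is commonly written as the base rate multiplied by $(1 - \text{iter}/\text{max\_iter})^{\text{power}}$. The sketch below uses the base learning rate of 0.01 and the 250K-iteration schedule from the text; the exponent 0.9 is an assumption (a value often used with this policy), since the text does not state it, and the function name is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    # power=0.9 is an assumed default; the paper text does not give the exponent.
    return base_lr * (1.0 - iteration / max_iter) ** power


base_lr, max_iter = 0.01, 250_000
for it in (0, 62_500, 125_000, 187_500, 250_000):
    print(it, poly_lr(base_lr, it, max_iter))
```

The rate starts at the base value and decays smoothly to zero at the final iteration.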
Table 1: Ablation results on the Cityscapes validation set.
Model        mIoU (%)  Time (ms)  Params (M)  GFLOPS
SS-nbt [17]  69.6      14         0.95        11.7
Model A      70.6      13         0.67        8.48
Max Pooling  70.2      14         0.68        8.54
Avg Pooling  70.3      14         0.68        8.54
Model B      71.1      14         0.68        8.57
Model C      71.4      14         0.68        8.58
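Table 2: Subregion choices for the SVN module compared with the standard non-local operation. SS denotes single-scale and MS multi-scale.
Subregion             mIoU (%)  Time (ms)  GFLOPS
non-local (64×128)    71.2      22         12.5
SS (16×16)            71.1      15         8.68
SS (8×8)              71.1      14         8.57
SS (4×4)              70.8      14         8.51
MS (8×8 + 4×4)        71.4      14         8.58
MS (8×8 + 4×4 + 2×2)  71.4      14         8.59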
4.2 Ablation Study for FCB
Among the factorized convolution blocks shown in Figure 4, ERFNet [15] and LEDNet [17] simply use 1D factorized kernels to deal with both short-range and long-range (dilated) features. As analyzed in Section 3.3, our FCB deals with short-range features with a 1D factorized kernel and long-range features with a 2D depthwise kernel. We compare our FCB (Model A) with SS-nbt from LEDNet [17] in the same architecture. As shown in Tables 1 and 4, our FCB (Model A) achieves better accuracy with lower parameter size, computation and inference time than SS-nbt, which uses 1D factorized kernels for both short-range and long-range features. Visual examples are in Fig. 3.
4.3 Ablation Study for SVN
Table 2 shows the performance of different subregion choices for our SVN module and of the standard non-local operation. Balancing accuracy, speed and computation cost, we choose 64 (8×8) subregions for single-scale SVN (Model B) and 8×8 + 4×4 subregions for multi-scale SVN (Model C).
We analyze the efficiency of our SVN. Since Algorithm 1 converges efficiently, we set $T$ to 2; its computational complexity is $O(TCN)$ over all subregions, where $C$ is the number of bottleneck channels and $N$ is the number of spatial positions. The bottleneck features on Cityscapes have size $C \times 64 \times 128$; neglecting the convolutions, the standard non-local operation costs 4.0 GFLOPS, with complexity $O(CN^2)$. For our reduced non-local operation, the complexity is $O(CNS)$, where $S$ ($S \ll N$) is the number of Keys and Values. For single-scale SVN (Model B) and pooling, we divide the feature maps into 64 subregions and the computation is 32 MFLOPS. For multi-scale SVN, we divide the feature maps into 64 and 16 subregions, and the computation is 40 MFLOPS. The cost of power iteration in single-scale and multi-scale SVN is 1 MFLOPS and 2 MFLOPS, respectively.
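The quoted costs can be reproduced with simple arithmetic under a few assumptions that are not stated in the text: a bottleneck width of $C = 32$ channels, one operation counted per multiply-add, and binary prefixes ($1\,\mathrm{G} = 2^{30}$, $1\,\mathrm{M} = 2^{20}$). The spatial size 64×128 is the one listed for the standard non-local operation in Table 2. The sketch below is a consistency check under these assumptions, not the authors' accounting.

```python
C = 32           # assumed bottleneck channels (not given in the text)
H, W = 64, 128   # spatial size listed for the non-local row of Table 2
N = H * W        # number of Queries (spatial positions)
T = 2            # power-iteration steps, as set in the text


def nonlocal_cost(C, N, S):
    # Two matrix products: similarities (N x S x C) and aggregation (N x S x C).
    return 2 * C * N * S


def power_iteration_cost(C, N, T, num_scales):
    # Each step applies A^T and A once; the regions of one scale cover all N positions.
    return 2 * T * C * N * num_scales


G, M = 2 ** 30, 2 ** 20  # assumed binary prefixes
print(nonlocal_cost(C, N, N) / G)            # 4.0  (standard non-local)
print(nonlocal_cost(C, N, 64) / M)           # 32.0 (single-scale, S = 64)
print(nonlocal_cost(C, N, 64 + 16) / M)      # 40.0 (multi-scale, S = 80)
print(power_iteration_cost(C, N, T, 1) / M)  # 1.0
print(power_iteration_cost(C, N, T, 2) / M)  # 2.0
```

With these assumptions the reduced operation is 128 times cheaper than the standard one for $S = 64$, and the power iteration adds only a few percent on top of it.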
Table 3: Comparison with other methods on the Cityscapes test set.
Model         Subsample  Pretrained  mIoU (%)  FPS    Params (M)  GFLOPS
SegNet [1]    3          N           57.0      16.7   29.5        286
ENet [12]     3          N           58.3      135    0.37        3.8
FRRN [14]     2          N           71.8      0.25   24.8        235
ICNet [21]    1          Y           69.5      30.3   26.5        28.3
ERFNet [15]   2          N           68.0      41.7   2.1         21.0
CGNet [18]    3          Y           64.8      50.0   0.5         6.0
BiSeNet [19]  4/3        N           68.4             5.8         14.8
DFANet [10]   2          Y           70.3             7.8         1.7
LEDNet [17]   2          N           69.2      71     0.94        11.5
Model A       2          N           70.6      76.5   0.67        8.48
Model B       2          N           71.6      71     0.68        8.57
Model C       2          N           72.2      71     0.68        8.58
We compare regional dominant singular vectors (Model B) with pooled features at a single scale (8×8) to show the effectiveness of dominant singular vectors. Results on the Cityscapes validation set are shown in Table 1. The comparison of single-scale SVN (Model B) with pooling (max or average) shows that regional singular vectors are effective for our network with a light-weighted encoder, giving a 0.5 mIoU improvement over Model A for an additional 0.09 GFLOPS, while pooling cannot provide representative features for a light-weighted network. The multi-scale SVN (Model C) further improves the result to 71.4% mIoU at a small cost in inference time and computation.
4.4 Comparison with Other Methods
We compare our LRNNet with other light-weighted methods on the Cityscapes and CamVid test sets in terms of parameter size, accuracy, speed and computation. We only report the speed of methods implemented in open-source deep-learning frameworks, such as PyTorch, TensorFlow and Caffe, because these frameworks have comparable implementation performance but show a large gap compared with the non-open-source deep-learning frameworks used by some works [10, 19]; details are in the supplemental material. Results are shown in Tables 3 and 4. A blank FPS entry indicates that the speed was not obtained with an open-source deep-learning framework or was not provided. Our network constructed from FCB (Model A) achieves 70.6% mIoU and 76.5 FPS on the Cityscapes test set with only 0.67M parameters and 8.48 GFLOPS, which is more light-weighted and efficient than ERFNet [15] and LEDNet [17], with better accuracy. With single-scale (Model B) and multi-scale SVN (Model C), LRNNet achieves 71.6% and 72.2% mIoU, respectively, on the Cityscapes test set at a small cost in speed and efficiency. Our LRNNet with multi-scale SVN achieves 69.2% mIoU with only 0.68M parameters on the CamVid test set. All results show the state-of-the-art trade-off among parameter size, speed, computation and accuracy of our LRNNet. A visual comparison can be viewed in Fig. 1.
Table 4: Comparison with other methods on the CamVid test set.
Model         Pretrained  mIoU (%)  FPS   Params (M)
SegNet [1]    N           46.4      46    29.5
ENet [12]     N           51.3            0.37
ICNet [21]    Y           67.1      27.8  26.5
CGNet [18]    Y           65.6            0.5
BiSeNet [19]  N           65.6            5.8
DFANet [10]   Y           64.7            7.8
SS-nbt [17]   N           66.6      77    0.95
Model A       N           67.6      83    0.67
Model B       N           68.9      77    0.68
Model C       N           69.2      76.5  0.68
5 Conclusion
We have proposed LRNNet for real-time semantic segmentation. The proposed FCB unit explores a proper form of factorized convolution block to deal with short-range and long-range features, which provides light-weighted, efficient and powerful feature extraction for the encoder of our LRNNet. Our SVN module utilizes regional dominant singular vectors to construct an efficient reduced non-local operation, which enhances the decoder at a very low cost. Extensive experimental results have validated our state-of-the-art trade-off in terms of parameter size, speed, computation and accuracy.
6 Acknowledgement
This paper is supported by NSFC (No. 61772330, 61533012, 61876109), the pre-research project (No. 61403120201), Shanghai Key Laboratory of Crime Scene Evidence (2017XCWZK01) and the Interdisciplinary Program of Shanghai Jiao Tong University (YG2019QNA09).
References
[1] (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI 39 (12), pp. 2481–2495.
[2] (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97.
[3] (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587.
[4] (2017) Xception: deep learning with depthwise separable convolutions. In CVPR.
[5] (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR.
[6] (2019) Dual attention network for scene segmentation. In CVPR.
[7] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
[8] (2019) Interlaced sparse self-attention for semantic segmentation. CoRR abs/1907.12273.
[9] (2019) CCNet: criss-cross attention for semantic segmentation. In ICCV.
[10] (2019) DFANet: deep feature aggregation for real-time semantic segmentation. In CVPR.
[11] (2015) Fully convolutional networks for semantic segmentation. In CVPR.
[12] (2016) ENet: a deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147.
[13] (2017) Automatic differentiation in PyTorch.
[14] (2017) Full-resolution residual networks for semantic segmentation in street scenes. In CVPR.
[15] (2017) ERFNet: efficient residual factorized ConvNet for real-time semantic segmentation. IEEE TITS PP (99), pp. 1–10.
[16] (2018) Non-local neural networks. In CVPR.
[17] (2019) LEDNet: a lightweight encoder-decoder network for real-time semantic segmentation. arXiv preprint arXiv:1905.02423.
[18] (2018) CGNet: a light-weight context guided network for semantic segmentation. CoRR abs/1811.08201.
[19] (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In ECCV.
[20] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR.
[21] (2018) ICNet for real-time semantic segmentation on high-resolution images. In ECCV.
[22] (2019) Asymmetric non-local neural networks for semantic segmentation. In ICCV.