Semantic segmentation is computationally expensive. Dilated convolution relieves this burden by enlarging the receptive field without additional parameters. For an even more lightweight model, depth-wise separable convolution is a practical choice. However, a simple combination of these two methods results in an operation that is too sparse, which can cause severe performance degradation. To resolve this problem, we propose a new block, Concentrated-Comprehensive Convolution (CCC), which takes advantage of both the dilated convolution and the depth-wise separable convolution. The CCC block consists of an information concentration stage and a comprehensive convolution stage. The first stage uses two depth-wise asymmetric convolutions to compress information from neighboring pixels. The second stage increases the receptive field by applying a depth-wise separable dilated convolution to the feature map of the first stage. By replacing the conventional ESP module with the proposed CCC module, we reduce the number of parameters by half and the number of flops by 35% compared to ESPnet, one of the fastest models, without accuracy degradation on the Cityscapes dataset. We further applied the CCC to other segmentation models based on dilated convolution, and our method achieved comparable or higher performance with fewer parameters and flops. Finally, experiments on the ImageNet classification task show that CCC can successfully replace dilated convolutions.
Figure 1: (a) Input image. (b) Ground truth. (c) Result of ESPnet (Param: 0.364M). (d) Result of ds-Dilate (Param: 0.187M). (e) Result of our CCC block (Param: 0.198M).
Deep network-based semantic segmentation algorithms have greatly enhanced performance but suffer from heavy computation. This is mainly because the segmentation task involves pixel-wise classification and because many recent models are designed to have a large receptive field. To resolve this problem, lightweight segmentation models have been actively studied recently.
Recent segmentation studies show that dilated convolution, which expands the receptive field without increasing the number of parameters, can be an important technique for achieving this goal. Using the dilated convolution, many studies have made efforts to increase the speed [16, 14, 15, 20]. Among them, ESPnet works fast by adopting a well-designed series of spatial pyramid dilated convolutions.
To further reduce the number of parameters and the amount of calculation in the dilated convolution, combining depth-wise (channel-wise) separable convolutions with dilated convolutions is one of the popular methods [5, 9]. However, this combination commonly degrades performance as shown in Figure 1.
Figure 1 (c) and (e) are the results of ESPnet-based networks with the same encoder-decoder structure, while (d) increases the number of layers in the encoder to reach a complexity similar to that of (e). They all use different convolution blocks. Figure 1 (d) is obtained using simple depth-wise separable dilated convolution blocks, which show a severe grid effect. In contrast, our proposed method greatly improves the segmentation quality without any grid effect, as shown in Figure 1 (e).
In this paper, we propose a new lightweight segmentation model that runs in real time on an embedded board (Nvidia-TX2) while achieving accuracy comparable to state-of-the-art lightweight segmentation algorithms. To achieve this goal, we describe a new concentrated-comprehensive convolution (CCC) block that is composed of two depth-wise asymmetric convolutions and a depth-wise dilated convolution, and that can replace any normal dilated convolution operation. Our CCC block balances local and global consistency without requiring enormous complexity. Our method also has the advantages of both the depth-wise separable convolution and the dilated convolution.
Our segmentation model based on CCC blocks reduces the number of parameters by almost half and the computational cost in flops by 35% compared to the original ESPnet. The proposed CCC block can be substituted into any segmentation network that uses dilated convolution to reduce complexity with comparable performance. Qualitatively, we show that the proposed convolutional block alleviates the grid effect, which is reported to be one of the main reasons for performance degradation in segmentation. Furthermore, we show that our method can be applied flexibly to other tasks such as classification.
As deep convolutional networks require a lot of convolution operations, there are restrictions on using them in real-time applications. To resolve this problem, convolution factorization methods have been applied.
Convolution Factorization: Convolution factorization divides a convolution operation into several stages to reduce the computational complexity of regular convolutions. Inception [22, 23, 21] used $1 \times 1$ convolution to reduce the number of channels. To decrease the amount of computation further, a convolution with a large kernel size is divided into a combination of small-sized convolutions while keeping the size of the receptive field constant. Xception and MobileNet use the depth-wise separable convolution (ds-Conv), which performs spatial and cross-channel operations separately. The ds-Conv greatly reduces the amount of computation without a large drop in performance. MobileNetV2 further develops MobileNet by applying an inverted residual block. In addition, ResNeXt, which has ResNet as its basic structure, has shown good performance by applying a group convolution on certain channels. ShuffleNet applied the point-wise group convolution, which performs a $1 \times 1$ convolution on specific channels only rather than on all channels, to reduce computation.
Dilated Convolution: Dilated convolution inserts holes between consecutive kernel pixels to obtain information at a wider scale. It is widely used in segmentation because it provides a large receptive field without increasing the amount of computation or the number of parameters. In DeepLabV2 and DeepLabV3 [2, 3], atrous spatial pyramid pooling (ASPP) was introduced, which uses a pyramid structure of dilated convolutions to deal with multi-scale information. DenseASPP repeatedly concatenates a set of convolutional layers with different dilate ratios to generate a denser multi-scale feature representation. DRN showed good performance in segmentation and classification compared to ResNet by adding dilated convolution to ResNet and modifying the residual connection.
Recent research has started to combine dilated convolution with ds-Conv. In machine translation, a super-separable convolution was proposed that divides the depth dimension of a tensor into several groups and performs ds-Conv for each. However, that work did not solve the performance degradation caused by combining the dilated convolution and the ds-Conv. DeepLabv3+ applies the ds-Conv to the ASPP and a decoder module based on an Xception model.
Lightweight Segmentation: Enet is the first architecture designed for real-time segmentation. Since then, ESPNet has brought further improvements in both speed and performance. It introduced a lightweight and accurate network by applying a point-wise convolution to reduce the number of channels and then using a spatial pyramid of dilated convolutions. RTSEG proposed a meta-architecture that assembles several existing networks into an encoder-decoder structure. ERFNet used residual connections and factorized a dilated convolution. Another work proposed a new training method that gradually changes a dense convolution to a grouped convolution; applied to ERFNet, it led to a speed-up. ContextNet designed two network branches for global context and detail information.
Previously, segmentation studies focused on accuracy, but research on lightweight segmentation is becoming more active. Most models make use of dilated convolution and ds-Conv. However, a simple combination of the two induces serious degradation in accuracy. Our method resolves this problem by replacing the dilated convolutions with the newly proposed concentrated-comprehensive convolution (CCC) blocks. We achieved real-time segmentation on an embedded board without performance degradation because the CCC block takes the advantages of both the dilated convolution and the ds-Conv. The proposed CCC is not restricted to a specific model and can be flexibly applied to other models based on dilated convolutions.
Combining dilated convolution and depth-wise separable convolution sometimes induces degradation of accuracy because neighboring information is skipped. To resolve the problem, we propose a concentrated-comprehensive convolution (CCC) block that maintains the segmentation performance without immense complexity. Figure 3 shows the detailed setting of the CCC block, and Figure 4 describes the overall structure of a CCC-ESP module based on ESPnet. We analyze the disadvantage of depth-wise separable dilated (ds-Dilated) convolution in Section 3.1. Section 3.2 explains how the CCC block solves the problems mentioned in Section 3.1 with an asymmetric convolution. We extend the CCC block into a segmentation network based on ESPnet in Section 3.3.
Figure 2: (a) Local information missing in dilated convolution. (b) Noise from an irrelevant area in dilated convolution.
Dilated convolution is an effective variant of the traditional convolution operator for semantic segmentation in that it creates a large receptive field without decreasing the resolution of the feature map. However, the dilated convolution still requires a heavy computational cost that prohibits its use in a lightweight model. Therefore, we adopt the key idea of the depth-wise separable convolution (ds-Conv). Specifically, a dilated convolutional layer applies a convolutional operation to an input feature map $F \in \mathbb{R}^{C_{in} \times H \times W}$ using a convolutional kernel $K \in \mathbb{R}^{C_{out} \times C_{in} \times M \times N}$ with a dilate ratio $d$. The terms $C_{in}$ and $C_{out}$, $H$ and $W$, and $M$ and $N$ are the number of input and output channels, the height and the width of the feature map, and the height and the width of the kernel, respectively. Using the kernel $K$, we compute the output feature map $O$ with a dilate ratio $d$ as

$$O_{c_{out},h,w} = \sum_{c_{in}} \sum_{m} \sum_{n} K_{c_{out},c_{in},m,n} \, F_{c_{in},\,h+dm,\,w+dn},$$

where $m$ and $n$ range over the kernel offsets.
Then, we can apply the depth-wise convolution to the dilated convolution as

$$O_{c_{out},h,w} = \sum_{c_{in}} K^{(2)}_{c_{out},c_{in}} \left( \sum_{m} \sum_{n} K^{(1)}_{c_{in},m,n} \, F_{c_{in},\,h+dm,\,w+dn} \right).$$

Note that the inner sum does not calculate any cross-channel multiplication but takes only the spatial multiplication. Hence, the kernel $K^{(1)}$ is for the depth-wise dilated convolution in the spatial dimension, and $K^{(2)}$ denotes a kernel for a $1 \times 1$ point-wise convolution. The parameter size is then reduced from $N^2 C_{in} C_{out}$ to $C_{in}(N^2 + C_{out})$ when the kernel size is $N \times N$, i.e., $M = N$. The number of floating point operations (flops) is also largely reduced, from the order of $H W N^2 C_{in} C_{out}$ to the order of $H W C_{in}(N^2 + C_{out})$. However, we observe that this approximation leads to significant performance degradation compared to the original dilated convolution.
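The saving described above can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes a square k x k kernel, no bias terms, and one multiply-accumulate (MAC) per kernel tap; the function names and the example layer size are hypothetical and are not taken from the paper.

```python
def dilated_conv_cost(c_in, c_out, k, h, w):
    """Weights and MACs of a standard k x k dilated convolution (no bias).

    Dilation changes where the taps are sampled, not how many taps there are,
    so the dilate ratio does not appear in either count.
    """
    params = k * k * c_in * c_out
    macs = h * w * params
    return params, macs


def ds_dilated_conv_cost(c_in, c_out, k, h, w):
    """Same counts for a depth-wise dilated conv followed by a 1x1 point-wise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution that mixes channels
    params = depthwise + pointwise
    macs = h * w * params
    return params, macs


# Hypothetical layer: 64 -> 64 channels, 3x3 kernel, 128 x 256 output map.
std_params, std_macs = dilated_conv_cost(64, 64, 3, 128, 256)
ds_params, ds_macs = ds_dilated_conv_cost(64, 64, 3, 128, 256)
print(std_params, ds_params)
print(round(std_params / ds_params, 2))
```

Under these assumptions the separable variant needs roughly an order of magnitude fewer weights for this layer, which is the motivation for combining the two techniques in the first place.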
We should also note that the dilated convolution has inherent risks, as shown in Figure 2. First, local information can be missing. Since the dilated convolutional operation is spatially discrete according to the dilate ratio, information loss is inevitable. Second, an irrelevant area across large distances affects the result of segmentation. Dilated convolution includes areas far away from the target area in calculating the convolution, which impairs local consistency as well. For example, the target area in the red square is misclassified due to the green squares beneath it, as in Figure 2(b).
Also, unlike standard convolution, a ds-Conv is independent among channels, and hence spatial information of other channels does not directly flow. From the observation in Figure 1 (d), we conjecture that the loss of cross-channel information triggers the mentioned risks of dilated convolution and degrades the performance.
We propose a concentrated-comprehensive convolution (CCC) block to compensate for the segmentation performance degradation, based on the observations in the previous section. The CCC block consists of an information concentration stage and a comprehensive convolution stage. The information concentration stage aggregates local feature information by using a simple convolutional kernel. The comprehensive convolution stage uses dilation to cover a large receptive field, followed by a point-wise convolution that mixes the channel information. Note that we apply the depth-wise convolution to the dilated convolution to further reduce the parameter size.
As in Figure 2(a), the feature information of the white pixels in the region of $(2d-1) \times (2d-1)$ centered on the red pixel will be lost when executing dilated convolution with dilation $d$. The information concentration stage alleviates this loss by executing a simple depth-wise convolution before the dilated convolution, which compresses the skipped feature information as shown in Figure 3B.
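The first risk, missing local information, can be made concrete in one dimension. The sketch below lists which input positions a dilated kernel reads and which positions inside its span it never touches. The helper names are hypothetical, and an odd kernel size is assumed.

```python
def dilated_taps(center, kernel_size, dilation):
    """Input positions read by a 1-D dilated kernel centred at `center`."""
    half = kernel_size // 2
    return [center + dilation * i for i in range(-half, half + 1)]


def skipped_positions(center, kernel_size, dilation):
    """Positions inside the kernel's span that the dilated kernel never reads."""
    taps = set(dilated_taps(center, kernel_size, dilation))
    span = range(min(taps), max(taps) + 1)
    return [p for p in span if p not in taps]


taps = dilated_taps(center=10, kernel_size=3, dilation=4)
skipped = skipped_positions(center=10, kernel_size=3, dilation=4)
print(taps)     # [6, 10, 14]
print(skipped)  # [7, 8, 9, 11, 12, 13]
```

With a dilate ratio of 4, six of the nine positions in the span are ignored, and the nearest neighbours of the centre pixel are among them.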
We note that as the dilate ratio increases, the depth-wise convolution in the information concentration stage becomes extremely inefficient. In most cases, the dilate ratio goes up to 16, so the operation becomes intractable in an embedded system. Specifically, when we apply regular $(2d-1) \times (2d-1)$ depth-wise convolutions to a feature map with $C$ channels, the number of parameters is $(2d-1)^2 C$. When the size of the output feature map is $H \times W$, the number of flops becomes $(2d-1)^2 C H W$. However, if the convolution kernel is separable, the kernel can be decomposed as the outer product of a column vector and a row vector, $K = k_v k_h^{\top}$. For an $N \times N$ kernel, this separable convolution reduces the computational complexity per pixel from $O(N^2)$ to $O(N)$. We solve the complexity problem by using two depth-wise asymmetric convolutions instead of a regular depth-wise convolution, as shown in Figure 3C.
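The separability argument can be verified numerically. The sketch below, written with plain nested lists so that it has no dependencies, checks that one pass with an N x N outer-product kernel gives the same result as an N x 1 pass followed by a 1 x N pass. The kernel values and the toy image are arbitrary choices made for illustration.

```python
def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


col = [1, 2, 1]                              # N x 1 kernel (vertical pass)
row = [1, 0, -1]                             # 1 x N kernel (horizontal pass)
full = [[c * r for r in row] for c in col]   # N x N outer product of the two

# Arbitrary 5 x 6 integer image, so the comparison below is exact.
image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(5)]

one_pass = conv2d_valid(image, full)
two_pass = conv2d_valid(conv2d_valid(image, [[c] for c in col]), [row])

n = len(col)
print(one_pass == two_pass)
print(n * n, 2 * n)   # multiplications per output pixel: full vs. separable
```

Note that the equivalence holds only for kernels that are exactly separable. Two learned asymmetric convolutions span a smaller family of filters than a full N x N kernel, which is the price paid for the lower cost.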
The comprehensive convolution stage uses a depth-wise dilated convolution to widen the receptive field for global consistency. After that, we execute the cross-channel operation with a $1 \times 1$ point-wise convolution. Since the CCC block shown in Figure 3C is a kind of ds-Conv, only a small number of additional parameters and calculations are required. At the same time, the comprehensive convolution stage gives the CCC block a receptive field large enough for segmentation. In summary, the CCC block combines the advantages of both the depth-wise separable convolution and the dilated convolution by properly integrating local and global information. Therefore, although a segmentation network based on the CCC block has fewer parameters and lower computational complexity than the original network, the proposed block can achieve sufficiently good segmentation performance.
In this section, we describe our network design with CCC blocks. Figure 4(a) shows our proposed CCC module with a parallel structure, which is modified from the ESP module shown in Figure 4(b). The original ESP module has a parallel structure of a spatial pyramid of dilated convolutions, each with its own dilate ratio. It first reduces the number of channels in the input feature map and then applies each dilated convolution to the reduced input feature map. In addition, the outputs of the dilated convolutions are element-wise summed hierarchically.
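To see why the asymmetric factorization matters as the dilate ratio grows, the weight counts of the three block variants can be compared directly. The sketch below is illustrative only: it assumes a 3 x 3 dilated kernel, no bias or normalization weights, and a concentration window of k = 2d - 1 pixels, chosen so that the window covers the pixels skipped at dilate ratio d. These are assumptions of the sketch, as are the function name and the channel count.

```python
def block_params(c_in, c_out, dilation, structure):
    """Weight count (no bias, no normalisation) for one block.

    structure "A": depth-wise dilated 3x3 + 1x1 point-wise
    structure "B": regular depth-wise k x k, then A
    structure "C": depth-wise k x 1 and 1 x k, then A
    with k = 2 * dilation - 1 (an assumption of this sketch).
    """
    k = 2 * dilation - 1
    comprehensive = 9 * c_in + c_in * c_out
    concentration = {
        "A": 0,
        "B": k * k * c_in,
        "C": 2 * k * c_in,
    }[structure]
    return concentration + comprehensive


# Hypothetical branch width of 32 channels.
for d in (2, 4, 8, 16):
    print(d, [block_params(32, 32, d, s) for s in "ABC"])
```

In this toy setting the cost of structure B grows quadratically with the dilate ratio, while structure C grows only linearly and stays close to structure A.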
|ESP module||CCC module|
Likewise, we first reduce the number of channels according to the number of CCC blocks and then apply the parallel CCC blocks to the reduced feature map. Unlike other studies [13, 18, 27], we simply concatenate the feature maps without additional post-processing. Furthermore, ESPnet includes a branch with a dilate ratio of 1, but we exclude it to reduce computation. This exclusion is reasonable because, in our CCC block, the information concentration stage gathers neighboring information. The remaining branches view multiple levels of spatial information. For the same largest dilate ratio, the receptive field of the module matches that of the ESP module. The full comparison between the ESP module and our module is shown in Table 1.
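As a rough guide to how much spatial context each parallel branch sees, the extent of a single dilated kernel can be computed in closed form. The sketch below assumes a 3 x 3 dilated kernel in every branch and uses the dilate ratios of one of the evaluated configurations; it covers only the dilated kernel itself, not the additional context gathered by the concentration stage.

```python
def dilated_extent(kernel_size, dilation):
    """Side length of the window spanned by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


extents = {d: dilated_extent(3, d) for d in (2, 4, 8, 16)}
print(extents)  # {2: 5, 4: 9, 8: 17, 16: 33}
```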
|Method||Extra dataset||Param(M)||Flop(G)||Class mIOU|
|FCN-8S (Long, Shelhamer, and Darrell 2015) ||ImageNet||134.5||970.15||65.3|
|Segnet (Badrinarayanan, Kendall, and Cipolla 2015) ||ImageNet||29.5||163.1||57.0|
|DeepLabV3+ (Chen et al. 2018) ||ImageNet||41.06||280.08||82.1|
|DenseASPP121 (Yang et al. 2018) ||ImageNet||8.28||155.82||76.2|
|DRN-A50 (Yu, Koltun, and Funkhouser 2017) ||ImageNet||23.55||400.15||67.3|
|DRN-C26 (Yu, Koltun, and Funkhouser 2017) ||ImageNet||20.62||355.18||68.0|
|Enet (Paszke et al. 2016) ||no||0.364||8.52||58.3|
|Skip-Shuffle (Siam et al. 2018) ||no||1.0||6.22||58.3|
|Skip-Mobile (Siam et al. 2018) ||no||3.37||15.38||62.4|
|ContextNet-cn12 (Poudel et al. 2018) ||no||0.85||32.68||68.7|
|ERFnet (Romera et al. 2017) ||no||2.1||53.48||68.0|
|ESPnet (Mehta et al. 2018) ||no||0.364||9.67||60.3|
|ESPnet-tiny (Mehta et al. 2018) ||no||0.202||6.48||56.3|
|CCC1 (d=2, 4, 8, 16)||no||0.198||6.45||60.98|
|CCC2 (d=2, 3, 7, 13)||no||0.192||6.29||61.96|
As in ESPnet, the overall network adopts the feature re-sampling method that concatenates the feature map with one obtained by reducing the resolution of the original input image. For our encoder, we simply adopted a structure similar to that of the encoder in ESPnet. Since our module is more efficient than that of ESPnet, the number of parameters is significantly reduced. The decoder concatenates the feature maps of each encoder stage to recover the original size, and the existing ESP module in the decoder is also replaced with our CCC module.
We note that our CCC block can be adapted not only to ESPnet but also to other deep networks based on dilated convolution to obtain a more efficient model. By changing the dilated convolutions in a segmentation model to the proposed CCC block, we can reduce the number of parameters and the computational cost while keeping or enhancing the segmentation performance relative to the original network. Detailed results are provided in Section 4.
|(2)||Depth-wise separable ESPnet||O||0.128||4.93||57.81|
|(3)||w/o Concentrated-stage (Structure A)||X||0.152||5.44||54.24|
|(4)||w/o Concentrated-stage++ (Structure A)||X||0.187||6.26||55.22|
|(5)||with regular dw-Conv Concentration stage (Structure B)||X||0.580||15.68||58.56|
|(6)||Our proposed CCC with dw-Asym (structure C)||X||0.194||6.40||59.57|
|(7)||Our proposed CCC with dw-AsymNL (structure C)||X||0.198||6.45||60.98|
CCC: Concentrated-Comprehensive Convolution. ++: the number of layers in the encoder structure is increased for similar parameters and flops. dw-Conv: depth-wise convolution. dw-Asym: depth-wise asymmetric convolution. dw-AsymNL: batch normalization and PReLU are inserted between the depth-wise asymmetric convolutions. All results are reproduced in the PyTorch framework under the same data augmentation and settings. Flops are calculated for a fixed input size.
We evaluated the proposed method and performed an ablation study on the Cityscapes dataset, which is widely used in semantic segmentation. All performances were measured using the mean intersection over union (mIOU), the number of parameters, and flops. To show the flexibility of our CCC, we also performed a classification task on the ImageNet dataset using the CCC block.
Experimental Setting: The Cityscapes dataset consists of multiple classes of urban street scenes. The numbers of training and validation images are 2,975 and 500, respectively. To train the model, we followed the standard data augmentation strategy including random scaling, cropping, and flipping. The learning rate was decayed at epochs 150 and 200 (300 epochs in total). The Adam optimizer with momentum and weight decay was used for training. The other experiments (CCC-ERFnet and CCC-DRN) followed the settings in the respective original papers [16, 27].
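For readers unfamiliar with the metric, the following is a minimal sketch of how mean intersection over union is commonly computed from a confusion matrix. It is a generic illustration with a made-up two-class matrix, not the official Cityscapes evaluation script, and the convention of skipping absent classes is an assumption of the sketch.

```python
def mean_iou(confusion):
    """Mean intersection over union from a square confusion matrix.

    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes that occur in neither the labels nor the predictions are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example (pixel counts are invented for illustration).
toy = [
    [50, 10],
    [5, 35],
]
print(round(mean_iou(toy), 4))
```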
Experimental Result on Cityscapes Dataset: Table 2 and Figure 5 show the evaluation results of the baseline segmentation networks and of the models obtained by switching their dilated convolutional blocks to the proposed CCC block. Both CCC1 and CCC2 use ESPnet as the baseline network with different dilation ratios $d$, namely $\{2, 4, 8, 16\}$ and $\{2, 3, 7, 13\}$, respectively. We note that the HFF module used in the original ESP is not used in CCC1 and CCC2. CCC-ESPnet used all the original settings, including the HFF module and a convolution with a dilated ratio of 1, but its mIOU was not much different from those of CCC1 and CCC2. The results show that CCC2 outperforms CCC1 by about 1 mIOU point with fewer parameters. This supports the proposition of previous work that the dilation ratios should be coprime.
The ESPnet-tiny model is a smaller version of ESPnet from the original paper, which has a similar number of parameters to the proposed CCC1 and CCC2. The results in Table 2 show that both CCC1 and CCC2 outperform ESPnet-tiny by significant margins, about 4.7 mIOU points for CCC1 and 5.7 for CCC2. Both models were also run on an Nvidia-TX2 board, and both can be regarded as real-time methods.
To demonstrate the general applicability of our CCC block, we conducted further experiments with other segmentation models based on dilated convolution: ERFnet, DRN-A50, and DRN-C26. For every model, our method achieved comparable or higher performance with a reduced number of parameters and flops. In particular, CCC-ERFnet showed more than 1% performance enhancement with 30% fewer parameters. CCC-DRN-A50 reduced the number of parameters and flops relative to the baseline DRN-A50 with a slight performance drop. In the CCC-DRN-C26 case, the number of parameters and flops were also reduced relative to DRN-C26 with only a small performance drop. The overall results show that the proposed CCC block can be incorporated into diverse segmentation models that use dilated convolution blocks without much performance degradation.
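The coprimality remark can be checked directly on the two dilate-ratio sets listed for CCC1 and CCC2 in Table 2. The sketch below reports every pair of ratios that shares a common factor; the helper name is hypothetical.

```python
from itertools import combinations
from math import gcd


def shared_factor_pairs(ratios):
    """Pairs of dilate ratios that are not coprime."""
    return [(a, b) for a, b in combinations(ratios, 2) if gcd(a, b) > 1]


ccc1_pairs = shared_factor_pairs([2, 4, 8, 16])
ccc2_pairs = shared_factor_pairs([2, 3, 7, 13])
print(ccc1_pairs)
print(ccc2_pairs)  # []
```

Every pair in the first set shares the factor 2, so the sampled positions of different branches tend to fall on the same grid, whereas the second set is pairwise coprime.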
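The relative savings with respect to ESPnet follow from the parameter and flop columns of Table 2. The short script below recomputes them from the tabulated values; only the helper name and dictionary layout are introduced here.

```python
def reduction(baseline, value):
    """Relative reduction in percent with respect to a baseline."""
    return 100.0 * (baseline - value) / baseline


# Values copied from Table 2: parameters in millions, flops in billions.
espnet = {"params": 0.364, "flops": 9.67}
models = {
    "CCC1": {"params": 0.198, "flops": 6.45},
    "CCC2": {"params": 0.192, "flops": 6.29},
}

reductions = {
    name: {key: reduction(espnet[key], m[key]) for key in ("params", "flops")}
    for name, m in models.items()
}
for name, r in reductions.items():
    print(name, round(r["params"], 1), round(r["flops"], 1))
```

Both configurations land close to the "about half the parameters and roughly a third of the flops" summary given in the text.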
Figure 6: (a) Image and ground truth. (b) Structure A. (c) Structure B. (d) Structure C. (e) Structure C with non-linearity (NL).
Ablation Study on CCC: In this section, we show ablation results of the proposed CCC block based on Figure 3. The experiments use the network based on ESPnet introduced in Section 3. Experiment (1) is the original ESPnet with its original dilated ratios. Experiment (2) uses the same network structure as the original ESPnet, but the standard dilated convolutions are replaced with depth-wise separable dilated convolutions (ds-Dilate). Experiments (3)-(7) are ablation studies of the proposed method and share the same set of dilated ratios. In these experiments, batch normalization was added between the concentration stage and the comprehensive convolution stage, and the HFF degridding module was not used. Experiment (3) shows the result with only ds-Dilate, without the information concentration stage. Experiment (4) is the same ablation test as Experiment (3) with a larger number of layers, so that its parameter size and flops are similar to those of the proposed method (Experiment (6)). Experiment (5) is structure B in Figure 3, in which the information concentration stage consists of a depth-wise convolution (dw-Conv). Experiment (6) is the proposed CCC block (structure C) in Figure 3. This model reduces the number of parameters and flops by factorizing the dw-Conv of Experiment (5) into depth-wise asymmetric convolutions (dw-Asym). Experiment (7) inserts batch normalization and PReLU between the two dw-Asym of Experiment (6). ESPnet used PReLU as its activation function, so we also use PReLU instead of ReLU between the dw-Asym.
As shown in Table 3 (2)-(4), a naive use of the depth-wise separable architecture brought significant performance degradation (about 3%), and the HFF module could not fully resolve it. From experiments (3) and (4), we can conclude that the information concentration stage is critical for resolving the accuracy drop from ds-Dilate, regardless of the number of parameters. Experiment (5) shows that simply using a depth-wise convolution at the information concentration stage achieves better performance than using HFF. However, the number of parameters is quite large in this case. In Experiment (6), the proposed model drastically reduces the number of parameters and flops while achieving better performance than that of Experiment (5). From experiment (7), we found that adding non-linearity between the asymmetric convolutions can marginally enhance the performance, by about 1.4%, over that of experiment (6).
Figure 6 visualizes the output feature maps of the various structures, obtained by averaging the feature maps over channels. The feature maps of structure A are not clear and have a serious grid effect, as shown in Figure 6 (b). In contrast, adding the information concentration stage improves the quality of the feature maps (see (c)-(e)). Note that the artifacts in structure C (Figure 6(d)) are almost entirely removed by adding non-linearity between the asymmetric convolutions (Figure 6(e)).
|CCC DRN C-26||7.85||6.59||73.43||91.33|
|CCC DRN C-28*||13.11||10.72||74.39||91.99|
|CCC DRN C-42||9.63||8.17||74.96||92.31|
|CCC DRN C-44*||14.90||12.3||75.64||92.63|
Experimental Setting: The evaluation is performed on the ImageNet-1k dataset. Training was performed using SGD with momentum 0.9 and weight decay. The learning rate was reduced by a factor of 10 every 30 epochs. Training ran for 120 epochs with the same data augmentation method as DRN.
Figure 7: (a) Image. (b) DRN C-26. (c) CCC DRN C-44.
ImageNet 1k Benchmark Results: From the experiments on the segmentation task, we have shown that the proposed CCC block can successfully substitute for the dilated convolutional block. Going one step beyond segmentation, we applied the proposed block to classification networks that incorporate the dilated convolution scheme with skip-connections (DRN). The results in Table 4 show that the proposed method achieves comparable classification accuracy with a reduced parameter size.
For example, CCC-DRN-C-42 reduced the parameters and flops by more than half compared to DRN-C-26, while the top-1 accuracy dropped by only 0.18%. CCC DRN C-44* also reduced the number of parameters by 30% and the number of flops by half, yet achieved better accuracy than DRN-C-26. A low-resolution feature map has been reported to degrade classification accuracy and localization power. DRN showed that adding dilated convolution strengthens the localization power and hence can enhance the classification accuracy without adding further parameters. By changing the dilated convolutional parts in DRN to the proposed CCC block, we tested whether our method can substitute for the dilated convolution in this case as well.
We visualized the dense pixel-level class activation maps from DRN and our CCC-DRN in Figure 7 to show each model's localization power. The figure shows that the proposed method generates activation maps with better resolution than DRN, which means that our method has localization capacity comparable to that of DRN.
In this work, we proposed the Concentrated-Comprehensive Convolution (CCC) block for lightweight semantic segmentation. Our CCC block comprises the information concentration stage and the comprehensive convolution stage, both of which are based on two consecutive convolutional blocks. More specifically, the former improves local consistency by using two depth-wise asymmetric convolutions, and the latter increases the receptive field by using a depth-wise separable dilated convolution. The extensive experiments show that our proposed method integrates local and global information effectively and reduces the number of parameters and flops significantly. Furthermore, the CCC block is generally applicable to other semantic segmentation models and can be adopted for other tasks such as classification.
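The step schedule described above can be written as a one-line function. In the sketch below the decay factor of 10, the 30-epoch step, and the 120-epoch horizon come from the text, while the initial learning rate of 0.1 is a placeholder chosen purely for illustration, since the value is not available here.

```python
def step_lr(initial_lr, epoch, step_size=30, factor=10.0):
    """Learning rate after dividing by `factor` once every `step_size` epochs."""
    return initial_lr / factor ** (epoch // step_size)


base_lr = 0.1  # hypothetical initial value, not taken from the paper
for epoch in (0, 29, 30, 60, 90, 119):
    print(epoch, step_lr(base_lr, epoch))
```

Over 120 epochs this produces four plateaus, with the last one a thousand times smaller than the initial rate.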
The cityscapes dataset for semantic urban scene understanding. Pages 3213–3223, 2016.
Depthwise separable convolutions for neural machine translation. In International Conference on Learning Representations, 2018.
Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
Table 5 shows how we calculated flops for each operation.
The following notations are used.
: An input feature map
: An output feature map
: A convolution kernel
: The height of the convolution kernel
: The width of the convolution kernel
: The height of the input feature map
: The width of the input feature map
: The input channel dimension of the feature map or kernel
: The output channel dimension of the feature map or kernel
: The group size for the channel dimension
: The height of the output feature map
: The width of the output feature map
: A non-linear activation function