Concentrated-Comprehensive Convolutions for lightweight semantic segmentation

by Hyojin Park, et al.
Seoul National University

Semantic segmentation is computationally expensive. Dilated convolution relieves this burden by increasing the receptive field without additional parameters. For a more lightweight model, depth-wise separable convolution is one of the practical choices. However, a simple combination of these two methods results in an operation that is too sparse, which may cause severe performance degradation. To resolve this problem, we propose a new block, Concentrated-Comprehensive Convolution (CCC), which takes advantage of both the dilated convolution and the depth-wise separable convolution. The CCC block consists of an information concentration stage and a comprehensive convolution stage. The first stage uses two depth-wise asymmetric convolutions to compress information from neighboring pixels. The second stage increases the receptive field by applying a depth-wise separable dilated convolution to the feature map of the first stage. By replacing the conventional ESP module with the proposed CCC module, we reduce the number of parameters by half and the number of flops by 35% compared with ESPnet, one of the fastest models, without accuracy degradation on the Cityscapes dataset. We further applied the CCC block to other segmentation models based on dilated convolution, and our method achieved comparable or higher performance with fewer parameters and flops. Finally, experiments on the ImageNet classification task show that CCC can successfully replace dilated convolutions.



1 Introduction

(a) Input image
(b) Ground Truth
(c) Result of ESPnet (Param : 0.364M)
(d) Result of ds-Dilate (Param : 0.187M)
(e) Result of our CCC block (Param : 0.198M)
Figure 1: Illustration of performance degradation from depth-wise separable dilated convolution (ds-Dilate). (c): Original ESPnet [18], (d): ESPnet with increased layers using ds-Dilate, (e): Our model using CCC blocks with the same encoder-decoder structure as (c). Param denotes the number of parameters for each model.

Deep network-based semantic segmentation algorithms have greatly improved performance but suffer from heavy computation. This is mainly because the segmentation task involves pixel-wise classification and because many recent models are designed to have a large receptive field. To resolve this problem, lightweight segmentation models have been actively studied recently.

Recent segmentation studies show that dilated convolution, which expands the receptive field without increasing the number of parameters, can be an important technique for achieving this goal. Using dilated convolution, many studies have sought to increase speed [16, 14, 15, 20]. Among them, ESPnet [18] runs fast by adopting a well-designed series of spatial pyramid dilated convolutions.

To further reduce the number of parameters and the amount of calculation in the dilated convolution, combining depth-wise (channel-wise) separable convolutions with dilated convolutions is one of the popular methods [5, 9]. However, this combination commonly degrades performance as shown in Figure 1.

Figure 1 (c) and (e) are the results of ESPnet-based networks with the same encoder-decoder structure, while (d) uses more layers in the encoder so that its complexity is similar to that of (e). All three use different convolution blocks. Figure 1 (d) is obtained using simple depth-wise separable dilated convolution blocks and shows a severe grid effect. In contrast, our proposed method greatly improves the segmentation quality without any grid effect, as shown in Figure 1 (e).

In this paper, we propose a new lightweight segmentation model that runs in real time on an embedded board (Nvidia-TX2) while achieving accuracy comparable to state-of-the-art lightweight segmentation algorithms. To achieve this goal, we describe a new concentrated-comprehensive convolution (CCC) block that is composed of two depth-wise asymmetric convolutions and a depth-wise dilated convolution, and which can replace any normal dilated convolution operation. Our CCC block balances local and global consistency without requiring enormous complexity. Our method also has the advantages of both the depth-wise separable convolution and the dilated convolution.

Our segmentation model based on CCC blocks reduces the number of parameters by almost half and the computational cost in flops by 35% compared to the original ESPnet. The proposed CCC block can replace the dilated convolution in arbitrary segmentation networks to reduce complexity with comparable performance. Qualitatively, we show that the proposed convolutional block alleviates the grid effect, which is reported to be one of the main reasons for performance degradation in segmentation. Furthermore, we show that our method can be applied flexibly to other tasks such as classification.

2 Related Work

As deep convolutional networks require a lot of convolution operations, there are restrictions on using them in real-time applications. To resolve this problem, convolution factorization methods have been applied.

Convolution Factorization: Convolution factorization divides a convolution operation into several stages to reduce the computational complexity of regular convolutions. Inception [22, 23, 21] used 1×1 convolution to reduce the number of channels. To decrease the amount of computation further, a convolution with a large kernel is divided into a combination of small convolutions while keeping the size of the receptive field constant. Xception [5] and MobileNet [9] use the depth-wise separable convolution (ds-Conv), which performs spatial and cross-channel operations separately. The ds-Conv greatly reduces the amount of computation without a large drop in performance. MobileNetV2 [19] further develops MobileNet [9] by applying an inverted residual block. In addition, ResNeXt [26], which has ResNet [7] as its basic structure, has shown good performance by applying a group convolution on certain channels. ShuffleNet [28] applied the point-wise group convolution, which performs a 1×1 convolution on specific channels only rather than on all channels, to reduce computation.

Dilated Convolution: Dilated convolution [8] inserts holes between consecutive kernel pixels to obtain information at a wider scale. Dilated convolution is widely used in segmentation because it provides a large receptive field without increasing the amount of computation or the number of parameters. In DeepLabV2 and DeepLabV3 [2, 3], atrous spatial pyramid pooling (ASPP) was introduced, using a pyramid structure of dilated convolutions to deal with multi-scale information. DenseASPP [13] repeatedly concatenates a set of convolutional layers with different dilate ratios to generate a denser multi-scale feature representation. DRN [27] showed good performance in segmentation and classification compared to ResNet [7] by adding dilated convolution to ResNet [7] and modifying the residual connection.

Recent studies have started to combine dilated convolution with ds-Conv. In machine translation, [10] proposed a super-separable convolution that divides the depth dimension of a tensor into several groups and performs ds-Conv for each. However, they did not solve the performance degradation that arises from combining the dilated convolution and the ds-Conv. DeepLabv3+ [4] applies the ds-Conv to the ASPP [2] and a decoder module based on an Xception model [5].

Lightweight Segmentation: Enet [14] is the first architecture designed for real-time segmentation. Since then, ESPnet [18] has brought further improvements in both speed and performance. It introduced a lightweight and accurate network by applying a point-wise convolution to reduce the number of channels and then using a spatial pyramid of dilated convolutions. RTSEG [20] proposed a meta-architecture that assembles several existing networks into an encoder-decoder structure. ERFNet [16] used residual connections and factorized a dilated convolution. The authors of [24] proposed a new training method that gradually changes a dense convolution into a grouped convolution. They applied this method to ERFNet [16] and improved its speed. ContextNet [15] designed two network branches for global context and detail information.

Previously, segmentation studies focused on accuracy, but research on lightweight segmentation is becoming more active. Most models make use of dilated convolution and ds-Conv. However, a simple combination of the two induces serious degradation in accuracy. Our method resolves this problem by replacing the dilated convolutions with the newly proposed concentrated-comprehensive convolution (CCC) blocks. We achieved real-time segmentation on an embedded board without performance degradation because the CCC block takes the advantages of both the dilated convolution and the ds-Conv. The proposed CCC block is not restricted to a specific model but can be flexibly applied to other models based on dilated convolutions.

3 Method

Combining dilated convolution and depth-wise separable convolution sometimes degrades accuracy, because neighboring information is skipped. To resolve the problem, we propose a concentrated-comprehensive convolution (CCC) block that maintains the segmentation performance without immense complexity. Figure 3 shows the detailed setting of the CCC block, and Figure 4 describes the overall structure of a CCC-ESP module based on ESPnet [18]. We analyze the disadvantages of depth-wise separable dilated (ds-Dilated) convolution in Section 3.1. Section 3.2 explains how the CCC block solves the problems mentioned in Section 3.1 with an asymmetric convolution. We extend the CCC block into a segmentation network based on ESPnet in Section 3.3.

3.1 Issues on Depth-wise Separable Dilated Convolution

(a) Local information missing in dilated convolution
(b) Noise from irrelevant area in dilated convolution.
Figure 2: Two major negative effects by dilated convolution. (a) The example of two consecutive layers of dilated convolution, pixels in white are not used for computing output feature regarding the red center. (b) The pixels in the green boxes influence the segmentation result of the area in the red box. From left to right, the original image, the ground truth, and the segmentation result are plotted.
Figure 3: Structure of Concentrated-Comprehensive Convolutions (CCC) block. Structure-A is standard depth-wise separable dilated convolution. Structure-B uses a regular depth-wise convolution for information concentration stage. Structure-C is our proposed CCC, which factorizes a regular depth-wise convolution to two asymmetric convolutions.

Dilated convolution is an effective variant of the traditional convolution operator for semantic segmentation in that it creates a large receptive field without decreasing the resolution of the feature map. However, the dilated convolution still requires heavy computation, which prevents it from being used in a lightweight model. Therefore, we adopt the key idea of the depth-wise separable convolution (ds-Conv). Specifically, a dilated convolutional layer applies a convolution to an input feature map $F \in \mathbb{R}^{C_{in} \times H \times W}$ using a convolutional kernel $K \in \mathbb{R}^{C_{out} \times C_{in} \times M \times N}$ with a dilate ratio $d$. The terms $C_{in}$ and $C_{out}$, $H$ and $W$, and $M$ and $N$ are the number of input and output channels, the height and the width of the feature map, and the height and the width of the kernel, respectively. Using the kernel $K$, we compute the output feature map $O \in \mathbb{R}^{C_{out} \times H \times W}$ with a dilate ratio $d$ as

$$O_{c_o,h,w} = \sum_{c_i=1}^{C_{in}} \sum_{m} \sum_{n} F_{c_i,\,h+dm,\,w+dn}\, K_{c_o,c_i,m,n},$$

where the kernel indices $m$ and $n$ are centered at zero. Then, we can apply the depth-wise convolution to the dilated convolution as

$$O'_{c_i,h,w} = \sum_{m} \sum_{n} F_{c_i,\,h+dm,\,w+dn}\, K^{(d)}_{c_i,m,n}, \qquad O_{c_o,h,w} = \sum_{c_i=1}^{C_{in}} O'_{c_i,h,w}\, K^{(p)}_{c_o,c_i}.$$

Note that the first step does not calculate cross-channel multiplication, but takes only the spatial multiplication. Hence, the kernel $K^{(d)} \in \mathbb{R}^{C_{in} \times M \times N}$ is for the depth-wise dilated convolution in the spatial dimension, and $K^{(p)} \in \mathbb{R}^{C_{out} \times C_{in}}$ denotes a kernel for a $1 \times 1$ point-wise convolution. For an $M \times N$ kernel, the parameter size is then reduced from $C_{in} C_{out} M N$ to $C_{in}(MN + C_{out})$. The number of floating point operations (flops) is also largely reduced, from $C_{in} C_{out} M N H W$ to $C_{in}(MN + C_{out}) H W$. However, we observe that this approximation leads to significant performance degradation compared to the original dilated convolution.
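The size of this reduction can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the example sizes (32 channels, a 3×3 kernel, a 64×128 feature map) are assumptions, biases are ignored, and one multiply-accumulate per kernel weight per output pixel is counted as one flop.

```python
def dilated_conv_cost(c_in, c_out, m, n, h, w):
    """Parameters and flops of a standard (dilated) convolution.

    Dilation spreads the kernel taps apart but does not change their number,
    so the dilate ratio does not appear in the count.
    """
    params = c_in * c_out * m * n
    return params, params * h * w


def ds_dilated_conv_cost(c_in, c_out, m, n, h, w):
    """Depth-wise dilated convolution followed by a 1x1 point-wise convolution."""
    params = c_in * m * n + c_in * c_out
    return params, params * h * w


# Hypothetical example sizes, chosen only for illustration.
std_params, std_flops = dilated_conv_cost(32, 32, 3, 3, 64, 128)
ds_params, ds_flops = ds_dilated_conv_cost(32, 32, 3, 3, 64, 128)
ratio = ds_params / std_params
print(std_params, ds_params, round(ratio, 3))
```

With these sizes the depth-wise separable version needs roughly one seventh of the parameters, which is why it is attractive despite the accuracy issue discussed next.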

We should also note that the dilated convolution has inherent risks, as shown in Figure 2. First, local information can be missing. Since the dilated convolutional operation samples the input at spatially discrete positions determined by the dilate ratio, information loss is inevitable. Second, an irrelevant area at a large distance can affect the segmentation result. Dilated convolution includes areas far away from the target area when calculating the convolution, which impairs local consistency as well. For example, the target area in the red square is misclassified due to the green squares beneath it, as in Figure 2(b).
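The first risk can be made concrete by enumerating which input positions can influence one output pixel after stacked dilated convolutions. This is a minimal sketch under stated assumptions: 3×3 kernels, the same dilate ratio in every layer, and hypothetical function names.

```python
from itertools import product


def reachable_offsets(dilation, layers=2, kernel=3):
    """Input offsets (dy, dx) that influence one output pixel."""
    half = kernel // 2
    taps = [dilation * k for k in range(-half, half + 1)]
    offsets_1d = {0}
    for _ in range(layers):
        # Each layer shifts every reachable offset by one of its tap positions.
        offsets_1d = {o + t for o in offsets_1d for t in taps}
    # A square kernel reaches the Cartesian product of its 1-D offsets.
    return set(product(offsets_1d, repeat=2))


def coverage(dilation, layers=2, kernel=3):
    """Return (number of pixels used, number of pixels in the receptive field)."""
    used = reachable_offsets(dilation, layers, kernel)
    span = 2 * layers * dilation * (kernel // 2) + 1
    return len(used), span * span


print(coverage(1))  # ordinary convolution: every pixel in the window is used
print(coverage(2))  # dilation 2: most pixels in the window are skipped
```

For two layers with dilation 2, only 25 of the 81 pixels in the 9×9 receptive field contribute, which corresponds to the unused white pixels in Figure 2(a).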

Also, unlike standard convolution, a ds-Conv operates independently on each channel, and hence the spatial information of other channels does not flow directly. From the observation in Figure 1 (d), we conjecture that the loss of cross-channel information triggers the mentioned risks of dilated convolution and degrades the performance.

3.2 Concentrated-Comprehensive Convolution

We propose the concentrated-comprehensive convolution (CCC) block to compensate for the segmentation performance degradation, based on the observations in the previous section. The CCC block consists of an information concentration stage and a comprehensive convolution stage. The information concentration stage aggregates local feature information by using a simple convolutional kernel. The comprehensive convolution stage uses dilation to obtain a large receptive field, followed by a point-wise convolution that mixes the channel information. Note that we apply the depth-wise convolution to the dilated convolution to further reduce the parameter size.

As in Figure 2(a), the feature information of the white pixels in the $(2d-1) \times (2d-1)$ region centered on the red pixel is lost when executing a dilated convolution with dilation $d$. The information concentration stage alleviates this loss by executing a simple depth-wise convolution before the dilated convolution, which compresses the skipped feature information as shown in Figure 3B.
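A toy single-channel example illustrates the effect. The sketch below is not the authors' implementation: it assumes all-ones kernels in place of learned weights, a concentration window of size 2d-1, zero padding, and hypothetical function names. An impulse is placed at a position that the dilated taps skip, and the response at the center is compared with and without the concentration stage.

```python
def dilated_conv3x3(img, d):
    """3x3 all-ones depth-wise convolution with dilation d and zero padding."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-d, 0, d):
                for dx in (-d, 0, d):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += img[yy][xx]
    return out


def concentrate(img, d):
    """All-ones (2d-1)x(2d-1) window applied as a column pass, then a row pass."""
    h, w, r = len(img), len(img[0]), d - 1
    cols = [[sum(img[yy][x] for yy in range(max(0, y - r), min(h, y + r + 1)))
             for x in range(w)] for y in range(h)]
    return [[sum(cols[y][xx] for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)] for y in range(h)]


d, size = 2, 9
center = size // 2
impulse = [[0.0] * size for _ in range(size)]
impulse[center + 1][center + 1] = 1.0  # a pixel that the dilated taps skip

without_stage = dilated_conv3x3(impulse, d)[center][center]
with_stage = dilated_conv3x3(concentrate(impulse, d), d)[center][center]
print(without_stage, with_stage)
```

Without the concentration stage the center output is unaffected by the impulse; with it, the impulse is first spread onto neighboring tap positions and therefore reaches the output.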

We note that as the dilate ratio increases, the depth-wise convolution in the information concentration stage becomes extremely inefficient. In most cases, the dilate ratio is up to 16, so it becomes intractable in an embedded system. Specifically, when we apply regular $(2d-1) \times (2d-1)$ depth-wise convolutions to a feature map with $C$ channels, the number of parameters is $C(2d-1)^2$. When the size of the output feature map is $H \times W$, the number of flops becomes $C(2d-1)^2 HW$. However, if the convolution kernel is separable, it can be decomposed into the product of a column kernel and a row kernel, $K = k_v k_h^{\top}$. For an $N \times N$ kernel, this separable convolution reduces the computational complexity per pixel from $O(N^2)$ to $O(N)$. We solve the complexity problem by using two depth-wise asymmetric convolutions instead of a regular depth-wise convolution, as shown in Figure 3C.
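The gap between the two options grows quickly with the dilate ratio. The following sketch tabulates it under stated assumptions: a concentration kernel of size k = 2d-1, 32 channels, a 64×128 feature map, no biases, and a hypothetical function name.

```python
def concentration_cost(channels, dilation, height, width):
    """Compare a regular k x k depth-wise kernel with a k x 1 plus 1 x k pair."""
    k = 2 * dilation - 1  # assumed window size of the concentration stage
    pixels = height * width
    regular_params = channels * k * k
    asym_params = channels * 2 * k
    return {
        "kernel": k,
        "regular_params": regular_params,
        "asym_params": asym_params,
        "regular_flops": regular_params * pixels,
        "asym_flops": asym_params * pixels,
    }


for d in (2, 4, 8, 16):
    cost = concentration_cost(channels=32, dilation=d, height=64, width=128)
    print(d, cost["kernel"], cost["regular_params"], cost["asym_params"])
```

At dilation 16 the regular kernel has 31×31 weights per channel while the asymmetric pair has 62, so the saving is largest exactly where the regular depth-wise convolution is least affordable.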

The comprehensive convolution stage uses a depth-wise dilated convolution to widen the receptive field for global consistency. After that, we execute the cross-channel operation with a 1×1 point-wise convolution. Since the CCC block shown in Figure 3C is a kind of ds-Conv, only a small number of additional parameters and computations are required. At the same time, the comprehensive convolution stage gives the CCC block a receptive field large enough for segmentation. In summary, the CCC block combines the advantages of the depth-wise separable convolution and the dilated convolution by integrating local and global information properly. Therefore, although a segmentation network based on the CCC block has fewer parameters and lower computational complexity than the original network, the proposed block can achieve good segmentation performance.

(a) CCC module

(b) ESP module

Figure 4: Network structure of CCC and ESP module. A is reducing channel of feature. B is a parallel structure of dilated convolutions. C is hierarchical feature fusion (HFF). D is skip-connection.

3.3 Segmentation Network with CCC Module

In this section, we describe our network design with CCC blocks. Figure 4(a) shows our proposed CCC module with a parallel structure, which is modified from the ESP module [18] shown in Figure 4(b). The original ESP module has a parallel structure of a spatial pyramid of $K$ dilated convolutions, each with its own dilate ratio. It first reduces the number of channels in the input feature map by a factor of $K$ and then applies each dilated convolution to the reduced feature map. In addition, the outputs of the dilated convolutions are element-wise summed hierarchically.

Operation   ESP module              CCC module
            Param      Flop(M)      Param      Flop(M)
A           3,200      104.9        4,096      134.2
B           28,800     943.7        9,472      299.9
C           -          0.066        -          -
D           -          0.016        -          0.016
Total       31,325     1,048.8      13,568     434.13
Table 1: Comparison of the number of parameters and flops between the ESP and CCC modules for a fixed input and output feature map size. A-D are the operations in Figure 4.
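The per-operation parameter counts for A and B in Table 1 can be reproduced with simple arithmetic. The sketch below is one set of assumptions that happens to match those rows, not the authors' stated accounting: 128 input and output channels, 3×3 dilated kernels, an asymmetric kernel length of 2d-1, two parameters per channel for each batch normalization, one parameter per channel for PReLU, and no biases. All function names are hypothetical.

```python
def esp_module_params(channels=128, dilations=(1, 2, 4, 8, 16)):
    k = len(dilations)
    reduced = channels // k
    # Assumption: the first branch absorbs the remainder so outputs sum to `channels`.
    widths = [channels - (k - 1) * reduced] + [reduced] * (k - 1)
    reduce_params = channels * reduced                    # operation A: 1x1 convolution
    branch_params = sum(9 * reduced * w for w in widths)  # operation B: 3x3 dilated convolutions
    return reduce_params, branch_params


def ccc_module_params(channels=128, dilations=(2, 4, 8, 16)):
    k = len(dilations)
    reduced = channels // k
    reduce_params = channels * reduced                    # operation A: 1x1 convolution
    branch_params = 0
    for d in dilations:
        size = 2 * d - 1                                  # assumed asymmetric kernel length
        asym = 2 * size * reduced                         # size x 1 and 1 x size depth-wise kernels
        norm_act = 2 * reduced + reduced                  # batch norm + PReLU between the two kernels
        stage_norm = 2 * reduced                          # batch norm after the concentration stage
        dilated = 9 * reduced                             # 3x3 depth-wise dilated kernel
        pointwise = reduced * reduced                     # 1x1 point-wise convolution
        branch_params += asym + norm_act + stage_norm + dilated + pointwise
    return reduce_params, branch_params


print(esp_module_params())  # rows A and B of the ESP module
print(ccc_module_params())  # rows A and B of the CCC module
```

Under these assumptions, the CCC rows sum to the 13,568 parameters listed as the CCC total, and most of the saving comes from operation B.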

Likewise, when the number of CCC blocks is $K$, we first reduce the number of channels by a factor of $K$ and then apply $K$ parallel CCC blocks to the reduced feature map. Unlike other studies [13, 18, 27], we simply concatenate the feature maps without additional post-processing. Furthermore, ESPnet sets $K = 5$, including a convolution with a dilate ratio of 1, but we exclude it to reduce computation. This exclusion is reasonable because, in our CCC block, the information concentration stage gathers neighboring information. Therefore, we set $K = 4$ to view multiple levels of spatial information. When the largest dilate ratio is the same, the dilated $3 \times 3$ kernel covers the same receptive field as in the ESP module. The full comparison between the ESP module and our module is shown in Table 1.

Figure 5: Performance comparison of the segmentation models. Note that the flops of all methods are calculated at a common image size, except for ContextNet and Segnet, whose flops are calculated at their respective test image sizes. Our method achieved the best accuracy with the lowest complexity. (See the red and blue dots.)
Method Extra dataset Param(M) Flop(G) Class mIOU
FCN-8S (Long, Shelhamer, and Darrell 2015) [12] ImageNet 134.5 970.15 65.3
Segnet (Badrinarayanan, Kendall, and Cipolla 2015) [1] ImageNet 29.5 163.1 57.0
DeepLabV3+ (Chen et al. 2018) [4] ImageNet 41.06 280.08 82.1
DenseASPP121 (Yang et al. 2018) [13] ImageNet 8.28 155.82 76.2
DRN-A50 (Yu, Koltun, and Funkhouser 2017) [27] ImageNet 23.55 400.15 67.3
DRN-C26 (Yu, Koltun, and Funkhouser 2017) [27] ImageNet 20.62 355.18 68.0
CCC-DRN-A50 ImageNet 16.79 289.28 67.12
CCC-DRN-C26 ImageNet 7.34 137.44 67.6
Enet (Paszke et al. 2016) [14] no 0.364 8.52 58.3
Skip-Shuffle (Siam et al. 2018) [20] no 1.0 6.22 58.3
Skip-Mobile (Siam et al. 2018) [20] no 3.37 15.38 62.4
ContextNet-cn12 (Poudel et al. 2018) [15] no 0.85 32.68 68.7
ERFnet (Romera et al. 2017) [16] no 2.1 53.48 68.0
CCC-ERFnet no 1.45 43.34 69.01
ESPnet (Mehta et al. 2018) [18] no 0.364 9.67 60.3
ESPnet-tiny (Mehta et al. 2018) [18] no 0.202 6.48 56.3
CCC1 (d=2, 4, 8, 16) no 0.198 6.45 60.98
CCC2 (d=2, 3, 7, 13) no 0.192 6.29 61.96
CCC-ESPnet no 0.21 6.75 61.06
Table 2: The performance of the existing approaches is taken from the results reported in their original papers on the Cityscapes benchmark. Flops were calculated at a common resolution, except for ContextNet and Segnet.

As in ESPnet, the overall network adopts the feature re-sampling method that concatenates the feature map with a down-sampled version of the original input image. For our encoder, we adopted a structure similar to that of the encoder in ESPnet. Since our module is more efficient than the ESP module, the number of parameters is significantly reduced. The decoder concatenates the feature maps from each encoder stage to recover the original size, and the ESP module in the decoder is also replaced with our CCC module.

We note that our CCC block can be adapted not only to ESPnet but also to other deep networks based on dilated convolution to make them more efficient. By changing the dilated convolutions in the segmentation models to the proposed CCC block, we can reduce the number of parameters and the computational cost while maintaining or improving the segmentation performance of the original network. Detailed results are provided in Section 4.


HFF Dilate Ratio Param(M) Flop(G) mIOU
(1) Baseline ESPnet O 0.364 9.07 60.74
(2) Depth-wise separable ESPnet O 0.128 4.93 57.81
(3) w/o Concentrated-stage (Structure A) X 0.152 5.44 54.24
(4) w/o Concentrated-stage++ (Structure A) X 0.187 6.26 55.22
(5) with regular dw-Conv Concentration stage (Structure B) X 0.580 15.68 58.56
(6) Our proposed CCC with dw-Asym (structure C) X 0.194 6.40 59.57
(7) Our proposed CCC with dw-AsymNL (structure C) X 0.198 6.45 60.98
Table 3: CCC: Concentrated-Comprehensive Convolution. ++: the number of layers in the encoder is increased to obtain similar parameters and flops. dw-Conv: depth-wise convolution. dw-Asym: depth-wise asymmetric convolution. dw-AsymNL: batch normalization and PReLU are inserted between the depth-wise asymmetric convolutions. All results are reproduced in the PyTorch framework under the same data augmentation and settings. Flops are calculated for a fixed input size.

4 Experiment

We evaluated the proposed method and performed an ablation study on the Cityscapes dataset [6], which is widely used in semantic segmentation. Performance was measured using mean intersection over union (mIOU), the number of parameters, and flops. To show the flexibility of our CCC block, we also performed a classification task on the ImageNet dataset [17] using the CCC block.
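For reference, mIOU averages the per-class ratio of correctly labeled pixels to the union of predicted and ground-truth pixels. The sketch below computes it from a confusion matrix; the function name and the two-class example values are hypothetical, and the official Cityscapes evaluation scripts may differ in details such as ignored labels.

```python
def mean_iou(confusion):
    """confusion[i][j] = number of pixels of true class i predicted as class j."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical two-class confusion matrix, for illustration only.
confusion = [[50, 10],
             [5, 35]]
score = mean_iou(confusion)
print(round(score, 4))
```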

4.1 Evaluation on Cityscapes

Experimental Setting: The Cityscapes dataset consists of multiple classes of urban street scenes. The numbers of training and validation images are 2,975 and 500, respectively. To train the model, we followed the standard data augmentation strategy [6], including random scaling, cropping and flipping. The learning rate was multiplied by a fixed factor at epochs 150 and 200 (300 epochs in total). The Adam optimizer [11] with momentum and weight decay was used for training. The other experiments (CCC-ERFnet and CCC-DRN) followed the settings in the respective original papers [16, 27].

Experimental Result on Cityscapes Dataset: Table 2 and Figure 5 show the evaluation results of the baseline segmentation networks and of the models obtained by switching their dilated convolutional blocks to the proposed CCC block. Both CCC1 and CCC2 use ESPnet as a baseline network with different dilation ratios $d$, which are (2, 4, 8, 16) and (2, 3, 7, 13), respectively. We note that the HFF degridding module used in the original ESP module is not used in CCC1 and CCC2. CCC-ESPnet used all the original settings, including the HFF module and a convolution with a dilate ratio of 1, but its mIOU was not much different from those of CCC1 and CCC2. The results show that CCC2 outperforms CCC1 by about 1% mIOU (61.96 vs. 60.98) with fewer parameters. This supports the proposition of the previous work [25] that the dilation ratios should be coprime.

The ESPnet-tiny model is a smaller version of ESPnet from the original paper, with a number of parameters similar to that of the proposed CCC1 and CCC2. The results show that both CCC1 and CCC2 outperform ESPnet-tiny by significant margins: 4.68% for CCC1 and 5.66% for CCC2. On an Nvidia-TX2 board, both CCC1 and CCC2 run fast enough to be regarded as real-time methods.

To show the general applicability of our CCC block, we conducted further experiments with other segmentation models based on dilated convolution: ERFnet, DRN-A50, and DRN-C26. For every model, our method achieved comparable or higher performance with fewer parameters and flops. In particular, CCC-ERFnet showed more than 1% performance enhancement with about 30% fewer parameters. CCC-DRN-A50 reduced the number of parameters and flops by about 28% from the baseline DRN-A50, with a slight performance drop (0.18%). In the CCC-DRN-C26 case, the number of parameters and flops were reduced by more than 60% from DRN-C26, but the performance drop was only 0.4%. The overall results show that the proposed CCC block can be incorporated into diverse segmentation models that use dilated convolution blocks without much performance degradation.
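The relative savings discussed here follow directly from Table 2. The short sketch below recomputes them; the values are copied from Table 2, while the dictionary layout and function name are only illustrative.

```python
# (Param(M), Flop(G), mIOU) copied from Table 2
TABLE2 = {
    "ERFnet": (2.1, 53.48, 68.0),
    "CCC-ERFnet": (1.45, 43.34, 69.01),
    "DRN-A50": (23.55, 400.15, 67.3),
    "CCC-DRN-A50": (16.79, 289.28, 67.12),
    "DRN-C26": (20.62, 355.18, 68.0),
    "CCC-DRN-C26": (7.34, 137.44, 67.6),
}


def compare(baseline, ours):
    """Percent reduction in parameters and flops, and the change in mIOU points."""
    bp, bf, bm = TABLE2[baseline]
    op, of, om = TABLE2[ours]
    return 100 * (bp - op) / bp, 100 * (bf - of) / bf, om - bm


for base in ("ERFnet", "DRN-A50", "DRN-C26"):
    params, flops, miou = compare(base, "CCC-" + base)
    print(f"{base}: params -{params:.1f}%, flops -{flops:.1f}%, mIOU {miou:+.2f}")
```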

(a) Img and GT (b) structure-A (c) structure-B (d) structure-C (e) structure-C with NL
Figure 6: The visualized feature maps and segmentation results for the structures in Figure 3. (d) and (e) are our proposed CCC structure. (e) improves the non-linearity of (d) by using batch normalization and PReLU. From (d) to (e), unnecessarily activated parts are suppressed and the consistency of the segmentation result is improved. Also, the less activated part is enhanced, so the car is completely segmented.

Ablation Study on CCC: In this section, we show ablation results of the proposed CCC block based on Figure 3. The experiments use the ESPnet-based network introduced in Section 3. Experiment (1) is the original ESPnet with dilate ratios (1, 2, 4, 8, 16). Experiment (2) uses the same network structure as the original ESPnet, but the standard dilated convolutions are replaced with depth-wise separable dilated convolutions (ds-Dilate). Experiments (3)-(7) are ablation studies of the proposed method and use dilate ratios (2, 4, 8, 16). In these experiments, batch normalization was added between the concentration stage and the comprehensive convolution stage, and the HFF degridding module was not used. Experiment (3) shows the result with ds-Dilate only, without the information concentration stage. Experiment (4) is the same ablation as Experiment (3) with a larger number of layers, so that its parameter size and flops are similar to those of the proposed method (Experiment (6)). Experiment (5) is structure B in Figure 3, in which the information concentration stage consists of a regular depth-wise convolution (dw-Conv). Experiment (6) is the proposed CCC block (structure C) in Figure 3. This model reduces the number of parameters and flops by factorizing the dw-Conv in Experiment (5) into depth-wise asymmetric convolutions (dw-Asym). Experiment (7) inserts batch normalization and PReLU between the two dw-Asym of Experiment (6). ESPnet uses PReLU as its activation function, so we also use PReLU instead of ReLU between the dw-Asym.

As shown in Table 3 (2)-(4), a naive use of the depth-wise separable architecture brought significant degradation of performance (about 3%), and the HFF module could not fully resolve it. From Experiments (3) and (4), we can conclude that the information concentration stage is critical for resolving the accuracy drop from ds-Dilate, regardless of the number of parameters. Experiment (5) shows that we can simply use a depth-wise convolution at the information concentration stage, and that this achieves better performance than the depth-wise separable model using HFF (Experiment (2)). However, the number of parameters is quite large in this case. In Experiment (6), the proposed model drastically reduces the number of parameters and flops and achieves better performance than Experiment (5) as well. From Experiment (7), we found that adding non-linearity between the asymmetric convolutions further improves the performance by about 1.4% over Experiment (6).

Figure 6 visualizes the output feature maps of the various structures, obtained by averaging the feature maps over channels. The feature maps of structure-A are not clear and show a serious grid effect, as shown in Figure 6 (b). In contrast, adding the information concentration stage improves the quality of the feature maps (see (c)-(e)). Note that the artifacts in structure-C (Figure 6(d)) are almost removed by adding non-linearity between the asymmetric convolutions (Figure 6(e)).

4.2 Evaluation on Classification

Model Param Flop(G) Top1 Top5
DRN A-34 21.8 17.33 75.19 91.26
DRN A-50 25.6 19.15 77.06 93.43
DRN C-26 21.1 17.0 75.14 92.45
DRN C-42 31.2 25.1 77.06 93.43
CCC-DRN A-50 18.8 13.85 76.51 93.02
CCC DRN C-26 7.85 6.59 73.43 91.33
CCC DRN C-28* 13.11 10.72 74.39 91.99
CCC DRN C-42 9.63 8.17 74.96 92.31
CCC DRN C-44* 14.90 12.3 75.64 92.63
Table 4: ImageNet classification accuracy of DRN and our CCC-DRN. * denotes a model with more residual blocks than the base model in the row directly above.

Experimental Setting: The evaluation was performed on the ImageNet-1k [17] dataset. Training used SGD with momentum 0.9 and weight decay. The learning rate was reduced by a factor of 10 every 30 epochs. Training ran for 120 epochs with the same data augmentation method as DRN [27].

(a) Image (b) DRN C-26 (c) CCC DRN C-44
Figure 7: The visualized feature map and segmentation result according to ImageNet classification

ImageNet 1k Benchmark Results: From the experiments on the segmentation task, we have shown that the proposed CCC block can successfully substitute for the dilated convolutional block. Going one step beyond segmentation, we applied the proposed block to classification networks that combine dilated convolution with skip connections (DRN) [27]. The results in Table 4 show that the proposed method achieves comparable classification accuracy, considering that the parameter size was reduced.

For example, CCC-DRN-C-42 reduced the parameters and flops by more than half compared to DRN-C-26, while the top-1 accuracy dropped by only 0.18%. CCC-DRN-C-44* also reduced the number of parameters by about 30% and the number of flops by about 28% relative to DRN-C-26, yet achieved better accuracy. According to [27], a low-resolution feature map degrades classification accuracy and localization power. DRN showed that adding dilated convolution strengthens the localization power and hence can enhance the classification accuracy without adding further parameters. By changing the dilated convolutional parts in DRN to the proposed CCC block, we tested whether our method can substitute for the dilated convolution in this case as well.

We visualized the dense pixel-level class activation maps from DRN and our CCC-DRN in Figure 7 to show each model's localization power. From the figure, we can see that the proposed method generates activation maps with better resolution than DRN, which means that our method has localization capacity comparable to that of DRN.

5 Conclusion

In this work, we proposed the Concentrated-Comprehensive Convolution (CCC) block for lightweight semantic segmentation. Our CCC block consists of the information concentration stage and the comprehensive convolution stage, each of which is built from two consecutive convolutions. More specifically, the former improves local consistency by using two depth-wise asymmetric convolutions, and the latter increases the receptive field by using a depth-wise separable dilated convolution. Our extensive experiments show that the proposed method integrates local and global information effectively and reduces the number of parameters and flops significantly. Furthermore, the CCC block is generally applicable to other semantic segmentation models and can be adopted for other tasks such as classification.

Layer Operation Flop
Average Pooling
Bilinear upsampling
Batch normalization
Table 5: The detailed method for calculating flops


6 Appendix

6.1 Flops Calculation

Table 5 shows how we calculated flops for each operation. The following notations are used.
: A input feature map
: A output feature map
: A convolution kernel
: A height of convolution kernel
: A width of convolution kernel
: A height of input feature map
: A width of input feature map
: A input channel dimension of feature map or kernel
: A output channel dimension of feature map or kernel
: A group size for channel dimension
: A height of output feature map
: A width of output feature map

: A non-linear activation function