1 Introduction
Figure 1: (a) Input image, (b) ground truth, (c) result of ESPnet (0.364M params), (d) result of dsDilate (0.187M params), (e) result of our CCC block (0.198M params).
Deep network-based semantic segmentation algorithms have largely enhanced performance, but suffer from heavy computation. This is basically because the segmentation task involves pixel-wise classification and because many recent models are designed to have a large receptive field. To resolve this problem, lightweight segmentation models have been actively studied recently.
Recent segmentation studies show that dilated convolution, which expands the receptive field without increasing the number of parameters, can be an important technique for achieving this goal. Using dilated convolution, many studies have made efforts to increase speed [16, 14, 15, 20]. Among them, ESPnet [18] runs fast by adopting a well-designed series of spatial pyramid dilated convolutions.
To further reduce the number of parameters and the amount of computation in dilated convolution, combining depthwise (channel-wise) separable convolutions with dilated convolutions is a popular approach [5, 9]. However, this combination commonly degrades performance, as shown in Figure 1.
Figure 1 (c) and (e) are the results of ESPnet-based networks with the same encoder-decoder structure, while (d) increases the number of layers in the encoder for complexity similar to (e). They all use different convolution blocks. Figure 1 (d) is obtained using simple depthwise separable dilated convolution blocks, which show a severe grid effect. On the other hand, our proposed method greatly improves the segmentation quality without any grid effect, as shown in Figure 1 (e).
In this paper, we propose a new lightweight segmentation model that achieves real-time execution on an embedded board (Nvidia TX2) as well as accuracy comparable to state-of-the-art lightweight segmentation algorithms. To achieve this goal, we describe a new concentrated-comprehensive convolution (CCC) block composed of two depthwise asymmetric convolutions and a depthwise dilated convolution, which can replace any normal dilated convolution operation. Our CCC block strikes a balance between local and global consistency without requiring enormous complexity. Moreover, our method retains the advantages of both the depthwise separable convolution and the dilated convolution.
Our segmentation model based on CCC blocks reduces the number of parameters by almost half and the computational cost in flops by 35% compared to the original ESPnet. The proposed CCC block can substitute for the dilated convolutions in arbitrary segmentation networks, reducing complexity while keeping comparable performance. Qualitatively, we show that the proposed convolutional block alleviates the grid effect, which is reported to be one of the main reasons for performance degradation in segmentation. Furthermore, we show that our method can also be applied flexibly to other tasks such as classification.
2 Related Work
As deep convolutional networks require many convolution operations, there are restrictions on using them in real-time applications. To resolve this problem, convolution factorization methods have been applied.
Convolution Factorization: Convolution factorization divides a convolution operation into several stages to reduce the computational complexity of regular convolutions. Inception [22, 23, 21] used 1×1 convolutions to reduce the number of channels. To decrease the amount of computation further, a convolution with a large kernel is divided into a combination of small convolutions while keeping the receptive field constant. Xception [5] and MobileNet [9] use the depthwise separable convolution (dsConv), which performs spatial and cross-channel operations separately. The dsConv greatly reduces the amount of computation without a large drop in performance. MobileNetV2 [19] further develops MobileNet [9] by applying an inverted residual block. In addition, ResNeXt [26], which has ResNet [7] as its basic structure, has shown good performance by applying group convolution on certain channels. ShuffleNet [28] applied the pointwise group convolution, which performs a 1×1 convolution on specific channels only rather than on all channels, to reduce computation.
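As a quick illustration of the factorization argument, the receptive field of a stack of stride-1 convolutions grows by k − 1 per layer, so two 3×3 layers cover the same region as one 5×5 layer with fewer weights (a minimal sketch, not tied to any specific network in this paper):

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of sequential stride-1 convolutions.

    Each k x k layer grows the receptive field by (k - 1)."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

# Two 3x3 convolutions cover the same 5x5 region as one 5x5 kernel,
# with 2 * 3 * 3 = 18 weights per channel pair instead of 25.
print(stacked_receptive_field([3, 3]))  # 5
print(stacked_receptive_field([5]))     # 5
```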
Dilated Convolution: Dilated convolution [8] inserts holes between consecutive kernel pixels to obtain information at a wider scale. It is widely used in segmentation because it provides a large receptive field without increasing the amount of computation or the number of parameters. In DeepLabV2 and DeepLabV3 [2, 3], atrous spatial pyramid pooling (ASPP) was introduced, using a pyramid of dilated convolutions to handle multi-scale information. DenseASPP [13] repeats the concatenation of a set of differently dilated convolutional layers to generate a denser multi-scale feature representation. DRN [27] showed good performance in segmentation and classification compared to ResNet [7] by adding dilated convolutions to ResNet [7] and modifying the residual connection.
Recent research has started to combine dilated convolution with dsConv. In machine translation, [10] proposed a super-separable convolution that divides the depth dimension of a tensor into several groups and performs dsConv on each. However, they did not address the performance degradation caused by combining the dilated convolution and the dsConv. DeepLabV3+ [4] applies the dsConv to the ASPP [2] and a decoder module based on an Xception model [5].
Lightweight Segmentation: Enet [14] is the first architecture designed for real-time segmentation. Since then, ESPnet [18] has led to further improvements in both speed and performance. It introduced a lightweight and accurate network by applying a pointwise convolution to reduce the number of channels and then using a spatial pyramid of dilated convolutions. RTSeg [20] proposed a meta-architecture that assembles several existing networks into an encoder-decoder structure. ERFNet [16] used residual connections and factorized dilated convolutions. The authors of [24] proposed a new training method that gradually changes a dense convolution into a grouped convolution; applied to ERFNet [16], it led to a considerable improvement in speed. ContextNet [15] designed two network branches for global context and detail information.
Previously, segmentation studies focused on accuracy, but research on lightweight segmentation has become more active. Most models use dilated convolution and dsConv. However, a naive combination of the two induces serious degradation in accuracy. Our method resolves this problem by replacing the dilated convolutions with the newly proposed concentrated-comprehensive convolution (CCC) blocks. We achieved real-time segmentation on an embedded board without performance degradation because the CCC block retains all the advantages of the dilated convolution and the dsConv. The proposed CCC is not restricted to a specific model but is flexible enough to be applied to other models based on dilated convolutions.
3 Method
Combining dilated convolution and depthwise separable convolution sometimes degrades accuracy because neighboring information is skipped. To resolve this problem, we propose a concentrated-comprehensive convolution (CCC) block that maintains the segmentation performance without immense complexity. Figure 3 shows the detailed settings of the CCC block, and Figure 4 describes the overall structure of a CCC-ESP module based on ESPnet [18]. We analyze the reason for the disadvantage of the depthwise separable dilated (dsDilated) convolution in Section 3.1. Section 3.2 explains how the CCC block solves the problems of Section 3.1 with asymmetric convolutions. We extend the CCC block into a segmentation network based on ESPnet in Section 3.3.
3.1 Issues on Depthwise Separable Dilated Convolution
Figure 2: (a) Local information missing in dilated convolution. (b) Noise from an irrelevant area in dilated convolution.
Dilated convolution is an effective variant of the traditional convolution operator for semantic segmentation in that it creates a large receptive field without decreasing the resolution of the feature map. However, dilated convolution still incurs heavy computational costs that prohibit its use in a lightweight model. Therefore, we adopt the key idea of the depthwise separable convolution (dsConv). Specifically, a dilated convolutional layer applies a convolution to an input feature map $I \in \mathbb{R}^{C_{in} \times H \times W}$ using a kernel $K \in \mathbb{R}^{C_{out} \times C_{in} \times K_h \times K_w}$ with a dilate ratio $r$. The terms $C_{in}$ and $C_{out}$, $H$ and $W$, and $K_h$ and $K_w$ are the numbers of input and output channels, the height and the width of the feature map, and the height and the width of the kernel, respectively. Using the kernel $K$, we compute the output feature map $O$ with a dilate ratio $r$ as
$$O(c_o, h, w) = \sum_{c=1}^{C_{in}} \sum_{i,j} K(c_o, c, i, j)\, I(c,\, h + r i,\, w + r j). \qquad (1)$$
Then, we can apply the depthwise convolution to the dilated convolution as
$$\hat{O}(c, h, w) = \sum_{i,j} K_d(c, i, j)\, I(c,\, h + r i,\, w + r j), \qquad O(c_o, h, w) = \sum_{c=1}^{C_{in}} K_p(c_o, c)\, \hat{O}(c, h, w). \qquad (2)$$
Note that the depthwise step does not perform cross-channel multiplication, but only spatial multiplication. Hence, the kernel $K_d \in \mathbb{R}^{C_{in} \times K_h \times K_w}$ is for the depthwise dilated convolution in the spatial dimension, and $K_p \in \mathbb{R}^{C_{out} \times C_{in}}$ denotes a kernel for a $1 \times 1$ pointwise convolution. Then, the parameter size is reduced from $C_{in} C_{out} k^2$ to $C_{in} k^2 + C_{in} C_{out}$ when the kernel size is $k \times k$, i.e., $K_h = K_w = k$. The number of floating point operations (flops) is also largely reduced from $H W C_{in} C_{out} k^2$ to $H W (C_{in} k^2 + C_{in} C_{out})$. However, we observe that this approximation leads to significant performance degradation compared to the original dilated convolution.
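Under a multiply-accumulate counting convention (an assumption about the convention; the exact per-operation counts are given in the appendix), the savings of the depthwise separable dilated convolution can be checked with a small script:

```python
def dilated_conv_cost(c_in, c_out, k, h, w):
    """Parameters and flops (multiply-accumulates) of a standard
    k x k dilated convolution; dilation changes neither count."""
    params = c_in * c_out * k * k
    flops = params * h * w
    return params, flops

def ds_dilated_conv_cost(c_in, c_out, k, h, w):
    """Depthwise dilated k x k conv followed by a 1x1 pointwise conv."""
    params = c_in * k * k + c_in * c_out
    flops = params * h * w
    return params, flops

# Illustrative sizes: 64 -> 64 channels, 3x3 kernel, 128x128 feature map.
p1, f1 = dilated_conv_cost(64, 64, 3, 128, 128)
p2, f2 = ds_dilated_conv_cost(64, 64, 3, 128, 128)
print(p1, p2)             # 36864 4672
print(round(p1 / p2, 1))  # roughly 7.9x fewer parameters (and flops)
```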
We should also note that the dilated convolution has inherent risks, as shown in Figure 2. First, local information can be missing: since the dilated convolutional operation is spatially discrete according to the dilate ratio, information loss is inevitable. Second, an irrelevant area at a large distance can affect the segmentation result: dilated convolution includes areas far from the target area in the convolution, which impairs local consistency. For example, the target area in the red square is misclassified due to the green squares beneath it, as in Figure 2 (b).
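To quantify the first risk, a 3×3 kernel with dilate ratio r samples only 9 of the (2r + 1)² pixels inside its spatial extent, so the fraction of local information actually seen drops quickly as r grows (illustrative arithmetic only):

```python
def dilated_coverage(k=3, r=1):
    """Fraction of pixels inside a dilated kernel's spatial extent
    that the kernel actually samples."""
    extent = (r * (k - 1) + 1) ** 2   # window spanned by the kernel
    sampled = k * k                   # taps actually used
    return sampled / extent

# At r = 16 fewer than 1% of the covered window is sampled.
for r in (1, 2, 4, 8, 16):
    print(r, round(dilated_coverage(r=r), 4))
```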
Also, unlike standard convolution, a dsConv is independent among channels, so spatial information from other channels does not flow directly. From the observation in Figure 1 (d), we conjecture that the loss of cross-channel information triggers the aforementioned risks of dilated convolution and degrades performance.
3.2 ConcentratedComprehensive Convolution
We propose the concentrated-comprehensive convolution (CCC) block to compensate for the segmentation performance degradation described in the previous section. A CCC block consists of an information concentration stage and a comprehensive convolution stage. The information concentration stage aggregates local feature information using a simple convolutional kernel. The comprehensive convolution stage uses dilation to see a large receptive field, followed by a pointwise convolution that mixes channel information. Note that we apply depthwise convolution to the dilated convolution to further reduce the parameter size.
As in Figure 2 (a), the feature information of the white pixels in the region centered on the red pixel is lost when executing dilated convolution. The information concentration stage alleviates this loss by executing a simple depthwise convolution before the dilated convolution, which compresses the otherwise skipped feature information, as shown in Figure 3B.
We note that as the dilate ratio increases (and with it the kernel needed to cover the skipped region), the depthwise convolution in the information concentration stage becomes extremely inefficient. In most cases, the dilate ratio grows up to 16, so the stage becomes intractable on an embedded system. Specifically, when we apply a regular depthwise convolution with a $k \times k$ kernel to a feature map with $C$ channels, the number of parameters is $C k^2$. When the size of the output feature map is $H \times W$, the number of flops becomes $H W C k^2$. However, if the convolution kernel is separable, a $k \times k$ kernel can be decomposed into a $k \times 1$ kernel followed by a $1 \times k$ kernel. This separable convolution reduces the computational complexity per pixel from $k^2$ to $2k$. We solve the complexity problem by using two depthwise asymmetric convolutions instead of a regular depthwise convolution, as shown in Figure 3C.
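The savings from the asymmetric factorization can be verified with simple arithmetic (the kernel size 15 and channel count below are illustrative assumptions, not values from the paper):

```python
def dw_conv_cost(c, k, h, w):
    """Parameters and flops of a regular depthwise k x k convolution
    on c channels over an h x w output map."""
    params = c * k * k
    return params, params * h * w

def dw_asym_cost(c, k, h, w):
    """Two depthwise asymmetric convolutions: k x 1 followed by 1 x k."""
    params = 2 * c * k
    return params, params * h * w

# Illustrative: a large 15x15 concentration kernel on a 64-channel map.
p_full, f_full = dw_conv_cost(64, 15, 64, 64)
p_asym, f_asym = dw_asym_cost(64, 15, 64, 64)
print(p_full, p_asym)  # 14400 1920: per-pixel cost drops from k^2 to 2k
```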
The comprehensive convolution stage uses a depthwise dilated convolution to widen the receptive field for global consistency, and then performs the cross-channel operation with a $1 \times 1$ pointwise convolution. Since the CCC block shown in Figure 3C is a kind of dsConv, only a small number of additional parameters and computations are required. At the same time, the CCC block provides a sufficiently large receptive field for segmentation through the comprehensive convolution stage. In summary, the CCC block combines the advantages of the depthwise separable convolution and the dilated convolution by properly integrating local and global information. Therefore, although a segmentation network based on CCC blocks has fewer parameters and lower computational complexity than the original network, it achieves sufficient segmentation performance.
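To make the two stages concrete, here is a minimal NumPy sketch of a CCC-style block; the channel count, kernel shapes, and the 3×3 dilated kernel are illustrative assumptions, not the exact trained configuration. Zero padding keeps the "same" resolution, consistent with dilated convolution not reducing the feature-map size:

```python
import numpy as np

def depthwise_conv(x, kernel, dilation=1):
    """Depthwise conv with 'same' zero padding.
    x: (C, H, W); kernel: (C, kh, kw), odd kh and kw."""
    C, H, W = x.shape
    _, kh, kw = kernel.shape
    ph, pw = dilation * (kh - 1) // 2, dilation * (kw - 1) // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += kernel[:, i:i + 1, j:j + 1] * \
                   xp[:, i * dilation:i * dilation + H,
                         j * dilation:j * dilation + W]
    return out

def pointwise_conv(x, weight):
    """1x1 cross-channel convolution; weight: (C_out, C_in)."""
    return np.tensordot(weight, x, axes=([1], [0]))

def ccc_block(x, k_vert, k_horz, k_dilate, k_point, d):
    # Information concentration: two depthwise asymmetric convolutions.
    x = depthwise_conv(x, k_vert)              # k x 1
    x = depthwise_conv(x, k_horz)              # 1 x k
    # Comprehensive convolution: depthwise dilated conv + pointwise conv.
    x = depthwise_conv(x, k_dilate, dilation=d)
    return pointwise_conv(x, k_point)
```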
3.3 Segmentation Network with CCC Module
In this section, we describe our network design with CCC blocks. Figure 4 (a) shows our proposed CCC module with a parallel structure, modified from the ESP module [18] shown in Figure 4 (b). The original ESP module, which has a parallel spatial pyramid of dilated convolutions with different dilate ratios, first reduces the number of channels in the input feature map and then applies each dilated convolution to the reduced feature map. In addition, the outputs of the dilated convolutions are element-wise summed hierarchically.
Table 1: Comparison between the ESP module and the CCC module (per-block parameters and flops; missing entries are marked with a dash).

|       | ESP module |          | CCC module |          |
|       | Param      | Flop (M) | Param      | Flop (M) |
| A     | 3,200      | 104.9    | 4,096      | 134.2    |
| B     | 28,800     | 943.7    | 9,472      | 299.9    |
| C     | –          | 0.066    | –          | –        |
| D     | –          | 0.016    | –          | 0.016    |
| Total | 31,325     | 1,048.8  | 13,568     | 434.13   |
Likewise, we first reduce the number of channels and then apply the parallel CCC blocks to the reduced feature map. Unlike other studies [13, 18, 27], we simply concatenate the feature maps without additional post-processing. Furthermore, ESPnet includes a dilate ratio of 1, but we exclude it to reduce computation. This exclusion is reasonable because the information concentration stage of our CCC block already gathers neighboring information. Therefore, we set the dilate ratios to view multiple levels of spatial information. When the largest dilate ratio is $d$, the receptive field of the module is $(2d + 1) \times (2d + 1)$ with a $3 \times 3$ kernel, as in the case of the ESP module. The full comparison between the ESP module and our module is shown in Table 1.
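As a sanity check on the receptive-field figures, assuming 3×3 kernels as in the ESP module, each parallel branch's receptive field follows directly from its dilate ratio:

```python
def dilated_rf(k, r):
    """Receptive field of a single k x k convolution with dilate ratio r."""
    return r * (k - 1) + 1

# Receptive fields of the parallel 3x3 branches for d = 2, 4, 8, 16:
print([dilated_rf(3, d) for d in (2, 4, 8, 16)])  # [5, 9, 17, 33]
```

The largest branch (d = 16) yields 33 = 2·16 + 1, matching the (2d + 1) × (2d + 1) module receptive field above.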
Table 2: Comparison with other segmentation methods on Cityscapes.

| Method | Extra dataset | Param (M) | Flop (G) | Class mIOU |
| FCN-8S (Long, Shelhamer, and Darrell 2015) [12] | ImageNet | 134.5 | 970.15 | 65.3 |
| Segnet (Badrinarayanan, Kendall, and Cipolla 2015) [1] | ImageNet | 29.5 | 163.1 | 57.0 |
| DeepLabV3+ (Chen et al. 2018) [4] | ImageNet | 41.06 | 280.08 | 82.1 |
| DenseASPP-121 (Yang et al. 2018) [13] | ImageNet | 8.28 | 155.82 | 76.2 |
| DRN-A-50 (Yu, Koltun, and Funkhouser 2017) [27] | ImageNet | 23.55 | 400.15 | 67.3 |
| DRN-C-26 (Yu, Koltun, and Funkhouser 2017) [27] | ImageNet | 20.62 | 355.18 | 68.0 |
| CCC-DRN-A-50 | ImageNet | 16.79 | 289.28 | 67.12 |
| CCC-DRN-C-26 | ImageNet | 7.34 | 137.44 | 67.6 |
| Enet (Paszke et al. 2016) [14] | no | 0.364 | 8.52 | 58.3 |
| Skip-Shuffle (Siam et al. 2018) [20] | no | 1.0 | 6.22 | 58.3 |
| Skip-Mobile (Siam et al. 2018) [20] | no | 3.37 | 15.38 | 62.4 |
| ContextNet-cn12 (Poudel et al. 2018) [15] | no | 0.85 | 32.68 | 68.7 |
| ERFnet (Romera et al. 2017) [16] | no | 2.1 | 53.48 | 68.0 |
| CCC-ERFnet | no | 1.45 | 43.34 | 69.01 |
| ESPnet (Mehta et al. 2018) [18] | no | 0.364 | 9.67 | 60.3 |
| ESPnet-tiny (Mehta et al. 2018) [18] | no | 0.202 | 6.48 | 56.3 |
| CCC1 (d = 2, 4, 8, 16) | no | 0.198 | 6.45 | 60.98 |
| CCC2 (d = 2, 3, 7, 13) | no | 0.192 | 6.29 | 61.96 |
| CCC-ESPnet | no | 0.21 | 6.75 | 61.06 |
As in ESPnet, the overall network adopts the feature re-sampling method that concatenates the feature map with one obtained by reducing the resolution of the original input image. For our encoder, we simply adopt a structure similar to the encoder of ESPnet. Since our module is more efficient than the ESP module, the number of parameters is significantly reduced. The decoder concatenates the feature maps of the encoder at each stage to recover the original size, and we also replace the existing ESP module in the decoder with our CCC module.
We note that our CCC block can be adapted not only to ESPnet but also to other deep networks based on dilated convolution to build more efficient models. By changing the dilated convolutions in a segmentation model to the proposed CCC block, we can reduce the number of parameters and the computational cost while maintaining or enhancing the segmentation performance compared to the original network. Detailed results are provided in Section 4.

Table 3: Ablation study of the CCC block (dilate-ratio settings are described in the text).

| # | Method | HFF | Param (M) | Flop (G) | mIOU |
| (1) | Baseline ESPnet | O | 0.364 | 9.07 | 60.74 |
| (2) | Depthwise separable ESPnet | O | 0.128 | 4.93 | 57.81 |
| (3) | w/o concentration stage (structure A) | X | 0.152 | 5.44 | 54.24 |
| (4) | w/o concentration stage ++ (structure A) | X | 0.187 | 6.26 | 55.22 |
| (5) | Concentration stage with regular dwConv (structure B) | X | 0.580 | 15.68 | 58.56 |
| (6) | Our proposed CCC with dwAsym (structure C) | X | 0.194 | 6.40 | 59.57 |
| (7) | Our proposed CCC with dwAsym-NL (structure C) | X | 0.198 | 6.45 | 60.98 |

CCC: concentrated-comprehensive convolution. ++: increased number of encoder layers for similar parameters and flops. dwConv: depthwise convolution. dwAsym: depthwise asymmetric convolution. dwAsym-NL: batch normalization and PReLU inserted between the depthwise asymmetric convolutions. All results are reproduced in the PyTorch framework under the same data augmentation and settings; flops are computed for a fixed input size.

4 Experiment
We evaluated the proposed method and performed an ablation study on the Cityscapes dataset [6], which is widely used in semantic segmentation. All performances were measured in mean intersection over union (mIOU), the number of parameters, and flops. To show the flexibility of our CCC, we also performed a classification task on the ImageNet dataset [17] using the CCC block.
4.1 Evaluation on Cityscapes
Experimental Setting: The Cityscapes dataset consists of urban street scenes with multiple classes. The numbers of training and validation images are 2,975 and 500, respectively. To train the model, we followed the standard data augmentation strategy [6], including random scaling, cropping, and flipping. The learning rate was decayed at epochs 150 and 200 (300 epochs in total), and the Adam optimizer [11] was used for training. The further experiments (CCC-ERFnet and CCC-DRN) followed the settings of the respective original papers [16, 27].

Experimental Results on Cityscapes: Table 2 and Figure 5 show the evaluation results of the baseline segmentation networks after switching their dilated convolutional blocks to the proposed CCC block. Both CCC1 and CCC2 use ESPnet as the baseline network with varying dilation ratios, $d = \{2, 4, 8, 16\}$ and $d = \{2, 3, 7, 13\}$, respectively. Note that the HFF degridding module used in the original ESP module is not used in CCC1 and CCC2. CCC-ESPnet uses all the original settings, including the HFF module and a convolution of dilate ratio 1, but its mIOU is not very different from CCC1 and CCC2. The results show that CCC2 outperforms CCC1 with fewer parameters, which supports the proposition of previous work [25] that the dilation ratios should be coprime.
The ESPnet-tiny model is a smaller version of ESPnet from the original paper, with a number of parameters similar to the proposed CCC1 and CCC2. The results show that both CCC1 and CCC2 outperform ESPnet-tiny by significant margins. On an Nvidia TX2 board, both CCC1 and CCC2 reached real-time frame rates, so both can be regarded as real-time methods.
To demonstrate the general applicability of our CCC block, we conducted further experiments with other segmentation models based on dilated convolution: ERFnet, DRN-A-50, and DRN-C-26. For every model, our method achieved comparable or higher performance with a reduced number of parameters and flops. In particular, CCC-ERFnet showed more than a 1% performance enhancement with 30% fewer parameters. CCC-DRN-A-50 reduced the number of parameters and flops from the baseline DRN-A-50 with only a slight performance drop. In the CCC-DRN-C-26 case, the number of parameters and flops was substantially reduced from DRN-C-26 with only a small performance drop. The overall results show that the proposed CCC block can be incorporated into diverse segmentation models that use dilated convolution blocks without much performance degradation.
Figure 6: (a) Image and ground truth, (b) structure A, (c) structure B, (d) structure C, (e) structure C with NL.
Ablation Study on CCC: In this section, we present ablation results for the proposed CCC block based on Figure 3. The experiments use the network based on ESPnet introduced in Section 3. Experiment (1) is the original ESPnet. Experiment (2) uses the same network structure as the original ESPnet, but the standard dilated convolutions are substituted with depthwise separable dilated convolutions (dsDilate). Experiments (3)-(7) are ablation studies of the proposed method and use the dilate ratios of CCC1, $d = \{2, 4, 8, 16\}$. In these experiments, batch normalization was added between the concentration stage and the comprehensive convolution stage, and the HFF degridding module was not used. Experiment (3) uses only dsDilate without the information concentration stage. Experiment (4) is the same ablation as Experiment (3) but with more layers, so that its parameter size and flops are similar to the proposed method (Experiment (6)). Experiment (5) is structure B in Figure 3, in which the information concentration stage consists of a regular depthwise convolution (dwConv). Experiment (6) is the proposed CCC block (structure C) in Figure 3; this model reduces the number of parameters and flops by factorizing the dwConv of Experiment (5) into depthwise asymmetric convolutions (dwAsym). Experiment (7) inserts batch normalization and PReLU between the two dwAsym convolutions of Experiment (6). ESPnet uses PReLU as the activation function, so we also use PReLU instead of ReLU between the dwAsym convolutions.
As shown in Table 3 (2)-(4), a naive use of the depthwise separable architecture brings a significant performance drop (about 3%), and the HFF module cannot fully resolve it. From Experiments (3) and (4), we conclude that the information concentration stage is critical for resolving the accuracy drop of dsDilate, regardless of the number of parameters. Experiment (5) shows that simply using a depthwise convolution in the information concentration stage achieves better performance than using HFF; however, the number of parameters is quite large in this case. In Experiment (6), the proposed model drastically reduces the number of parameters and flops while achieving better performance than Experiment (5). From Experiment (7), we find that adding non-linearity between the asymmetric convolutions marginally enhances the performance over Experiment (6).
Figure 6 visualizes the output feature maps of the various structures, obtained by averaging the feature maps over channels. The feature maps of structure A are unclear and show a serious grid effect, as in Figure 6 (b). In contrast, the information concentration stage improves the quality of the feature maps (see (c)-(e)). Note that the artifacts in structure C (Figure 6 (d)) are almost removed by adding non-linearity between the asymmetric convolutions (Figure 6 (e)).
4.2 Evaluation on Classification
Table 4: Classification results on ImageNet-1k.

| Model | Param (M) | Flop (G) | Top-1 | Top-5 |
| DRN-A-34 | 21.8 | 17.33 | 75.19 | 91.26 |
| DRN-A-50 | 25.6 | 19.15 | 77.06 | 93.43 |
| DRN-C-26 | 21.1 | 17.0 | 75.14 | 92.45 |
| DRN-C-42 | 31.2 | 25.1 | 77.06 | 93.43 |
| CCC-DRN-A-50 | 18.8 | 13.85 | 76.51 | 93.02 |
| CCC-DRN-C-26 | 7.85 | 6.59 | 73.43 | 91.33 |
| CCC-DRN-C-28* | 13.11 | 10.72 | 74.39 | 91.99 |
| CCC-DRN-C-42 | 9.63 | 8.17 | 74.96 | 92.31 |
| CCC-DRN-C-44* | 14.90 | 12.3 | 75.64 | 92.63 |
Experimental Setting: The evaluation is conducted on the ImageNet-1k [17] dataset. Training was performed with SGD with momentum 0.9 and weight decay, with the learning rate reduced by a factor of 10 every 30 epochs. Training ran for 120 epochs with the same data augmentation as DRN [27].
Figure 7: (a) Image, (b) DRN-C-26, (c) CCC-DRN-C-44.
ImageNet-1k Benchmark Results: The segmentation experiments showed that the proposed CCC block can successfully substitute for the dilated convolutional block. Going one step beyond segmentation, we tested the proposed block in classification networks that incorporate the dilated convolutional scheme with skip connections (DRN) [27]. The results in Table 4 show that the proposed method achieves comparable classification accuracy with a reduced parameter size.
For example, CCC-DRN-C-42 reduces the parameters and flops by more than half compared to DRN-C-26, while its top-1 accuracy drops by only 0.18%. Also, CCC-DRN-C-44* reduces the number of parameters by about 30% and flops by about 28% compared to DRN-C-26, yet achieves better accuracy. According to [27], a low-resolution feature map degrades classification accuracy and localization power; DRN showed that adding dilated convolution strengthens localization and hence can enhance classification accuracy without adding further parameters. By changing the dilated convolutional parts of DRN to the proposed CCC block, we tested whether our method can substitute for the dilated convolution in this setting as well.
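These reductions can be checked directly against the numbers quoted from Table 4 (the constants below are copied from the table):

```python
# DRN-C-26 baseline vs CCC-DRN-C-42, values from Table 4.
drn_c26_params, drn_c26_flops, drn_c26_top1 = 21.1, 17.0, 75.14
ccc_c42_params, ccc_c42_flops, ccc_c42_top1 = 9.63, 8.17, 74.96

assert ccc_c42_params < drn_c26_params / 2   # parameters more than halved
assert ccc_c42_flops < drn_c26_flops / 2     # flops more than halved
print(round(drn_c26_top1 - ccc_c42_top1, 2)) # 0.18 (top-1 drop)
```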
We visualize the dense pixel-level class activation maps from DRN and our CCC-DRN in Figure 7 to compare the localization power of each model. The figure shows that the proposed method generates activation maps of comparable resolution, meaning that our method retains sufficient localization capacity relative to DRN.
5 Conclusion
In this work, we proposed the concentrated-comprehensive convolution (CCC) block for lightweight semantic segmentation. Our CCC block comprises an information concentration stage and a comprehensive convolution stage, each based on consecutive convolutional blocks. More specifically, the former improves local consistency using two depthwise asymmetric convolutions, and the latter enlarges the receptive field using a depthwise separable dilated convolution. Extensive experiments show that the proposed method integrates local and global information effectively and significantly reduces the number of parameters and flops. Furthermore, the CCC block is generally applicable to other semantic segmentation models and can be adopted for other tasks such as classification.
Table 5: Flop count for each operation, using multiply-accumulate counting with the notation below ($g$ denotes the group size).

| Layer | Flop |
| Convolution | $K_h K_w (C_{in}/g) C_{out} H_{out} W_{out}$ |
| Deconvolution | $K_h K_w (C_{in}/g) C_{out} H_{out} W_{out}$ |
| Linear | $C_{in} C_{out}$ |
| Average pooling | $K_h K_w C_{out} H_{out} W_{out}$ |
| Bilinear upsampling | $C_{out} H_{out} W_{out}$ |
| Batch normalization | $C_{in} H W$ |
| ReLU or PReLU | $C_{in} H W$ |
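Assuming the conventional multiply-accumulate counting (our assumption about the paper's exact convention), the per-layer counts can be sketched as small helpers:

```python
def conv_flops(k_h, k_w, c_in, c_out, h_out, w_out, groups=1):
    """Multiply-accumulate count of a (possibly grouped) convolution."""
    return k_h * k_w * (c_in // groups) * c_out * h_out * w_out

def linear_flops(c_in, c_out):
    """Fully connected layer on a single feature vector."""
    return c_in * c_out

def elementwise_flops(h, w, c):
    """One operation per element: ReLU/PReLU, or one step of batch norm."""
    return h * w * c

print(conv_flops(3, 3, 16, 16, 32, 32))  # 2359296
```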
References
 [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoderdecoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
 [2] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
 [3] L.C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
 [4] L.C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoderdecoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
 [5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

 [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
 [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [8] M. Holschneider, R. KronlandMartinet, J. Morlet, and P. Tchamitchian. A realtime algorithm for signal analysis with the help of the wavelet transform. In Wavelets, pages 286–297. Springer, 1990.
 [9] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

 [10] L. Kaiser, A. N. Gomez, and F. Chollet. Depthwise separable convolutions for neural machine translation. In International Conference on Learning Representations, 2018.
 [11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [12] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
 [13] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.
 [14] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. Enet: A deep neural network architecture for realtime semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
 [15] R. P. Poudel, U. Bonde, S. Liwicki, and C. Zach. Contextnet: Exploring context and detail for semantic segmentation in realtime. arXiv preprint arXiv:1805.04554, 2018.
 [16] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo. Erfnet: Efficient residual factorized convnet for realtime semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, 2018.
 [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [18] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In ECCV, 2018.
 [19] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.C. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
 [20] M. Siam, M. Gamal, M. AbdelRazek, S. Yogamani, and M. Jagersand. Rtseg: Realtime semantic segmentation comparative study. In IEEE International Conference on Image Processing (ICIP), pages 1603 – 1607, 2018.
 [21] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inceptionv4, inceptionresnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
 [22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
 [23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
 [24] N. Vallurupalli, S. Annamaneni, G. Varma, C. Jawahar, M. Mathew, and S. Nagori. Efficient semantic segmentation using gradual grouping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 598–606, 2018.
 [25] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In IEEE Winter Conf. on Applications of Computer Vision (WACV), 2018.

 [26] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.
 [27] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [28] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.
6 Appendix
6.1 Flops Calculation
Table 5 shows how we calculated flops for each operation.
The following notations are used.
$I$: the input feature map
$O$: the output feature map
$K$: a convolution kernel
$K_h$: the height of the convolution kernel
$K_w$: the width of the convolution kernel
$H$: the height of the input feature map
$W$: the width of the input feature map
$C_{in}$: the input channel dimension of a feature map or kernel
$C_{out}$: the output channel dimension of a feature map or kernel
$g$: the group size along the channel dimension
$H_{out}$: the height of the output feature map
$W_{out}$: the width of the output feature map
$\sigma$: a non-linear activation function