A challenging task in computer vision is semantic segmentation, where the goal is to assign a class label (e.g., road, car, person) to each pixel of an image. Many recent successes in semantic segmentation have centered around deep learning (LeCun et al., 2015), particularly deep convolutional neural networks that learn the mapping between input images and output semantic segmentation label maps. Notable state-of-the-art deep convolutional neural network architectures proposed in the research literature include RefineNet (Lin et al., 2017), TuSimple (Wang et al., 2017), PSPNet (Zhao et al., 2017), and the DeepLab family of networks (Chen et al., 2018a, 2017, b).
Despite these significant advances in deep convolutional neural networks for semantic segmentation over recent years, the high architectural and computational complexities of such networks pose a significant challenge to widespread deployment in practical, on-device edge scenarios such as mobile devices, drones, and vehicles, where computational, memory, bandwidth, and energy resources are very limited. This motivates the investigation of compact deep convolutional neural network designs for semantic segmentation tailored to such low-power edge scenarios.
A number of interesting strategies have been proposed in the research literature for producing compact deep neural networks better suited to low-power on-device usage. These strategies include precision reduction (Jacob et al., 2017; Meng et al., 2017; Courbariaux et al., 2015), model compression (Han et al., 2015; Hinton et al., 2015; Ravi, 2017), and architectural design principles (Howard et al., 2017; Sandler et al., 2018; Iandola et al., 2016; Shafiee et al., 2017; Wong et al., 2018b; Zhang et al., 2017; Ma et al., 2018; He et al., 2015). More recently, an interesting new strategy explored by researchers is fully automated network architecture search, which algorithmically explores compact deep neural network architecture designs that are better suited for on-device edge and mobile usage. Exemplary automated network architecture search strategies in this direction include MONAS (Hsu et al., 2018), ParetoNASH (Elsken et al., 2018), and MNAS (Tan et al., 2018), which take computational constraints into account during the search process.
In this study, we introduce EdgeSegNet, a compact deep convolutional neural network for the task of semantic segmentation. This is accomplished via a human-machine collaborative design strategy, where human-driven principled network design prototyping is coupled with machine-driven design exploration. Such an approach leads to customized module-level macroarchitecture and microarchitecture designs tailored specifically for semantic segmentation in low-power edge scenarios.
Figure 1: The network architecture of the proposed EdgeSegNet for semantic segmentation. The underlying architecture comprises a heterogeneous mix of residual bottleneck macroarchitectures and non-residual bottleneck macroarchitectures with unique module-level microarchitecture designs. Also notable are the selective use of long-range shortcut connectivity and aggressive spatial reduction via strided convolutions.
Here, we introduce EdgeSegNet, a compact deep convolutional neural network for semantic segmentation that was created via a human-machine collaborative design strategy (Wong et al., 2019). To leverage this human-machine collaborative design strategy for building EdgeSegNet, we first perform principled network design prototyping to construct an initial design prototype to act as the base framework. Next, we conduct machine-driven design exploration based on this initial design prototype along with accompanying data and design requirements. We will now discuss each of these design stages, followed by the EdgeSegNet architecture design.
2.1 Principled network design prototyping
At the principled network design prototyping stage of creating EdgeSegNet, we construct an initial semantic segmentation network design prototype (denoted as φ) based on human-driven design principles to act as a guide for the machine-driven design exploration phase. Inspired by the design principles for building semantic segmentation networks proposed in (Lin et al., 2017), we construct the initial design prototype with a multi-path refinement network architecture that enables improved high-resolution prediction by leveraging long-range shortcut connections. Such long-range shortcut connections enable the high-level semantic modeling in the deep layers to be refined based on the fine-grained modeling in the earlier layers.
More specifically, the initial multi-path refinement design prototype for semantic segmentation used in this study is comprised of a number of feature representation modules, with shortcut connections between the modules. Refine modules are interspersed between these feature representation modules to enable the outputs of the deep layers to be refined based on those of the earlier layers. The actual macroarchitecture and microarchitecture designs of the individual network modules are left flexible, so that the machine-driven design exploration phase can determine them automatically based on the given dataset along with human-specified design requirements catered to on-device edge scenarios where computational and memory complexity are highly limited.
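To illustrate the long-range shortcut refinement idea, below is a minimal NumPy sketch of one refine step: a deep, low-resolution feature map has its channels projected down by a 1x1 convolution, is upsampled, and is fused with a higher-resolution earlier feature map. All channel and spatial sizes here are hypothetical; the actual EdgeSegNet module designs were determined by the machine-driven exploration phase, not by this sketch.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution: mixes channels independently at each
    spatial location. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """Nearest-neighbour 2x spatial upsampling. x: (C, H, W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def refine(deep, earlier, w_proj):
    """Long-range shortcut fusion: project the deep map's channels to match
    the earlier map, upsample to its resolution, and add."""
    return upsample2x(conv1x1(deep, w_proj)) + earlier

rng = np.random.default_rng(0)
deep = rng.standard_normal((64, 8, 8))       # deep layer: 64 ch, 8x8
earlier = rng.standard_normal((32, 16, 16))  # earlier layer: 32 ch, 16x16
w = rng.standard_normal((32, 64))            # hypothetical 1x1 projection weights
out = refine(deep, earlier, w)
print(out.shape)  # (32, 16, 16)
```

The fused output retains the earlier layer's spatial resolution, which is what enables the refined high-resolution prediction described above.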
2.2 Machine-driven design exploration
Given the initial network design prototype φ, the module-level macroarchitecture and microarchitecture designs of the proposed EdgeSegNet network architecture are then determined via a machine-driven design exploration stage, based on the segmentation data at hand as well as human-specified requirements. This stage ensures that the generated macroarchitecture and microarchitecture designs are well-suited for on-device semantic segmentation in edge scenarios.
Machine-driven design exploration is accomplished here in the form of generative synthesis (Wong et al., 2018a), which determines the fine-grained macroarchitecture and microarchitecture designs of the individual network modules of the EdgeSegNet network architecture based on data and human-specified design requirements and constraints. The underlying premise behind generative synthesis is to learn a generator G that, given a set of seeds S, can generate networks that maximize a universal performance function U (e.g., (Wong, 2018)) while satisfying requirements defined via an indicator function 1_r(·). This can be formulated as a constrained optimization problem,

    G = max_G U(G(s))  subject to  1_r(G(s)) = 1, ∀ s ∈ S.   (1)

An approximate solution to the constrained optimization problem posed in Eq. 1 can be obtained via iterative optimization, with the initial solution (i.e., G_0) initialized based on φ, U, and 1_r(·), and each successive solution achieving a higher U than its predecessor generators (i.e., G_1, G_2, etc.) while constrained by 1_r(·). The resulting solution can thus be used to generate the final EdgeSegNet network that satisfies 1_r(·).
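To make the constrained optimization intuition concrete, here is a toy Python sketch of constraint-guided iterative search: candidate designs are scored by a made-up performance function U and filtered by a made-up indicator function 1_r, and each iteration keeps the best feasible design seen so far. Both functions are purely illustrative assumptions; the actual generative synthesis machinery (Wong et al., 2018a) learns a generator rather than sampling candidates like this.

```python
import numpy as np

rng = np.random.default_rng(1)

def U(design):
    """Toy 'universal performance function': reward accuracy, penalize size."""
    accuracy, size_mb = design
    return accuracy / np.log10(size_mb + 1.0)

def indicator_1r(design):
    """Toy indicator function encoding a human-specified requirement,
    e.g. a minimum validation accuracy of 88%."""
    accuracy, size_mb = design
    return accuracy >= 0.88

# Iterative improvement: keep the best feasible design found so far,
# mimicking successive generators with rising U under the constraint 1_r.
best = None
for step in range(100):
    candidate = (rng.uniform(0.80, 0.95), rng.uniform(10, 500))
    if indicator_1r(candidate) and (best is None or U(candidate) > U(best)):
        best = candidate

print(best)
```

The final `best` design is both feasible under 1_r and the highest-scoring candidate encountered, mirroring the role of the converged generator in Eq. 1.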
Here, we configure the indicator function 1_r(·) such that the accuracy on the Cambridge-driving Labeled Video Database (CamVid) (Brostow et al., 2008), a dataset introduced for evaluating semantic segmentation with 32 different semantic classes, is ≥ 88%, so that it is within 3% of ResNet-101 RefineNet (Lin et al., 2017), a state-of-the-art network.
3 EdgeSegNet Architectural Design
The network architecture of the proposed EdgeSegNet for semantic segmentation is shown in Fig. 1. A number of interesting observations can be made about the module-level macroarchitecture design of the customized modules of EdgeSegNet created via the human-machine collaborative design strategy.
3.1 Macroarchitecture heterogeneity
The most notable observation about the proposed EdgeSegNet network architecture is that it is comprised of a heterogeneous mix of residual bottleneck macroarchitectures with shortcut connections and non-residual bottleneck macroarchitectures. The use of bottleneck macroarchitectures enables channel dimensionality to be decreased at a compression convolutional layer using 1×1 convolutions before being restored at a later convolutional layer, thus reducing the architectural and computational complexity of the network while preserving modeling performance.
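A quick parameter count shows why the bottleneck pattern saves complexity. The channel widths below are illustrative assumptions, not EdgeSegNet's actual ones; the comparison is between a single full-width 3x3 convolution and a compress/process/restore bottleneck.

```python
# Weight-parameter counts for channel mixing with and without a bottleneck.
# Channel sizes are illustrative, not EdgeSegNet's actual values.
c, mid, k = 256, 64, 3

plain = c * c * k * k  # one 3x3 conv at full channel width
bottleneck = (
    c * mid            # 1x1 compression: 256 -> 64 channels
    + mid * mid * k * k  # 3x3 conv at reduced width
    + mid * c          # 1x1 restoration: 64 -> 256 channels
)

print(plain, bottleneck, plain / bottleneck)  # ~8.5x fewer parameters
```

Under these assumed widths, the bottleneck needs roughly an eighth of the weights of the plain layer while keeping the same input and output channel dimensionality.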
3.2 Selective long-range shortcut connectivity
The second notable observation about the proposed EdgeSegNet network architecture is that long-range shortcut connections exist only for a subset of possible combinations of layers, so only some of the high-level semantic modeling at the deep layers is refined based on fine-grained modeling at the earlier layers. This reduction in long-range shortcut connectivity not only reduces the architectural complexity of the network, but also suggests that only certain scales benefit from refinement.
3.3 Aggressive reduction via strided convolutions
The third notable observation about the proposed EdgeSegNet network architecture is that the non-residual bottleneck reduction module macroarchitecture leverages 8×8 strided convolutions, and as such achieves very aggressive reduction of spatial dimensionality into the next layer. This dimensionality reduction property of the non-residual bottleneck reduction module macroarchitecture significantly reduces the architectural and computational complexity of the network.
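The effect of such aggressive striding can be checked with the standard convolution output-size formula; the input resolution below is a hypothetical example, not a claim about EdgeSegNet's actual input size.

```python
# Output spatial size of a convolution: floor((H + 2p - k) / s) + 1.
def conv_out(h, k, s, p=0):
    return (h + 2 * p - k) // s + 1

# A kernel with stride 8 shrinks each spatial dimension roughly 8-fold,
# so activation area (and downstream compute) drops ~64x in one layer.
h = 512  # hypothetical input height/width
print(conv_out(h, k=8, s=8))  # 64
```

A single such layer therefore performs the spatial reduction that would otherwise require three consecutive stride-2 layers, which is the source of the complexity savings noted above.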
4 Results and Discussion
The efficacy of the proposed EdgeSegNet for semantic segmentation in on-device edge scenarios was evaluated using the Cambridge-driving Labeled Video Database (CamVid) (Brostow et al., 2008), a dataset introduced for evaluating performance of deep neural networks for semantic segmentation with 32 different semantic classes. Furthermore, we report the model size as well as the inference speed on an NVidia Jetson AGX Xavier module. For comparison purposes, the results for ResNet-101 RefineNet (Lin et al., 2017), a state-of-the-art semantic segmentation network, are also presented.
Table 1: Semantic segmentation results on CamVid (columns: Model, Acc (%), Speed (FPS), Size (Mb)). Speed computed on an NVidia Jetson AGX Xavier module; RefineNet was too large to run due to insufficient memory.
As shown in Table 1, the proposed EdgeSegNet achieved similar accuracy to ResNet-101 RefineNet (a difference of just 0.6%), but is 20× smaller in terms of model size. More interestingly, EdgeSegNet achieved an inference speed of 38.5 FPS on an NVidia Jetson AGX Xavier module running at 1.37 GHz with 512 CUDA cores, while RefineNet was too large to run due to insufficient memory (for context, RefineNet runs at just 28 FPS on an NVidia GTX 1080Ti running at 1.4 GHz with 3584 CUDA cores). An example semantic segmentation label map produced using EdgeSegNet on a CamVid video is shown in Fig. 2. It can be observed that strong visual segmentation results can be achieved using the proposed EdgeSegNet.
The experimental results demonstrate that the proposed EdgeSegNet achieves state-of-the-art performance while being noticeably smaller and requiring significantly fewer computations. As such, EdgeSegNet is well-suited for semantic segmentation in on-device edge and mobile scenarios where resources are very limited yet fast inference is needed.
- Brostow et al. (2008) Brostow, G. et al. Semantic object classes in video: A high-definition ground truth database. In PRL, 2008.
- Chen et al. (2017) Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. Rethinking atrous convolution for semantic image segmentation. In arXiv:1706.05587, 2017.
- Chen et al. (2018a) Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018a.
- Chen et al. (2018b) Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018b.
- Courbariaux et al. (2015) Courbariaux, M., Bengio, Y., and David, J.-P. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131, 2015.
- Elsken et al. (2018) Elsken, T., Metzen, J. H., and Hutter, F. Multi-objective architecture search for cnns. arXiv preprint arXiv:1804.09081, 2018.
- Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- He et al. (2015) He, K. et al. Deep residual learning for image recognition. arXiv:1512.03385, 2015.
- Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Hsu et al. (2018) Hsu, C.-H., Chang, S.-H., Juan, D.-C., Pan, J.-Y., Chen, Y.-T., Wei, W., and Chang, S.-C. Monas: Multi-objective neural architecture search using reinforcement learning. arXiv preprint arXiv:1806.10332, 2018.
- Iandola et al. (2016) Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
- Jacob et al. (2017) Jacob, B. et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv:1712.05877, 2017.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436, 2015.
- Lin et al. (2017) Lin, G., Milan, A., Shen, C., and Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1925–1934, 2017.
- Ma et al. (2018) Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131, 2018.
- Meng et al. (2017) Meng, W. et al. Two-bit networks for deep learning on resource-constrained embedded devices. arXiv:1701.00485, 2017.
- Ravi (2017) Ravi, S. ProjectionNet: Learning efficient on-device deep networks using neural projections. arXiv:1708.00630, 2017.
- Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
- Shafiee et al. (2017) Shafiee, M. J., Li, F., Chwyl, B., and Wong, A. Squishednets: Squishing squeezenet further for edge device scenarios via deep evolutionary synthesis. NIPS Workshop on Machine Learning on the Phone and other Consumer Devices, 2017.
- Tan et al. (2018) Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
- Wang et al. (2017) Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., and Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of WACV, 2017.
- Wong (2018) Wong, A. Netscore: Towards universal metrics for large-scale performance analysis of deep neural networks for practical usage. arXiv preprint arXiv:1806.05512, 2018.
- Wong et al. (2018a) Wong, A., Shafiee, M. J., Chwyl, B., and Li, F. Ferminets: Learning generative machines to generate efficient neural networks via generative synthesis. Advances in neural information processing systems Workshops, 2018a.
- Wong et al. (2018b) Wong, A., Shafiee, M. J., Li, F., and Chwyl, B. Tiny ssd: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection. Proceedings of the Conference on Computer and Robot Vision, 2018b.
- Wong et al. (2019) Wong, A., Lin, Z. Q., and Chwyl, B. Attonets: Compact and efficient deep neural networks for the edge via human-machine collaborative design. arXiv preprint arXiv:1903.07209, 2019.
- Zhang et al. (2017) Zhang, X., Zhou, X., Lin, M., and Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In arXiv:1707.01083, 2017.
- Zhao et al. (2017) Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network. In Proceedings of CVPR, 2017.