In this paper, we propose a novel network design mechanism for efficient embedded computing. Inspired by the limited computing patterns of embedded hardware, we propose to fix the number of channels in a group convolution, instead of following the existing practice of fixing the total number of groups. The resulting network, named Variable Group Convolutional Network (VarGNet), can be optimized more easily on the hardware side, due to the more unified computing schemes among its layers. Extensive experiments on various vision tasks, including classification, detection, pixel-wise parsing and face recognition, demonstrate the practical value of VarGNet.
Empowering embedded systems to run well-known deep learning architectures, such as convolutional neural networks (CNNs), has been a hot topic in recent years. For smart Internet of Things applications, the challenging part is that the whole system must be both energy-constrained and small. To meet this challenge, work on improving the efficiency of the whole computing process can be roughly divided into two directions. The first is to design lightweight networks with small MAdds howard2017mobilenets ; sandler2018mobilenetv2 ; zhang2018shufflenet ; ma2018shufflenet , which are thus friendly to low-power platforms. The second is to optimize hardware-side configurations, such as FPGA-based accelerators FarabetPHL09 ; ZhangLSGXC15 , or to make the whole computing process more efficient by improving the compiler and generating smarter instructions abdelfattah2018dla ; chen2018tvm ; xing2019dnnvm .
All of the works mentioned above have demonstrated great practical value in various applications. However, the real performance may not live up to the designers' expectations, due to the gap between the two optimization directions. Specifically, for elaborately tuned networks with small MAdds, the overall latency may still be high ma2018shufflenet , while for carefully designed compilers or accelerators, real networks may be hard to process.
In this work, we intend to close the existing gap by systematically analyzing the necessary properties of a lightweight network that is friendly to embedded hardware and the corresponding compilers. More precisely, since the computation patterns of a chip in an embedded system are strictly limited, we propose that an embedded-system-friendly network should fit the targeted computation patterns and the ideal data layout. By fitting the ideal data layout, we can reduce the communication cost between on-chip memory and off-chip memory, and thus fully exploit the computation throughput.
Inspired by the observation that the computation graph of a network is easier to optimize if the computational intensity of its operations is more balanced, we propose the variable group convolution, which is based on depthwise separable convolution krizhevsky2012imagenet ; chollet2017xception ; xie2017aggregated . In variable group convolution, the number of input channels in each group is fixed and can be tuned as a hyperparameter, which differs from group convolution, where the number of groups is fixed. The benefits are twofold. First, fixing the number of channels per group is more suitable for optimization from the perspective of compilers, due to the more coherent computation pattern and data layout. Second, compared with the depthwise convolution in howard2017mobilenets ; sandler2018mobilenetv2 , which sets the group number equal to the channel number, variable group convolution has a larger network capacity sandler2018mobilenetv2 , thus allowing smaller channel numbers, which helps relieve the time-consuming off-chip communication.
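The difference between the two conventions can be illustrated with a minimal sketch. The layer widths, the group size of 8 and the group count of 4 below are hypothetical values chosen only for illustration, and the helper names are not taken from the paper.

```python
def groups_for_fixed_group_size(channels, channels_per_group):
    """Variable group convolution: the group size is fixed, the group count follows."""
    if channels % channels_per_group:
        raise ValueError("channels must be a multiple of channels_per_group")
    return channels // channels_per_group


def group_size_for_fixed_groups(channels, num_groups):
    """Conventional group convolution: the group count is fixed, the group size follows."""
    if channels % num_groups:
        raise ValueError("channels must be a multiple of num_groups")
    return channels // num_groups


# Hypothetical layer widths of a small network (assumption, not from the paper).
layer_widths = [32, 64, 128, 256]

# Fixed 8 channels per group: the number of groups varies from layer to layer.
variable_groups = [groups_for_fixed_group_size(c, 8) for c in layer_widths]

# Fixed 4 groups: the number of channels per group varies from layer to layer.
conventional_sizes = [group_size_for_fixed_groups(c, 4) for c in layer_widths]

# Depthwise convolution is the special case of one channel per group.
depthwise_groups = [groups_for_fixed_group_size(c, 1) for c in layer_widths]

for width, g, s in zip(layer_widths, variable_groups, conventional_sizes):
    print(f"width={width}: variable -> {g} groups of 8, conventional -> 4 groups of {s}")
```

With the group size held constant, every group in every layer has the same shape, which is the property the text argues is easier for a compiler to map onto fixed processing elements.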
Another key component of our network is better exploitation of the on-chip memory, based on the inverted residual block sandler2018mobilenetv2 . However, in MobileNetV2 sandler2018mobilenetv2 , the number of channels is adjusted by pointwise convolutions, which have a different computing pattern from the depthwise convolution in between and are therefore hard to optimize under limited computation patterns. Instead, we propose that the channel number of the input feature is first expanded by variable group convolution and then returned to its original value by pointwise convolution. In this manner, the computational costs of the two types of layers are more balanced, making the block more hardware- and compiler-friendly. To sum up, our contributions are as follows:
We systematically analyze how to optimize the computation of CNNs from the perspective of both network architectures and hardware/compilers on embedded systems. We find a gap between the two optimization directions: some elaborately designed architectures are hard to optimize due to the limited computation patterns of an embedded system.
Observing that a more unified computation pattern and data layout are friendlier to an embedded system, we propose the variable group convolution and the corresponding network, named the variable group network (VarGNet for short).
Experiments on prevalent vision tasks, such as classification, detection, segmentation and face recognition, and on the corresponding large-scale datasets verify the practical value of the proposed VarGNet.
Designing lightweight CNNs has been a hot topic in recent years. Representative manually designed networks include SqueezeNet 2016_SqueezeNet , Xception chollet2017xception , MobileNets howard2017mobilenets ; sandler2018mobilenetv2 , ShuffleNets zhang2018shufflenet ; ma2018shufflenet and IGC zhang2017interleaved ; xie2018interleaved ; sun2018igcv3 . Besides, neural architecture search (NAS) zoph2016neural ; pham2018efficient ; Real2018Regularized ; zoph2017learning ; liu2018darts is a promising direction for automatically designing lightweight CNNs. The above methods can effectively speed up the recognition process. More recently, platform-aware NAS methods have been proposed cai2018proxylessnas ; fbnet ; dai2018chamnet ; stamoulis2019single to search for networks that are efficient on specific hardware platforms. Our network, VarGNet, is complementary to existing platform-aware NAS methods, since the proposed variable group convolution is helpful for setting the search space in NAS methods.
To accelerate neural networks, FPGAs FarabetPHL09 ; ZhangLSGXC15 ; gupta2015deep ; ma2017optimizing and ASIC designs chen2014diannao ; reagen2016minerva ; jouppi2017datacenter ; luo2017dadiannao ; hegde2018ucnn have been widely studied. Generally speaking, Streaming Architectures (SAs) venieris2017fpgaconvnet ; xiao2017exploring and Single Computation Engines (SCEs) guo2016angel ; chang2017compiling ; abdelfattah2018dla are two kinds of FPGA-based accelerators venieris2018toolflows . The two differ in how they balance customization and generality: SA designs favor customization over generality, while SCEs emphasize the tradeoff between flexibility and customization. In this work, we aim to propose a network that existing accelerators can optimize more easily, thus improving the overall performance.
For chips used in embedded systems, such as FPGAs or ASICs, a low unit price and a fast time to market are critical factors in designing the whole system. These constraints result in a relatively simple chip configuration. In other words, the computation schemes are strictly limited compared with general-purpose processing units. However, the operators in a SOTA network are so complex that some layers can be accelerated by hardware design while others cannot. Thus, for designing efficient networks on embedded systems, the first intuition is that the layers in a network should be similar to each other in some sense.
Another important intuition is based on two properties of the convolutions used in CNNs. The first property is the computation pattern. In a convolution, several filters (kernels) slide over the whole feature map, so the kernels are used repeatedly while values from the feature map are only used once. The second property is the data size of convolutional kernels and feature maps. Typically, in 2D convolutions, the convolutional kernels are much smaller than the feature maps. In light of these two properties, an ingenious solution is to load all the kernel data first and then perform the convolution while pushing in and popping out feature data sequentially xing2019dnnvm . This practical solution is the second intuition behind the following two guidelines for efficient network design on embedded systems:
It will be better if the size of intermediate feature maps between blocks is smaller.
The computational intensity of layers in a block should be balanced.
Next, we introduce the two guidelines in detail.
In SOTA networks, a common practice is to first design a normal block and a down sampling block, and then stack several blocks together to obtain a deep network. Residual connections he2016deep are also widely adopted in these blocks. Accordingly, in recent compiler-side optimizations xing2019dnnvm , the layers in a block are usually grouped and computed together. In this manner, off-chip memory and on-chip memory only communicate at the start and end of computing a block. Therefore, a smaller intermediate feature map between blocks helps reduce the data transfer time.
As mentioned before, in practice the weights of several layers are loaded before performing convolution. If the loaded layers diverge widely in computational intensity, extra on-chip memory is needed to store the intermediate slices of feature maps. In MobileNetV1 howard2017mobilenets , a depthwise conv and a pointwise conv are used. Unlike in previous definitions, the weights are already loaded in our implementation, so computational intensity is computed as MAdds divided by the size of the feature map. Then, for the feature map size considered, the computational intensities of the depthwise convolution and the pointwise convolution are 9 and 256, respectively. As a result, when running the two layers, we must either enlarge the on-chip buffer to accommodate the pointwise convolution or give up grouping the computation of the two layers together.
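The two intensity values quoted above can be reproduced with a short sketch. It assumes a 3 x 3 depthwise kernel, 256 channels, stride 1 and a 28 x 28 spatial size; the spatial size is an arbitrary assumption (it cancels out), and the function names are hypothetical.

```python
def conv_madds(height, width, c_in, c_out, kernel, groups=1):
    """Multiply-adds of a stride-1, same-padded kernel x kernel convolution."""
    return height * width * c_out * (c_in // groups) * kernel * kernel


def computational_intensity(madds, height, width, channels):
    """MAdds divided by the number of feature-map elements, as defined in the text."""
    return madds / (height * width * channels)


# Assumed feature map: 28 x 28 spatial size with 256 channels.
H, W, C = 28, 28, 256

# Depthwise 3 x 3 convolution: one input channel per group.
depthwise_madds = conv_madds(H, W, C, C, kernel=3, groups=C)
# Pointwise 1 x 1 convolution: a single group over all channels.
pointwise_madds = conv_madds(H, W, C, C, kernel=1, groups=1)

depthwise_intensity = computational_intensity(depthwise_madds, H, W, C)
pointwise_intensity = computational_intensity(pointwise_madds, H, W, C)

print(depthwise_intensity, pointwise_intensity)
```

Under these assumptions the depthwise layer performs 9 multiply-adds per feature-map element and the pointwise layer 256, which is the imbalance the paragraph describes.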
Based on the two guidelines above, we propose a novel network in this section. To balance the computational intensity, we set the number of channels in a group to be constant throughout the network, resulting in a variable number of groups in each convolution layer. The motivation for fixing the channel number is not hard to understand if we look at the MAdds of a group convolution: for a given kernel size and output feature map, they are proportional to the number of input channels in a group.
Thus, if the size of the feature map is constant, then by fixing the number of channels in a group, the computational intensity inside a block is more balanced. Further, the number of channels in a group can be set to match the configuration of the processing elements, which process a fixed number of channels at a time.
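As a rough check of this argument, the sketch below computes the MAdds of a group convolution as kernel area times output elements times input channels per group, and divides by the output feature-map size. The stage shapes, the group size of 8 and the fixed group count of 8 are hypothetical values, not settings reported in the paper.

```python
def group_conv_madds(height, width, c_in, c_out, kernel, groups):
    """MAdds of a group convolution: each output element sees c_in / groups inputs."""
    assert c_in % groups == 0, "c_in must be divisible by groups"
    return kernel * kernel * height * width * c_out * (c_in // groups)


def intensity(height, width, c_in, c_out, kernel, groups):
    """MAdds per output feature-map element."""
    madds = group_conv_madds(height, width, c_in, c_out, kernel, groups)
    return madds / (height * width * c_out)


# Hypothetical stages: (spatial size, channels). Assumed values for illustration.
stages = [(56, 64), (28, 128), (14, 256)]
CHANNELS_PER_GROUP = 8  # assumed hyperparameter value
FIXED_GROUPS = 8        # assumed setting for conventional group convolution

# Variable group convolution: groups = channels / CHANNELS_PER_GROUP.
variable_intensity = [
    intensity(s, s, c, c, 3, groups=c // CHANNELS_PER_GROUP) for s, c in stages
]
# Conventional group convolution: the group count stays fixed.
fixed_group_intensity = [
    intensity(s, s, c, c, 3, groups=FIXED_GROUPS) for s, c in stages
]

print(variable_intensity)     # constant across stages
print(fixed_group_intensity)  # grows with the channel width
```

With a fixed group size the per-element work is the kernel area times the group size in every stage, whereas with a fixed group count it scales with the layer width.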
Compared with depthwise convolution, the variable group convolution increases the MAdds as well as the expressiveness sandler2018mobilenetv2 . Thus, we are able to reduce the channel number of intermediate feature maps while keeping the same generalization ability as previous networks. Specifically, we design novel network blocks as shown in Fig. 1. For the normal block used in the early stages of the network, the weights are relatively small, so the weights of all four layers can be cached in on-chip memory. In the late stages, where the channel numbers and hence the weight sizes increase, the normal block can still be optimized by loading only a variable group conv and a pointwise conv at a time. Similarly, the operations in the down sampling block are also friendly to compiler-side and hardware-side optimizations. The whole computing process for a normal block is demonstrated in Fig. 2. Then, based on the architecture of MobileNetV1 howard2017mobilenets , we replace its basic blocks with ours; the detailed network architecture is shown in Tab. 1. Another architecture, based on ShuffleNet v2, is shown in Tab. 2.
|Layer||Output Size||KSize||Stride||Repeat||Output Channels|
|Image||224 x 224|| || || ||3||3||3||3||3||3||3|
|Conv 1||112 x 112||3 x 3||2||1||8||16||24||32||40||48||56|
|DownSample||56 x 56|| ||2||3||16||32||48||64||80||96||112|
|DownSample||14 x 14|| ||2||1||64||128||192||256||320||384||448|
|Stage Block||14 x 14|| ||1||2||64||128||192||256||320||384||448|
|DownSample||7 x 7|| ||2||1||128||256||384||512||640||768||896|
|Stage Block||7 x 7|| ||1||1||128||256||384||512||640||768||896|
|Conv 5||7 x 7||1 x 1||1||1||1024||1024||1024||1024||1280||1536||1792|
|Global Pool||1 x 1||7 x 7|| || || |
|Layer||Output Size||KSize||Stride||Repeat||Output Channels|
|Image||224 x 224|| || || ||3||3||3||3||3||3||3||3|
|Conv 1||112 x 112||3 x 3||2||1||8||16||24||32||40||48||56||64|
|Head Block||56 x 56|| ||2||1||8||16||24||32||40||48||56||64|
|Stage 2||28 x 28|| ||2||1||16||32||48||64||80||96||112||128|
| ||28 x 28|| ||1||2|| |
|Stage 3||14 x 14|| ||2||1||32||64||96||128||160||192||224||256|
| ||14 x 14|| ||1||6|| |
|Stage 4||7 x 7|| ||2||1||64||128||192||256||320||384||448||512|
| ||7 x 7|| ||1||3|| |
|Conv 5||7 x 7||1 x 1||1||1||1024||1024||1024||1024||1280||1536||1792||2048|
|Global Pool||1 x 1||7 x 7|| || || |
|(c) Comparison network: MobileNet v1|
VarGNet v1 performance on ImageNet. (The group parameter denotes the number of channels in a group.)
|(c) Comparison network: ShuffleNet v2|
Training hyperparameters are set as follows: batch size 1024, crop ratio 0.875, learning rate 0.4, cosine learning rate schedule, weight decay 4e-5 and 240 training epochs. We can observe that VarGNet v1 performs better than MobileNet v1, as shown in Tab. 3. From (c) in Tab. 4, we can see that when the model scale is small, VarGNet v2 performs worse than ShuffleNet v2, due to the fewer channels used in VarGNet v2. When the model size is large, our network performs better.
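For reference, a cosine learning rate schedule with the stated base rate and epoch count can be sketched as below. The sketch assumes per-epoch updates, no warmup and a final learning rate of zero, none of which is specified in the text; `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(epoch, base_lr=0.4, total_epochs=240):
    """Cosine-annealed learning rate, decaying from base_lr to 0 (assumed end value)."""
    progress = epoch / total_epochs
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle and end of training.
lr_start = cosine_lr(0)
lr_middle = cosine_lr(120)
lr_end = cosine_lr(240)
print(lr_start, lr_middle, lr_end)
```

In this form the rate falls slowly at first, fastest around the midpoint, and flattens again near the end of training.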
In Tab. 5, we present the performance of our proposed VarGNet as well as the comparison methods. We evaluate the object detection performance of our proposed networks on the COCO dataset Lin2014MicrosoftCC and compare them with other state-of-the-art lightweight architectures. We choose FPN-based Faster R-CNN Lin2017FeaturePN as the framework, and all experiments are run under the same settings, with an input resolution of 800 x 1333 and 18 training epochs. Specifically, we find that ShuffleNet v2 achieves better accuracy when trained for more epochs, so a ShuffleNet v2 model is also trained for 27 epochs. At test time, 1000 proposals per image are evaluated in the RPN stage. We train on the train+val set, excluding 8000 minival images, and test on the minival set.
|MobileNet v1 1.0||24.15||31.1|
|MobileNet v2 1.0||18.71||31.0|
|ShuffleNet v1 1.0||15.31||27.9|
|ShuffleNet v2 1.0||15.55||27.5|
|ShuffleNet v2 1.0 (27 epochs)||15.55||28.9|
|VarGNet v1 1.0||24.91||33.7|
|VarGNet v2 0.5||14.98||28.6|
|VarGNet v2 1.0||19.61||33.3|
We use the standard Adam Optimizer with weight decay set to 1e-5 and batch size set to 16. The learning rate is initialized as 1e-4 and follows a polynomial decay with power of 0.9. Total training epochs are set as 100. For data augmentation, random horizontal flip is used and images are resized with scale randomly chosen from 0.6-1.2. For multitask training, we have the loss function defined as
When the task is panoptic segmentation, we set . After adding depth task, we set .
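The polynomial learning rate decay used in this training setup can be sketched as follows. The sketch assumes the common form in which the rate is scaled by one minus the training progress, raised to the power 0.9, and updated once per epoch; the text does not state the update granularity, and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(epoch, base_lr=1e-4, total_epochs=100, power=0.9):
    """Polynomial decay: base_lr * (1 - epoch / total_epochs) ** power."""
    remaining = 1.0 - epoch / total_epochs
    return base_lr * remaining ** power


# Learning rate at the start, halfway point and end of the 100 epochs.
poly_start = poly_lr(0)
poly_half = poly_lr(50)
poly_end = poly_lr(100)
print(poly_start, poly_half, poly_end)
```

With a power below 1, the rate stays slightly above a linear ramp for most of training and drops to zero at the final epoch.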
Parameters and MAdds of the comparison methods are presented in Table 6. Results and some visual examples on segmentation and depth prediction are shown in Table 7 and Fig. 4, respectively. These results support the superiority of the proposed VarGNet v1 and v2, which are efficient and perform on par with large networks.
|(a) Semantic Segmentation (image size 2048 x 1024)|
|(c) Panoptic Segmentation (MAdds calculated with 2048 x 1024 input size.)|
|(d) Panoptic Segmentation + Depth (MAdds calculated with 2048 x 1024 input size.)|
|Image||VarGNet v1||VarGNet v2||GT|
For single image depth prediction and stereo tasks on the KITTI dataset geiger2013vision , we present the performance of our VarGNet based models. A U-Net style architecture (subfigure (b)) is employed in the experiments. All depth models are trained on the KITTI RAW dataset. We test on 697 images from 29 scenes, following the split by Eigen et al. eigen2014depth , and train on about 23488 images from the remaining 32 scenes. All experimental results are evaluated with depth ranging from 0m to 80m and from 0m to 50m. The evaluation metrics are the same as in previous works. All stereo models are trained on the KITTI RAW dataset. We test on the test set split by Eigen et al. eigen2014depth and on the training set of KITTI15. The evaluation metrics for stereo are EPE and D1. During training, the standard SGD optimizer is used with momentum set to 0.9. The weight decay is set to 0.0001 for resnet18 and resnet50, and to 0.00004 for the other networks. Training runs for 300 epochs. The initial learning rate is 0.001 and is decayed by a factor of 0.1 at epochs 120, 180 and 240. We use 4 GPUs to train the models, and the batch size is set to 24.
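The two stereo metrics can be sketched as follows. The sketch assumes the usual KITTI 2015 convention, where a pixel counts as a D1 outlier if its disparity error exceeds both 3 pixels and 5% of the ground-truth disparity; the text itself does not spell out these thresholds. The disparity values are made up for illustration, and masking of invalid pixels is omitted.

```python
def end_point_error(pred, gt):
    """EPE: mean absolute disparity error over all pixels."""
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)


def d1_outlier_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """D1: fraction of pixels whose error exceeds both thresholds (assumed convention)."""
    outliers = sum(
        1
        for p, g in zip(pred, gt)
        if abs(p - g) > abs_thresh and abs(p - g) > rel_thresh * g
    )
    return outliers / len(gt)


# Made-up predicted and ground-truth disparities for four pixels.
predicted = [10.0, 52.0, 30.0, 80.0]
ground_truth = [10.5, 50.0, 36.0, 100.0]

epe_value = end_point_error(predicted, ground_truth)
d1_value = d1_outlier_rate(predicted, ground_truth)
print(epe_value, d1_value)
```

EPE summarizes the average error magnitude, while D1 counts how often the error is large, so the two can rank models differently.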
In Table 8 and Table 9, we show our depth and stereo results under various evaluation metrics. We also report our implementations of MobileNet and ResNet for comparison. Visual results are presented in Fig. 5 and Fig. 6.
|(a) On KITTI RAW|
|(b) On KITTI 15|
All networks are trained on the DeepGlint MS-Celeb-1M-v1c dataset dg , cleaned from MS-Celeb-1M guo2016ms . It contains 3,923,399 aligned face images from 86,876 ids. LFW huang2008labeled , CFP-FP sengupta2016frontal and AgeDB-30 moschoglou2017agedb are used as the validation datasets. Finally, all network models are evaluated on MegaFace Challenge 1 nech2017level . Table 10 lists the best face recognition accuracies on the validation datasets, as well as the face verification true accept rates at a false accept rate of 1e-6 on the refined version of the MegaFace dataset deng2018arcface . We use MobileNet v1 and MobileNet v2 as baseline models. To adapt to the input image size of 112x112, the stride of the first convolutional layer is set to 1 for each baseline and VarGNet model. To achieve better performance, we further replace the pooling layer with a “BN-Dropout-FC-BN” structure as in InsightFace deng2018arcface , followed by the ArcFace loss deng2018arcface . The standard SGD optimizer is used with momentum 0.9, and the batch size is set to 512 on 8 GPUs. The learning rate starts at 0.1 and is divided by 10 at 100K, 140K and 160K iterations. We set the weight decay to 5e-4. The embedding feature dimension is 256, with a dropout rate of 0.4. The normalization scale is 64 and the ArcFace margin is set to 0.5. All training is based on the InsightFace toolbox deng2018arcface .
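To make the roles of the normalization scale and the margin concrete, the sketch below computes the target-class logit in the basic additive angular margin form, scale times the cosine of the angle plus the margin. It is a simplified illustration and not the InsightFace implementation: handling of angles where the sum exceeds pi is omitted, and `arcface_target_logit` is a hypothetical helper name.

```python
import math


def arcface_target_logit(cos_theta, scale=64.0, margin=0.5):
    """Target-class logit with an additive angular margin: scale * cos(theta + margin)."""
    # Clamp to the valid domain of acos to guard against rounding error.
    clamped = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(clamped)
    return scale * math.cos(theta + margin)


def plain_logit(cos_theta, scale=64.0):
    """Logit without a margin, as used for the non-target classes."""
    return scale * cos_theta


# A sample whose embedding has cosine similarity 0.8 to its class center.
with_margin = arcface_target_logit(0.8)
without_margin = plain_logit(0.8)
print(with_margin, without_margin)
```

The margin lowers the target logit relative to the plain cosine logit, so the embedding must move closer to its class center to reach the same loss.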
|Networks||MAdds||LFW huang2008labeled||CFP-FP sengupta2016frontal||AgeDB-30 moschoglou2017agedb||MegaFace deng2018arcface|
Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In ACM Sigplan Notices, volume 49, pages 269–284. ACM, 2014.
The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
In-datacenter performance analysis of a tensor processing unit. In ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.