VarGNet: Variable Group Convolutional Neural Network for Efficient Embedded Computing

07/12/2019, by Qian Zhang, et al.

In this paper, we propose a novel network design mechanism for efficient embedded computing. Inspired by the limited computing patterns of embedded hardware, we propose to fix the number of channels in a group convolution, instead of fixing the total number of groups as in existing practice. Our network built on this idea, named the Variable Group Convolutional Network (VarGNet), is easier to optimize on the hardware side thanks to its more unified computing scheme across layers. Extensive experiments on various vision tasks, including classification, detection, pixel-wise parsing and face recognition, demonstrate the practical value of VarGNet.


1 Introduction

Empowering embedded systems to run well-known deep learning architectures, such as convolutional neural networks (CNNs), has been a hot topic in recent years. For smart Internet of Things applications, the challenge is that the whole system must be both energy-constrained and of small size. To meet this challenge, the work on improving the efficiency of the whole computing process can be roughly divided into two directions: the first is to design lightweight networks with small MAdds howard2017mobilenets ; sandler2018mobilenetv2 ; zhang2018shufflenet ; ma2018shufflenet, which are thus friendly to low-power platforms; the second is to optimize hardware-side configurations, such as FPGA-based accelerators FarabetPHL09 ; ZhangLSGXC15, or to make the whole computing process more efficient by improving the compiler and generating smarter instructions abdelfattah2018dla ; chen2018tvm ; xing2019dnnvm.

All of the works mentioned above have demonstrated great practical value in various applications. However, the real performance may not live up to the designer's expectations, due to the gap between the two optimization directions. Specifically, for elaborately tuned networks with small MAdds, the overall latency may still be high ma2018shufflenet, while for carefully designed compilers or accelerators, real-world networks may be hard to process efficiently.

In this work, we intend to close the existing gap by systematically analyzing the necessary properties of a lightweight network that is friendly to embedded hardware and the corresponding compilers. More precisely, since the computation patterns of a chip in an embedded system are strictly limited, we propose that an embedded-system-friendly network should fit the targeted computation patterns as well as the ideal data layout. By fitting the ideal data layout, we can reduce the communication cost between on-chip and off-chip memory and thus fully exploit the computation throughput.

Inspired by the observation that the computation graph of a network is easier to optimize when the computational intensity of its operations is more balanced, we propose the variable group convolution, which is based on depthwise separable convolution krizhevsky2012imagenet ; chollet2017xception ; xie2017aggregated. In variable group convolution, the number of input channels in each group is fixed and can be tuned as a hyperparameter, in contrast to standard group convolution, where the number of groups is fixed. The benefits are twofold: fixing the number of channels per group is more amenable to compiler-side optimization, owing to the more coherent computation pattern and data layout; and compared with the depthwise convolution in howard2017mobilenets ; sandler2018mobilenetv2, which sets the group number equal to the channel number, variable group convolution has a larger network capacity sandler2018mobilenetv2, allowing smaller channel numbers and thus relieving the time-consuming off-chip communication.

Another key component of our network is to better exploit the on-chip memory, building on the inverted residual block sandler2018mobilenetv2. However, in MobileNetV2 sandler2018mobilenetv2, the number of channels is adjusted by pointwise convolutions, which have a different computing pattern from the depthwise convolution in between and are therefore hard to optimize under limited computation patterns. Instead, we propose that the input feature with C channels is first expanded to 2C channels by a variable group convolution and then reduced back to C channels by a pointwise convolution. In this manner, the computational costs of the two types of layers are more balanced, and thus more hardware- and compiler-friendly. Our contributions can be summarized as follows:

  • We systematically analyze how to optimize the computation of CNNs from the perspective of both network architectures and hardware/compilers on embedded systems. We find that there is a gap between the two optimization directions: some elaborately designed architectures are hard to optimize due to the limited computation patterns of an embedded system.

  • Observing that a more unified computation pattern and data layout are friendlier to an embedded system, we propose the variable group convolution and the corresponding improved network built on it, named Variable Group Network (VarGNet for short).

  • Experiments on prevalent vision tasks, such as classification, detection, segmentation and face recognition, and on the corresponding large-scale datasets verify the practical value of our proposed VarGNet.

1.1 Related works

Lightweight CNNs.

Designing lightweight CNNs has been a hot topic in recent years. Representative manually designed networks include SqueezeNet 2016_SqueezeNet, Xception chollet2017xception, MobileNets howard2017mobilenets ; sandler2018mobilenetv2, ShuffleNets zhang2018shufflenet ; ma2018shufflenet and IGC zhang2017interleaved ; xie2018interleaved ; sun2018igcv3. Besides, neural architecture search (NAS) zoph2016neural ; pham2018efficient ; Real2018Regularized ; zoph2017learning ; liu2018darts is a promising direction for automatically designing lightweight CNNs. The above methods can effectively speed up the recognition process. More recently, platform-aware NAS methods cai2018proxylessnas ; fbnet ; dai2018chamnet ; stamoulis2019single have been proposed to search for networks that are efficient on specific hardware platforms. Our network, VarGNet, is complementary to existing platform-aware NAS methods, since the proposed variable group convolution is helpful for defining the search space in such methods.

Optimizations on CNN accelerators.

To accelerate neural networks, FPGAs FarabetPHL09 ; ZhangLSGXC15 ; gupta2015deep ; ma2017optimizing and ASIC designs chen2014diannao ; reagen2016minerva ; jouppi2017datacenter ; luo2017dadiannao ; hegde2018ucnn have been widely studied. Generally speaking, Streaming Architectures (SAs) venieris2017fpgaconvnet ; xiao2017exploring and Single Computation Engines (SCEs) guo2016angel ; chang2017compiling ; abdelfattah2018dla are two kinds of FPGA-based accelerators venieris2018toolflows. The difference between the two directions lies in customization versus generality: SA designs favor customization over generality, while SCEs emphasize the tradeoff between flexibility and customization. In this work, we hope to propose a network that can be optimized by existing accelerators more easily, thus improving the overall performance.

2 Designing Efficient Networks on Embedded Systems

For chips used in embedded systems, such as FPGAs or ASICs, a low unit price as well as a fast time to market are critical factors in designing the whole system. These constraints result in a relatively simple chip configuration; in other words, the supported computation schemes are strictly limited compared with general-purpose processing units. However, operators in a state-of-the-art network are so diverse and complex that some layers can be accelerated by the hardware design while others cannot. Thus, for designing efficient networks on embedded systems, the first intuition is that the layers in a network should be similar to each other in some sense.

Another important intuition is based on two properties of convolutions used in CNNs. The first property is the computation pattern: in a convolution, several filters (kernels) slide over the whole feature map, meaning that the kernel values are reused repeatedly while each value of the feature map is used only once. The second property is the data size of convolutional kernels and feature maps: typically, the convolutional kernels are much smaller than the feature maps, e.g., k×k×c_in×c_out values for the kernels versus H×W×c_in values for the input feature map of a 2D convolution. In light of these two properties, an ingenious solution is to load all the kernel data first and then perform the convolution while popping in and popping out feature data sequentially xing2019dnnvm. This practical solution is the second intuition behind the following two guidelines for efficient network design on embedded systems:

  • It will be better if the size of intermediate feature maps between blocks is smaller.

  • The computational intensity of layers in a block should be balanced.

Next, we introduce the two guidelines in detail.
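As a quick numeric illustration of the size asymmetry noted above (kernel data versus feature-map data), consider a hypothetical 3×3 convolution layer; the concrete shapes below are examples chosen for illustration, not values taken from the paper.

```python
# Rough data-volume comparison motivating "load all kernel data first,
# then stream feature data": weights are reused at every spatial position,
# yet occupy far less memory than the feature map they act on.
k, c_in, c_out, h, w = 3, 64, 64, 56, 56          # example shapes (assumed)
kernel_elems = k * k * c_in * c_out               # 36,864 weight values
feature_elems = h * w * c_in                      # 200,704 input feature values
print(kernel_elems, feature_elems)                # 36864 200704
```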

Small intermediate feature maps between blocks.

In SOTA networks, a common practice is to first design a normal block and a down sampling block, and then stack several such blocks to obtain a deep network. In these blocks, residual connections he2016deep are widely adopted. Thus, in recent compiler-side optimizations xing2019dnnvm, the layers in a block are usually grouped and computed together; off-chip and on-chip memory then only communicate when the computation of a block starts or ends. Therefore, smaller intermediate feature maps between blocks directly reduce the data transfer time.

Balanced computational intensity inside a block.

As mentioned before, in practice the weights of several layers are loaded before the convolutions are performed. If the loaded layers diverge greatly in computational intensity, extra on-chip memory is needed to store intermediate slices of feature maps. In MobileNetV1 howard2017mobilenets, a depthwise conv and a pointwise conv are used. Different from previous definitions, in our setting the weights are already loaded on chip, so computational intensity is computed as the MAdds divided by the size of the feature maps. For a feature map of size H×W×256, the computational intensity of a 3×3 depthwise convolution and a 1×1 pointwise convolution are 9 and 256, respectively. As a result, when running the two layers together, we either have to enlarge the on-chip buffer to satisfy the pointwise convolution, or give up grouping the computation of the two layers.
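A small sketch of the intensity calculation used above (MAdds divided by the input feature-map size, with weights assumed already resident on chip); the 3×3 kernel and the 256-channel feature map mirror the depthwise/pointwise example in the text, while the spatial size is an arbitrary choice that cancels out:

```python
def computational_intensity(k, c_in, c_out, groups, h, w):
    """MAdds of a k x k (group) convolution divided by the size of its
    input feature map; the weights are assumed to be resident on chip."""
    madds = h * w * (k * k) * (c_in // groups) * c_out
    feature_size = h * w * c_in
    return madds / feature_size

# Depthwise 3x3 on an H x W x 256 feature map -> intensity 9
print(computational_intensity(k=3, c_in=256, c_out=256, groups=256, h=56, w=56))
# Pointwise 1x1 on the same feature map -> intensity 256
print(computational_intensity(k=1, c_in=256, c_out=256, groups=1, h=56, w=56))
```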

(a) Normal block.
(b) Down sampling block.
Figure 1: Variable Group Network.

3 Variable Group Network

Based on the two guidelines above, we propose a novel network in this section. To balance the computational intensity, we set the number of channels in a group to be a constant throughout the network, resulting in a variable number of groups in each convolution layer. The motivation for fixing the number of channels per group is easy to see from the MAdds of a group convolution with kernel size $k$, $c_{in}$ input channels, $c_{out}$ output channels and $g$ groups on an $h \times w$ feature map:

$$\mathrm{MAdds} = h \cdot w \cdot k^{2} \cdot \frac{c_{in}}{g} \cdot c_{out} = h \cdot w \cdot k^{2} \cdot S \cdot c_{out}, \qquad S = \frac{c_{in}}{g},$$

where $S$ denotes the number of channels in a group. Thus, if the size of the feature map is a constant, then by fixing $S$ the computational intensity inside a block is more balanced. Further, $S$ can be set to match the configuration of the processing elements, which process a fixed number of channels at a time.
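To make the balancing effect concrete, the following sketch checks that the intensity of a variable group convolution is independent of the absolute channel width once $S$ is fixed. The group size $S=8$, the 3×3 kernel and the 2× channel expansion are assumptions used for illustration, not values prescribed by the text at this point:

```python
def group_conv_intensity(k, c_in, c_out, S, h, w):
    """Intensity = MAdds / input-feature size for a k x k convolution whose
    groups each contain S input channels (g = c_in / S)."""
    g = c_in // S
    madds = h * w * (k * k) * (c_in // g) * c_out
    return madds / (h * w * c_in)

# With S fixed, intensity depends only on k, S and the ratio c_out / c_in:
for c in (64, 128, 256):
    print(c, group_conv_intensity(k=3, c_in=c, c_out=2 * c, S=8, h=14, w=14))
# -> 144.0 at every width (9 * 8 * 2)
```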

Compared with depthwise convolution, variable group convolution increases the MAdds as well as the expressiveness sandler2018mobilenetv2. As a result, we can reduce the channel number of the intermediate feature maps while keeping the same generalization ability as previous networks. Specifically, we design the network blocks shown in Fig. 1. For the normal block used in the early stages of the network, the weights are still relatively small, so the weights of all four layers can be cached in on-chip memory. In the later stages, where the channel numbers and hence the weight sizes increase, the normal block can still be optimized by loading only one variable group conv and one pointwise conv at a time. Similarly, the operations in the down sampling block are friendly to compiler-side and hardware-side optimizations. The whole computing process for a normal block is illustrated in Fig. 2. Based on the architecture of MobileNetV1 howard2017mobilenets, we then substitute its basic blocks with ours; the resulting network architecture is detailed in Tab. 1. Another architecture, based on ShuffleNet v2, is shown in Tab. 2.
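For concreteness, here is a minimal PyTorch sketch of a variable group convolution layer and a normal block following the description above (C channels expanded to 2C by a variable group conv, then reduced back to C by a pointwise conv, twice, plus a shortcut). The group size S=8, the activation placement and the use of batch normalization are our assumptions, not the exact published configuration.

```python
import torch
import torch.nn as nn

def var_group_conv(in_ch, out_ch, S=8, stride=1):
    """3x3 convolution whose number of groups varies so that every group
    always sees S input channels (the 'variable group' convolution)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1,
                  groups=in_ch // S, bias=False),
        nn.BatchNorm2d(out_ch),
    )

def pointwise_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class NormalBlock(nn.Module):
    """Sketch of the normal block: two expand-reduce pairs plus a shortcut,
    so that all four weight tensors can be cached on chip together."""
    def __init__(self, channels, S=8):
        super().__init__()
        self.body = nn.Sequential(
            var_group_conv(channels, 2 * channels, S), nn.ReLU(inplace=True),
            pointwise_conv(2 * channels, channels), nn.ReLU(inplace=True),
            var_group_conv(channels, 2 * channels, S), nn.ReLU(inplace=True),
            pointwise_conv(2 * channels, channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.body(x))

# Example: a normal block at a 14x14 stage with 256 channels
block = NormalBlock(256)
y = block(torch.randn(1, 256, 14, 14))   # -> torch.Size([1, 256, 14, 14])
```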

Figure 2: Computing scheme of a normal block in Variable Group Network. The weights of the four convolution operations are first loaded into on-chip memory, and the features are then processed.
Layer Output Size KSize Stride Repeat Output Channels
0.25x 0.5x 0.75x 1x 1.25x 1.5x 1.75x
Image 224 x 224 3 3 3 3 3 3 3
Conv 1 112 x 112 3 x 3 2 1 8 16 24 32 40 48 56
DownSample 56 x 56 2 3 16 32 48 64 80 96 112
DownSample 28 x 28 2 1 32 64 96 128 160 192 224
DownSample 14 x 14 2 1 64 128 192 256 320 384 448
Stage Block 14 x 14 1 2 64 128 192 256 320 384 448
DownSample 7 x 7 2 1 128 256 384 512 640 768 896
Stage Block 7 x 7 1 1 128 256 384 512 640 768 896
Conv 5 7 x 7 1 x 1 1 1 1024 1024 1024 1024 1280 1536 1792
Global Pool 1 x 1 7 x 7
FC 1000 1000 1000 1000 1000 1000 1000
Table 1: Overall architecture of Variable Group Network v1.
Layer Output Size KSize Stride Repeat Output Channels
0.25x 0.5x 0.75x 1x 1.25x 1.5x 1.75x 2x
Image 224 x 224 3 3 3 3 3 3 3 3
Conv 1 112 x 112 3 x 3 2 1 8 16 24 32 40 48 56 64
Head Block 56 x 56 2 1 8 16 24 32 40 48 56 64
Stage 2 28 x 28 2 1 16 32 48 64 80 96 112 128
28 x 28 1 2
Stage 3 14 x 14 2 1 32 64 96 128 160 192 224 256
14 x 14 1 6
Stage 4 7 x 7 2 1 64 128 192 256 320 384 448 512
7 x 7 1 3
Conv 5 7 x 7 1 x 1 1 1 1024 1024 1024 1024 1280 1536 1792 2048
Global Pool 1 x 1 7 x 7
FC 1000 1000 1000 1000 1000 1000 1000 1000
Table 2: Overall architecture of Variable Group Network v2. Head Block is a modified version of the Normal Block, obtained by setting the stride to 2 and keeping the channel numbers unchanged after the two variable group convolution layers.
(a)
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 63.80% 1.44M 55M 128
0.5 69.71% 2.23M 157M 256
0.75 72.38% 3.43M 309M 384
1 73.64% 5.02M 509M 512
1.25 74.34% 7.42M 767M 640
1.5 74.47% 10.28M 1.05G 768
(b)
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 64.90% 1.5M 75M 128
0.5 70.40% 2.37M 198M 256
0.75 72.60% 3.66M 370M 384
1 73.90% 5.33M 590M 512
1.25 74.70% 7.8M 869M 640
1.5 75.00% 10.7M 1.17G 768
1.75 75.30% 14.1M 1.54G 1024
(c) Comparison network: MobileNet v1
Model Scale Acc(top1) Model size MAdds Max Channels
0.35 60.4% 0.7 M 72 M 358
0.6 68.6% 1.7 M 201 M 614
0.85 72.0% 3.1M 394 M 870
1.0 73.3% 4.1M 542 M 1024
1.05 73.5% 4.4 M 594 M 1075
1.3 74.7% 6.4 M 903 M 1331
1.5 75.1% 8.3 M 1.17 G 1536
Table 3: VarGNet v1 performance on ImageNet. (S is the number of channels in a group.)
(a)
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 59.39% 1.27M 35M 64
0.5 66.98% 1.72M 92M 128
0.75 70.42% 2.35M 173M 192
1 72.76% 3.19M 278M 256
1.25 74.08% 4.55M 411M 320
1.5 74.91% 6.14M 569M 384
2 75.44% 10.0M 961M 512
(b)
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 59.81% 1.35M 51M 64
0.5 67.80% 1.87M 124M 128
0.75 70.36% 2.58M 222M 192
1 73.10% 3.49M 343M 256
1.25 74.34% 4.94M 492M 320
1.5 75.04% 6.60M 666M 384
1.75 75.49% 8.50M 866M 448
2 75.71% 10.6M 1.06G 512
(c) Comparison network: ShuffleNet v2
Model Scale Acc(top1) Model size MAdds Max Channels
0.25 (60) 63.85% 1.47M 51M 240
0.5 (108) 68.74% 2.1M 123M 432
0.75 (154) 71.65% 2.92M 223M 616
1 (196) 73.17% 3.87M 342M 784
1.25 (228) 74.15% 6.63M 494M 912
1.5 (270) 74.56% 8.06M 666M 1080
1.75 (312) 75.24% 9.68M 863M 1248
Table 4: VarGNet v2 performance on ImageNet. (S is the number of channels in a group.)

4 Experiments

4.1 ImageNet Classification

The results of our models on ImageNet are presented in Tab. 3 and Tab. 4. Training hyperparameters are: batch size 1024, crop ratio 0.875, learning rate 0.4 with a cosine learning rate schedule, weight decay 4e-5, and 240 training epochs. We observe that VarGNet v1 performs better than MobileNet v1, as shown in Tab. 3. From (c) in Tab. 4, we can see that when the model scale is small, VarGNet v2 performs worse than ShuffleNet v2, due to the fewer channels used in VarGNet v2; when the model size is large, our network performs better.
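For reference, a minimal sketch of the cosine learning-rate schedule mentioned above, starting from the quoted base learning rate of 0.4; warm-up details, if any, are not specified in the text, and the iteration count below is only an estimate for ImageNet at batch size 1024:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.4):
    """Cosine decay from base_lr down to 0 over the whole training run."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

# e.g. 240 epochs, roughly 1251 iterations per epoch at batch size 1024
total = 240 * 1251
print(cosine_lr(0, total), cosine_lr(total // 2, total), cosine_lr(total, total))
# -> 0.4, 0.2, 0.0 (approximately)
```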

4.2 Object Detection

In Tab. 5, we present the performance of our proposed VarGNet together with comparison methods. We evaluate object detection on the COCO dataset Lin2014MicrosoftCC and compare with other state-of-the-art lightweight architectures. We choose FPN-based Faster R-CNN Lin2017FeaturePN as the framework, and all experiments share the same settings: input resolution 800×1333 and 18 training epochs. Since we find that ShuffleNet v2 achieves better accuracy when trained longer, we additionally train a 27-epoch model for ShuffleNet v2. At test time, 1000 proposals per image are evaluated in the RPN stage. We train on the train+val set excluding 8000 minival images and test on the minival set.

Network MAdds (G) mAP
MobileNet v1 1.0 24.15 31.1
MobileNet v2 1.0 18.71 31.0
ShuffleNet v1 1.0 15.31 27.9
ShuffleNet v2 1.0 15.55 27.5
ShuffleNet v2 1.0 (27 epochs) 15.55 28.9
VarGNet v1 1.0 24.91 33.7
VarGNet v2 0.5 14.98 28.6
VarGNet v2 1.0 19.61 33.3
Table 5: Performance on COCO object detection with FPN-based Faster R-CNN. The input image size is 800×1333.

4.3 Pixel Level Parsing

4.3.1 Cityscapes

On the Cityscapes dataset cordts2016cityscapes, we design a multi-task structure (Fig. 3(a)) to conduct two important pixel-level parsing tasks: single-image depth prediction and segmentation.

(a) The multi-task network used in Cityscapes experiments.
(b) The U-Net style network used in KITTI experiments.
Figure 3: Network architectures.
Training setup.

We use the standard Adam optimizer with weight decay 1e-5 and batch size 16. The learning rate is initialized to 1e-4 and follows a polynomial decay with power 0.9. Training runs for 100 epochs. For data augmentation, random horizontal flipping is used and images are resized with a scale randomly chosen from 0.6 to 1.2. For multi-task training, the loss is a weighted sum of the per-task losses, $L = \sum_{t} \lambda_{t} L_{t}$; the weights are set differently for the panoptic-segmentation-only setting and for the setting with the additional depth task.
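A minimal sketch of the polynomial learning-rate decay described above (power 0.9, base learning rate 1e-4); applying it per iteration rather than per epoch is our assumption:

```python
def poly_lr(step, total_steps, base_lr=1e-4, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - step / total_steps) ** power."""
    return base_lr * (1.0 - step / float(total_steps)) ** power

print(poly_lr(0, 1000), poly_lr(500, 1000), poly_lr(999, 1000))
```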

Results.

Parameters and MAdds of the comparison methods are presented in Tab. 6. Results and some visual examples for segmentation and depth prediction are shown in Tab. 7 and Fig. 4, respectively. These tables demonstrate the advantage of the proposed VarGNet v1 and v2: they are efficient and perform on par with much larger networks.

Method Backbone MAdds(G) Params
SegNet badrinarayanan2017segnet VGG16 286.0 29.5M
ENet paszke2016enet From scratch 3.8 0.4M
BiSeNet yu2018bisenet Xception39 2.9 5.8M
BiSeNet yu2018bisenet Res18 10.8 49.0M
MobileNet v2 - 6.82 7.64M
VarGNet v1 - 6.16 13.23M
VarGNet v2 - 2.76 7.41M
Table 6: Details of the comparison methods and ours on pixel-level parsing tasks with input size 640×360.
(a) Semantic Segmentation (image size 2048×1024)
Method Backbone Mean IoU(%)
BiSeNet yu2018bisenet Xception39 69.0
BiSeNet yu2018bisenet Res18 74.8
MobileNet v2 - 64.8
VarGNet v1 - 76.6
VarGNet v2 - 74.2
(b) Depth
Method AbsRel SqRel RMSE RMSE Log
MobileNet v2 0.167 3.22 15.46 0.553
VarGNet v1 0.092 1.327 8.864 0.163
VarGNet v2 0.096 1.404 8.85 0.168
(c) Panoptic Segmentation (MAdds calculated with 2048×1024 input size.)
Method Backbone MAdds PQ PQ(Things) PQ(Stuff) Mean IoU(%)
PFPnet Panoptic resnet101 533G 58.1 52 62.5 75.1
VarGNet v1 - 104G 57.1 50 62.3 73.4
VarGNet v2 - 68G 54.5 45.1 59.8 71.4
(d) Panoptic Segmentation + Depth (MAdds calculated with 2048×1024 input size.)
Method MAdds PQ PQ(Things) PQ(Stuff) Mean IoU(%) AbsRel RMSE
VarGNet v1 109G 56 48.8 61.3 71 0.1 9.2
VarGNet v2 70G 53.9 46.2 59.5 70.5 0.116 10.06
Table 7: Results on Cityscapes validation set.
Figure 4: Visual results on Cityscapes validation set (columns: input image, VarGNet v1, VarGNet v2, GT; rows: semantic segmentation, panoptic segmentation, depth).

4.4 KITTI

Training setup.

For the single-image depth prediction and stereo tasks on the KITTI dataset geiger2013vision, we present the performance of our VarGNet based models. A U-Net style architecture (Fig. 3(b)) is employed in the experiments. All depth models are trained on the KITTI RAW dataset: we test on the 697 images from the 29 scenes in the split of Eigen et al. eigen2014depth and train on about 23,488 images from the remaining 32 scenes. All results are evaluated with depth ranging from 0m to 80m and from 0m to 50m, using the same evaluation metrics as previous works. All stereo models are also trained on KITTI RAW: we test on the test split of Eigen et al. eigen2014depth and on the training set of KITTI 2015. The evaluation metrics for stereo are EPE and D1. During training, the standard SGD optimizer is used with momentum 0.9. The weight decay is set to 0.0001 for ResNet-18 and ResNet-50, and 0.00004 for the other networks. Training runs for 300 epochs; the initial learning rate is 0.001 and is decayed by 0.1 at epochs 120, 180 and 240. We train on 4 GPUs with a batch size of 24.

Results.

In Tab. 8 and Tab. 9, we show our depth and stereo results under various evaluation metrics, together with our own implementations of MobileNet and ResNet for comparison. Visual results are presented in Fig. 5 and Fig. 6.

(a) 0-80m
Method AbsRel SqRel RMSE RMSE Log δ<1.25 δ<1.25² δ<1.25³ MAdds(G) Params
MobileNet v2 1.0 0.103 0.744 4.686 0.17 0.888 0.966 0.987 36.8 7.6 M
MobileNet v2 0.5 0.112 0.865 5.01 0.183 0.869 0.959 0.983 10.0 1.9 M
MobileNet v2 0.25 0.113 0.831 4.988 0.183 0.866 0.96 0.985 2.9 539.2 K
ResNet 18 0.109 0.767 4.76 0.178 0.869 0.961 0.986 203.4 30.6 M
ResNet 50 0.109 0.788 4.796 0.18 0.868 0.959 0.984 247.5 46.7 M
VarGNet v1 1.0 0.105 0.798 4.92 0.175 0.883 0.965 0.986 36.0 13.2 M
VarGNet v1 0.5 0.107 0.803 4.86 0.175 0.881 0.964 0.986 12.8 3.8 M
VarGNet v1 0.25 0.113 0.845 5.003 0.18 0.87 0.962 0.986 5.1 1.2 M
VarGNet v2 1.0 0.108 0.823 4.898 0.176 0.881 0.965 0.986 20.0 7.4 M
VarGNet v2 0.5 0.111 0.851 4.98 0.179 0.874 0.961 0.985 7.7 2.2 M
VarGNet v2 0.25 0.118 0.9 5.11 0.186 0.863 0.959 0.985 3.3 788.1 K
(b) 0-50m
Method AbsRel SqRel RMSE RMSE Log δ<1.25 δ<1.25² δ<1.25³ MAdds(G) Params
MobileNet v2 1.0 0.097 0.557 3.424 0.155 0.903 0.972 0.989 36.8 7.6 M
MobileNet v2 0.5 0.106 0.649 3.665 0.167 0.886 0.966 0.986 10.0 1.9 M
MobileNet v2 0.25 0.106 0.63 3.693 0.168 0.883 0.966 0.988 2.9 539.2 K
ResNet 18 0.104 0.584 3.525 0.164 0.883 0.967 0.988 203.4 30.6 M
ResNet 50 0.104 0.592 3.521 0.165 0.883 0.965 0.987 247.5 46.7 M
VarGNet v1 1.0 0.098 0.578 3.534 0.158 0.899 0.973 0.99 36.0 13.2 M
VarGNet v1 0.5 0.1 0.603 3.535 0.159 0.897 0.97 0.989 12.8 3.8 M
VarGNet v1 0.25 0.106 0.637 3.648 0.165 0.887 0.969 0.989 5.1 1.2 M
VarGNet v2 1.0 0.101 0.612 3.556 0.16 0.896 0.971 0.989 20.0 7.4 M
VarGNet v2 0.5 0.104 0.635 3.639 0.163 0.89 0.968 0.988 7.7 2.2 M
VarGNet v2 0.25 0.112 0.681 3.768 0.171 0.88 0.966 0.988 3.3 788.1 K
Table 8: Depth results on KITTI test set.
(a) On KITTI RAW
Method EPE D1 MAdds(G) Params
MobileNet v2 1.0 1.424 0.0777 37.0 7.6 M
MobileNet v2 0.5 1.4904 0.0832 10.1 1.9 M
MobileNet v2 0.25 1.5897 0.0902 2.9 539.5 K
ResNet 18 1.5269 0.0886 205.4 30.6 M
ResNet 50 1.531 0.0887 249.5 46.7 M
VarGNet v1 1.0 1.3296 0.0703 36.1 13.2 M
VarGNet v1 0.5 1.4045 0.0757 12.9 3.8 M
VarGNet v1 0.25 1.5111 0.0835 5.1 1.2 M
VarGNet v2 1.0 1.3582 0.0728 20.7 7.4 M
VarGNet v2 0.5 1.44 0.079 8.0 2.2 M
VarGNet v2 0.25 1.5346 0.0862 3.4 790.2 K
(b) On KITTI 15
Method EPE D1 MAdds Params
MobileNet v2 1.0 1.7387 0.0753 37.0 7.6 M
MobileNet v2 0.5 1.6861 0.0772 10.1 1.9 M
MobileNet v2 0.25 1.6754 0.0819 2.9 539.5 K
ResNet 18 1.7318 0.0873 205.4 30.6 M
ResNet 50 1.7305 0.0868 249.5 46.7 M
VarGNet v1 1.0 1.5767 0.07 36.1 13.2 M
VarGNet v1 0.5 1.5868 0.0708 12.9 3.8 M
VarGNet v1 0.25 1.6685 0.0747 5.1 1.2 M
VarGNet v2 1.0 1.5856 0.0697 20.7 7.4 M
VarGNet v2 0.5 1.5994 0.0735 8.0 2.2 M
VarGNet v2 0.25 1.6302 0.0777 3.4 790.2 K
Table 9: Stereo results on KITTI.
Figure 5: Visualization of depth results on KITTI RAW. Panels: (a) input image, (b) GT, (c)-(e) MobileNet v2 at scales 1.0/0.5/0.25, (f) ResNet 18, (g) ResNet 50, (h)-(j) VarGNet v1 at 1.0/0.5/0.25, (k)-(m) VarGNet v2 at 1.0/0.5/0.25.
Figure 6: Visualization of stereo results on KITTI 2015. Panels: (a) left input image, (b) right input image, (c) GT, (d)-(f) MobileNet v2 at 1.0/0.5/0.25, (g) ResNet 18, (h) ResNet 50, (i)-(k) VarGNet v1 at 1.0/0.5/0.25, (l)-(n) VarGNet v2 at 1.0/0.5/0.25.

4.5 Face Recognition

All networks are trained on the DeepGlint MS-Celeb-1M-v1c dataset dg, a cleaned version of MS-Celeb-1M guo2016ms, which contains 3,923,399 aligned face images from 86,876 identities. LFW huang2008labeled, CFP-FP sengupta2016frontal and AgeDB-30 moschoglou2017agedb are used as validation datasets, and all models are finally evaluated on MegaFace Challenge 1 nech2017level. Tab. 10 lists the best face recognition accuracies on the validation datasets, as well as the face verification true accepted rates at a false accepted rate of 1e-6 on the refined version of the MegaFace dataset deng2018arcface. We use MobileNet v1 and MobileNet v2 as baseline models. To adapt to the 112×112 input image size, the stride of the first convolutional layer is set to 1 for each baseline and VarGNet model. To achieve better performance, we further replace the pooling layer with a “BN-Dropout-FC-BN” structure as in InsightFace deng2018arcface, followed by the ArcFace loss deng2018arcface. The standard SGD optimizer is used with momentum 0.9, and the batch size is set to 512 on 8 GPUs. The learning rate starts at 0.1 and is divided by 10 at 100K, 140K and 160K iterations. The weight decay is 5e-4. The embedding feature dimension is 256 with a dropout rate of 0.4. The normalization scale is 64 and the ArcFace margin is set to 0.5. All training is based on the InsightFace toolbox deng2018arcface.
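As an illustration of the loss used above, here is a simplified ArcFace sketch with the quoted scale (64) and margin (0.5). The actual training uses the InsightFace toolbox; the numerical-stability easing of the official implementation is omitted here, and the function name is ours:

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, class_weights, labels, s=64.0, m=0.5):
    """Additive angular margin logits: add margin m to the target-class
    angle, then scale by s before the usual softmax cross-entropy."""
    cos = F.linear(F.normalize(embeddings), F.normalize(class_weights))
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    one_hot = F.one_hot(labels, class_weights.size(0)).bool()
    cos_margin = torch.where(one_hot, torch.cos(theta + m), cos)
    return s * cos_margin

# usage: logits = arcface_logits(emb, W, y); loss = F.cross_entropy(logits, y)
```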

Networks MAdds LFW huang2008labeled CFP-FP sengupta2016frontal AgeDB-30 moschoglou2017agedb MegaFace deng2018arcface
MobileNet v1 554M 0.99617 0.89714 0.96600 0.935848
MobileNet v2 313M 0.99500 0.86386 0.95583 0.898219
VarGNet v1 603M 0.99733 0.88929 0.97583 0.961499
VarGNet v2 355M 0.99733 0.89829 0.97333 0.954261
Table 10: Face recognition results.

References