However, implementing face recognition on systems with limited computational budgets, such as mobile and embedded devices, remains challenging because of the large number of identities that must be classified.
Many works propose lightweight networks for common computer vision tasks, such as SqueezeNet, MobileNet, MobileNetV2, and ShuffleNet. SqueezeNet makes extensive use of $1\times1$ convolution, achieving $50\times$ fewer parameters than AlexNet while maintaining AlexNet-level accuracy on ImageNet. MobileNet uses depthwise separable convolution to trade off latency against accuracy. Building on this work, MobileNetV2 proposes an inverted bottleneck structure to enhance the discriminative ability of the network. ShuffleNet uses pointwise group convolution and channel shuffle operations to further reduce computational cost. Although these networks are cheap at inference time and perform well in various applications, they remain hard to optimize on embedded hardware and the corresponding compilers. To resolve this conflict, VarGNet proposes variable group convolution, which efficiently addresses the imbalance of computational intensity inside a block. We also find that variable group convolution has a larger capacity than depthwise convolution with the same kernel size, which helps the network extract more essential information. However, VarGNet is designed for general tasks such as image classification and object detection. Its head setting reduces the spatial area by half to save memory and computational cost, which is unsuitable for face recognition because detailed facial information is necessary. Moreover, there is only an average pooling layer between the last conv and the fully connected layer of the embedding, which may not extract enough discriminative information.
Based on VarGNet, we propose an efficient variable group convolutional network for lightweight face recognition, abbreviated as VarGFaceNet. To enhance the discriminative ability of VarGNet for large-scale face recognition, we first add SE blocks and PReLU to the blocks of VarGNet. We then remove the downsampling at the start of the network to preserve more information. To reduce the number of parameters, we apply variable group convolution to shrink the feature tensor to $1\times1\times512$ before the fc layer. The performance of VarGFaceNet demonstrates that this embedding setting preserves discriminative ability while reducing the parameters of the network.
To enhance the interpretation ability of the lightweight network, we apply knowledge distillation during training. Several approaches aim to make deep networks smaller and more cost-efficient, such as model pruning, model quantization, and knowledge distillation. Among them, knowledge distillation is being actively investigated because of its architectural flexibility. Hinton et al. introduce the concept of knowledge distillation and propose using the softmax output of the teacher network to transfer knowledge. To take better advantage of the information in the teacher network, FitNets adopts the idea of feature distillation and encourages the student network to mimic the hidden feature values of the teacher network. After FitNets, several variants attempt to exploit the knowledge of the teacher network, for example by transferring feature activation maps or activation-based and gradient-based attention maps. Recently, ShrinkTeaNet introduced an angular distillation loss that focuses on the angular information of the teacher model. Inspired by the angular distillation loss, we employ an equivalent loss with better implementation efficiency to guide VarGFaceNet. Moreover, to ease the optimization difficulty caused by the discrepancy between the teacher and student models, we introduce recursive knowledge distillation, which treats the student model trained in one generation as the pretrained model for the next generation.
We evaluate our model and approach on the LFR challenge, a lightweight face recognition challenge that requires networks with fewer than 1G FLOPs and a memory footprint under 20M. VarGFaceNet achieves state-of-the-art performance on this challenge, as shown in Section 3. Our contributions are summarized as follows:
- To improve the discriminative ability of VarGNet in large-scale face recognition, we employ a different head setting and propose a new embedding block. In the embedding block, we first expand the channels to 1024 with a $1\times1$ convolution layer to retain essential information, then use variable group convolution and pointwise convolution to shrink the spatial area to $1\times1$ while saving computational cost. These settings improve performance on face recognition tasks, as shown in Section 3.
- To improve the generalization ability of lightweight models, we propose recursive knowledge distillation, which narrows the generalization gap that remains between teacher and student models after a single generation.
- We analyse the efficiency of variable group convolution and employ an equivalent of the angular distillation loss during training. Experiments are conducted to show the effectiveness of our approach.
2.1 Variable Group Convolution
Group convolution was first introduced in AlexNet to reduce computational cost on GPUs. Later, ResNeXt demonstrated that increasing the cardinality of group convolution yields better performance than increasing depth or width. Designed for mobile devices, MobileNet and MobileNetV2 proposed depthwise separable convolution, inspired by group convolution, to save computational cost while keeping the discriminative ability of convolution. However, depthwise separable convolution spends 95% of its computation time in Conv $1\times1$, which causes a large MAdds gap between two consecutive layers (Conv $1\times1$ and Conv DW $3\times3$). This gap is unfriendly to embedded systems that load all weights of the network to perform convolution: such systems need extra buffers for Conv $1\times1$.
To keep the computational intensity balanced inside a block, VarGNet sets the number of channels in a group to a constant $S$. A constant number of channels per group leads to a variable number of groups in a convolution, hence the name variable group convolution. The computational cost of a variable group convolution is:
$$k^2 \times h_i \times w_i \times S \times c_{i+1} \tag{1}$$
The input of this layer is $h_i \times w_i \times c_i$ and its output is $h_i \times w_i \times c_{i+1}$; $k$ is the kernel size. When variable group convolution is used to replace depthwise convolution in MobileNet, the computational cost of the following pointwise convolution is:
$$1^2 \times h_i \times w_i \times c_{i+1} \times c_{i+2} \tag{2}$$
The ratio of computational cost between variable group convolution and pointwise convolution is $\frac{k^2 S}{c_{i+2}}$, while that between depthwise convolution and pointwise convolution is $\frac{k^2}{c_{i+2}}$. In practice, $c_{i+2} \gg k^2$ and $S > 1$, so
$$\frac{k^2 S}{c_{i+2}} > \frac{k^2}{c_{i+2}} \tag{3}$$
Hence, a block is more computationally balanced when variable group convolution, rather than depthwise convolution, is placed before the pointwise convolution.
Moreover, $S > 1$ means that variable group convolution has higher MAdds and larger network capacity than depthwise convolution with the same kernel size, and can therefore extract more information.
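The cost comparison above can be checked numerically. The following is a minimal sketch rather than the authors' code: `conv_madds` and `block_costs` are hypothetical helper names, and the feature-map size (14×14), the channel counts (256), and the 8 channels per group are example values assumed for illustration.

```python
def conv_madds(k, h, w, c_in, c_out, groups=1):
    """Multiply-adds of a k x k convolution producing an h x w output map."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees the c_in // groups input channels of its group.
    return k * k * h * w * (c_in // groups) * c_out


def block_costs(k, h, w, c, c_next, s):
    """Costs of a depthwise conv, a variable group conv, and the pointwise conv after it."""
    depthwise = conv_madds(k, h, w, c, c, groups=c)        # 1 channel per group
    var_group = conv_madds(k, h, w, c, c, groups=c // s)   # s channels per group
    pointwise = conv_madds(1, h, w, c, c_next)
    return depthwise, var_group, pointwise


# Example values (assumptions): 3x3 kernel, 14x14 map, 256 channels, 8 channels per group.
dw, vg, pw = block_costs(k=3, h=14, w=14, c=256, c_next=256, s=8)
print(f"depthwise / pointwise      = {dw / pw:.4f}")  # equals k^2 / c_next
print(f"variable group / pointwise = {vg / pw:.4f}")  # equals k^2 * s / c_next
```

With these example values the variable group convolution costs exactly 8 times as much as the depthwise one, which narrows the gap to the pointwise layer that follows it.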
2.2 Blocks of Variable Group Network
When a block is grouped and computed together on an embedded system, communication between off-chip memory and on-chip memory happens only at the start and the end of the block computation. To limit the communication cost, VarGNet sets the number of output channels equal to the number of input channels in the normal block. Meanwhile, VarGNet expands the channels at the start of the block using variable group convolution to keep the generalization ability of the block. The normal block we use is shown in Fig. 1(a), and the downsampling block is shown in Fig. 1(b). Unlike the blocks in VarGNet, we add an SE block to the normal block and employ PReLU instead of ReLU to increase the discriminative ability of the block.
| Layer | Output Size | KSize | Stride | Repeat | Output Channels |
2.3 Lightweight Network for Face Recognition
2.3.1 Head setting
The main challenge of face recognition is the large number of identities involved in the training and testing phases, which demands as much discriminative ability as possible to distinguish millions of identities. To preserve this ability in lightweight networks, we use a convolution with stride 1 at the start of the network instead of the stride-2 convolution used in VarGNet, similar to the input setting of ArcFace [2]. The output feature map of the first conv in VarGNet is downsampled, whereas ours keeps the input size, as shown in Fig. 1(c).
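The effect of the head stride on the feature-map size can be illustrated with the standard convolution output-size formula. This is only a sketch under assumptions not stated in the text: a 112×112 input, a 3×3 kernel, and padding 1; `conv_out_size` is a hypothetical helper name.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


INPUT_SIZE = 112  # assumed input resolution
KERNEL, PADDING = 3, 1  # assumed kernel size and padding of the first conv

stride1 = conv_out_size(INPUT_SIZE, KERNEL, stride=1, padding=PADDING)
stride2 = conv_out_size(INPUT_SIZE, KERNEL, stride=2, padding=PADDING)

print(f"stride 1 head: {stride1} x {stride1}")
print(f"stride 2 head: {stride2} x {stride2}")
print(f"spatial positions kept by stride 1: {stride1 ** 2 // stride2 ** 2}x more")
```

Under these assumptions the stride-1 head keeps four times as many spatial positions after the first convolution, which is the extra detail the head setting aims to preserve.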
2.3.2 Embedding setting
To obtain the face embedding, many works [2, 16] apply a fully connected layer directly on top of the last convolution. However, this fully connected layer has a huge number of parameters when the output feature map of the last convolution is relatively large. For instance, in ResNet 100 the output of the last conv is $7\times7\times512$, so the fc layer (with embedding size 512) has $7\times7\times512\times512$ parameters. That is 12.25M parameters in total, with a memory footprint of 49M in float32.
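The figures quoted above (12.25M parameters, 49M in float32) can be reproduced with a few lines of arithmetic. The sketch assumes a 7×7×512 final feature map, a 512-dimensional embedding, no bias term, and that "M" means 2^20; `fc_params` is a hypothetical helper name.

```python
MIB = 2 ** 20            # assumption: "M" is counted in units of 2**20
BYTES_PER_FLOAT32 = 4


def fc_params(h, w, c, embedding_size):
    """Weights of a fully connected layer applied to a flattened h x w x c tensor."""
    return h * w * c * embedding_size


params = fc_params(7, 7, 512, 512)
params_m = params / MIB
memory_m = params * BYTES_PER_FLOAT32 / MIB

print(f"parameters: {params} ({params_m} M)")
print(f"float32 footprint: {memory_m} M")
```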
To design a lightweight network (memory footprint below 20M, FLOPs below 1G), we employ variable group convolution after the last conv to shrink the feature maps to $1\times1\times512$ before the fc layer. Consequently, the memory footprint of the fc layer for embedding is only 1M. Fig. 1(d) shows the setting of the embedding block. Shrinking the feature tensor to $1\times1\times512$ before the fc layer is risky because the information contained in this feature tensor is limited. To avoid losing essential information, we expand the channels after the last conv to retain as much information as possible. We then employ variable group convolution and pointwise convolution to reduce the parameters and computational cost while keeping the information.
Specifically, we first use a $1\times1$ Conv to expand the channels from 320 to 1024. Then we employ a $7\times7$ variable group convolution layer (8 channels in a group) to shrink the feature tensors from $7\times7\times1024$ to $1\times1\times1024$. Finally, pointwise convolution is used to connect the channels and output feature tensors of size $1\times1\times512$. The new embedding block takes up only 5.78M on disk, whereas the original fc layer takes up 30M ($7\times7\times320\times512$).
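The 5.78M and 30M figures above can be cross-checked by counting weights. The sketch below is an illustration under assumptions: the layer shapes (1×1 expansion, 7×7 variable group convolution with 8 channels per group, 1×1 pointwise convolution, 512-to-512 fc) are chosen because they reproduce the quoted sizes, biases and normalization parameters are ignored, and "M" is taken as 2^20 bytes of float32 storage. `conv_params` is a hypothetical helper name.

```python
MIB = 2 ** 20
BYTES_PER_FLOAT32 = 4


def conv_params(k, c_in, c_out, group_channels=None):
    """Weights of a k x k conv; group_channels is the number of input channels per group."""
    inputs_per_filter = c_in if group_channels is None else group_channels
    return k * k * inputs_per_filter * c_out


# Assumed layer shapes of the embedding block.
layers = {
    "1x1 conv, 320 -> 1024": conv_params(1, 320, 1024),
    "7x7 variable group conv, 8 channels per group": conv_params(7, 1024, 1024, group_channels=8),
    "1x1 pointwise conv, 1024 -> 512": conv_params(1, 1024, 512),
    "fc, 512 -> 512": 512 * 512,
}

embedding_m = sum(layers.values()) * BYTES_PER_FLOAT32 / MIB
direct_fc_m = 7 * 7 * 320 * 512 * BYTES_PER_FLOAT32 / MIB

for name, count in layers.items():
    print(f"{name}: {count} weights")
print(f"embedding block: {embedding_m:.2f} M, direct fc: {direct_fc_m:.2f} M")
```

Most of the saving comes from the grouped 7×7 layer: each of its filters reads only 8 input channels instead of all 1024.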
The comparison between our network and VarGNet in Section 3.3 demonstrates the efficiency of our network on face recognition tasks.
2.3.3 Overall architecture
The overall architecture of our lightweight network (VarGFaceNet) is illustrated in Table 1. The memory footprint of VarGFaceNet is 20M and its FLOPs are 1G. We empirically set $S = 8$ channels in a group. Benefiting from variable group convolution, the head setting, and the dedicated embedding setting, VarGFaceNet achieves good performance on face recognition with limited computational cost and parameters. In Section 3, we demonstrate the effectiveness of our network on a face recognition task with a million distractors.
2.4 Angular Distillation Loss
Knowledge distillation is widely used in lightweight network training because it can transfer the interpretation ability of a big network to a smaller one. Most tasks that use knowledge distillation are closed-set tasks [18, 10]. They apply scores/logits or embeddings/feature magnitudes to compute an $\ell_2$ distance or cross entropy as the loss. However, for open-set tasks, the scores/logits of the training set contain limited information about the testing set, and an exact match of features may be over-regularized in some situations. To extract useful information and avoid over-regularization, ShrinkTeaNet [5] proposes an angular distillation loss for knowledge distillation:
$$L_a(F_t^i, F_s^i) = \frac{1}{N}\sum_{i=1}^{N}\left\| 1 - \frac{F_t^i}{\|F_t^i\|}\cdot\frac{F_s^i}{\|F_s^i\|} \right\|_2^2 \tag{4}$$
$F_t^i$ is the feature of the teacher model for the $i$-th sample, $F_s^i$ is the corresponding feature of the student model, and $N$ is the number of samples in a batch. Eq. 4 first computes the cosine similarity between the features of teacher and student, then minimizes the $\ell_2$ distance between this similarity and 1. Inspired by [5], we propose to use Eq. 5 to improve implementation efficiency. Since cosine similarity is at most 1, minimizing Eq. 4 is equivalent to minimizing Eq. 5.
$$L_s(F_t^i, F_s^i) = \frac{1}{N}\sum_{i=1}^{N}\left\| \frac{F_t^i}{\|F_t^i\|} - \frac{F_s^i}{\|F_s^i\|} \right\|_2^2 \tag{5}$$
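The claimed equivalence can be made concrete: for unit-normalized features, the squared distance between teacher and student equals 2(1 - cos), so both forms decrease monotonically as the cosine similarity approaches 1. The sketch below checks this numerically on toy 2-D features. The function names `cosine_form` and `difference_form` are hypothetical labels for the per-sample terms of the two losses, not names from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def cosine_form(teacher, student):
    """Per-sample term of the cosine-based loss: (1 - cos)^2."""
    return (1.0 - cosine(teacher, student)) ** 2


def difference_form(teacher, student):
    """Per-sample term of the difference-based loss: squared distance of unit features."""
    t, s = normalize(teacher), normalize(student)
    return sum((x - y) ** 2 for x, y in zip(t, s))


teacher = [1.0, 0.0]
for degrees in (0, 30, 60, 90, 120):
    angle = math.radians(degrees)
    student = [3.0 * math.cos(angle), 3.0 * math.sin(angle)]  # feature magnitude is irrelevant
    print(degrees,
          round(cosine_form(teacher, student), 4),
          round(difference_form(teacher, student), 4),
          round(2 * (1 - cosine(teacher, student)), 4))
```

The last two printed columns coincide, and both loss terms grow with the angle, so they share the same minimizer even though their values differ.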
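In addition, we employ ArcFace [2] as our classification loss, which also attends to angular information:
$$L_{Arc} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1, j\neq y_i}^{n} e^{s\cos\theta_j}} \tag{6}$$
Here $\theta_j$ is the angle between the feature of the $i$-th sample and the weight of class $j$, $y_i$ is the label of the sample, $n$ is the number of classes, $s$ is the feature scale, and $m$ is the angular margin.
To sum up, the objective function used in training combines the classification loss with the angular distillation loss $L_s$ of Eq. 5, weighted by $\alpha$:
$$L = L_{Arc} + \alpha L_s \tag{7}$$
We set $\alpha$ empirically in our implementation.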
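As an illustration of how a margin-based classification term and a weighted distillation term combine, the following sketch computes an ArcFace-style loss for a single sample and adds a distillation term. It is a simplified sketch, not the training code: the scale `s=64.0` and margin `m=0.5` are common ArcFace defaults assumed here, the weight `alpha` is a hypothetical placeholder because its value is not given above, and the usual special handling of angles where theta + m exceeds pi is omitted.

```python
import math


def arcface_loss(cosines, label, s=64.0, m=0.5):
    """Cross entropy over ArcFace logits for one sample.

    cosines[j] is cos(theta_j) between the embedding and the weight of class j.
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == label:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error
            logits.append(s * math.cos(theta + m))      # additive angular margin
        else:
            logits.append(s * c)
    # Numerically stable log-sum-exp.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum - logits[label]


def total_loss(classification_loss, distillation_loss, alpha):
    """Classification term plus alpha-weighted distillation term."""
    return classification_loss + alpha * distillation_loss


cosines = [0.8, 0.1, -0.2]        # toy cosines to three class weights
arc = arcface_loss(cosines, label=0, s=10.0)
distill = 0.05                     # toy value of the angular distillation term
print(total_loss(arc, distill, alpha=1.0))  # alpha=1.0 is a placeholder, not the paper's value
```

Setting `m=0` recovers ordinary softmax cross entropy on scaled cosines, which makes the effect of the margin easy to inspect in isolation.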
Table 2: VarGFaceNet vs. y2. Performance is recorded at the same epoch. The validation performance of VarGFaceNet is 0.6% and 0.2% higher than that of y2 on AgeDB-30 and CFP-FP, respectively. The testing result of VarGFaceNet is 5% higher than that of y2.
2.5 Recursive Knowledge Distillation
Knowledge distillation in a single generation sometimes struggles to transfer enough knowledge when a large discrepancy exists between the teacher and student models. For instance, in our implementation, the teacher model has 24G FLOPs while the student model has 1G, and the teacher model has 108M parameters while the student model has 5M. Moreover, the different architectures and block settings of the teacher and student models also increase the complexity of training. To improve the discriminative and generalization ability of our student network, we propose recursive knowledge distillation, which uses the first generation of the student to initialize the second generation, as shown in Fig. 2.
In recursive knowledge distillation, we employ the same teacher model in all generations, so the angular information of the samples that guides the student model does not change. Recursive knowledge distillation has two merits:
- A good initialization makes it easier to approach the direction guided by the teacher.
- The conflict between the margin of the classification loss and the guided angular information in the first generation is relieved in the next generation.
The results of our experiments in Section 3 illustrate the performance of recursive knowledge distillation.
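The control flow of recursive knowledge distillation can be sketched on a toy problem. Everything below is an assumption made for illustration: the "student" is a single 2-D feature vector rather than a network, `train_one_generation` is a hypothetical stand-in for a full training run that simply nudges the student direction toward the teacher's, and the step count and step size are arbitrary. Only the structure mirrors the description: the same teacher is used in every generation, and each generation starts from the previous student.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_loss(teacher, student):
    """Squared distance between unit-normalized teacher and student features."""
    t, s = normalize(teacher), normalize(student)
    return sum((a - b) ** 2 for a, b in zip(t, s))


def train_one_generation(teacher, init_student, steps=5, lr=0.2):
    """Toy stand-in for one training run, starting from init_student."""
    t = normalize(teacher)
    student = normalize(init_student)
    for _ in range(steps):
        # Move the student direction a little toward the teacher direction.
        student = normalize([s + lr * a for s, a in zip(student, t)])
    return student


def recursive_distillation(teacher, init_student, generations=2):
    """Run several generations; each one is initialized from the previous student."""
    student = init_student
    losses = []
    for _ in range(generations):
        student = train_one_generation(teacher, student)  # same teacher every generation
        losses.append(angular_loss(teacher, student))
    return student, losses


teacher_feature = [1.0, 0.0]
final_student, losses = recursive_distillation(teacher_feature, [0.0, 1.0])
print(losses)
```

The toy only shows the loop structure; it does not model the classification margin or the capacity gap between teacher and student that motivate the method.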
Table 5: Verification results on LFW and AgeDB-30 increase in the second generation. Performance on the testing set deepglint-light (TPR@FPR=1e-8) increases by 0.4% at the same time.
In this section, we first introduce the datasets and evaluation metric. Then, to demonstrate the effectiveness of VarGFaceNet, we compare our network with the y2 network (a deeper mobilefacenet [1, 2]). After that, we investigate the effect of different teacher models in knowledge distillation. Finally, we show the competitive performance of VarGFaceNet with recursive knowledge distillation on the LFR2019 Challenge.
3.1 Datasets and Evaluation Metric
We train on the dataset provided by LFR2019 (cleaned from MS1M). All face images in this dataset are aligned using five facial landmarks predicted by RetinaFace and then resized to $112\times112$. There are 5.1M images collected from 93K identities. For testing, the Trillion-pairs dataset is used. It contains two parts: 1) ELFW: face images of celebrities in the LFW name list, with 274K images from 5.7K identities; 2) DELFW: distractors for ELFW, with 1.58M face images from Flickr. All test images are preprocessed and resized to $112\times112$. In the following, we refer to the Trillion-pairs testing set as deepglint-light. During training, we use face verification datasets (e.g., LFW, CFP-FP, AgeDB-30) to validate different settings under the 1:1 verification protocol. Moreover, we employ TPR@FPR=1e-8 as the evaluation metric for identification.
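A metric of the form TPR@FPR can be computed from similarity scores as sketched below. This is a simplified illustration, not the challenge's evaluation script: `tpr_at_fpr` is a hypothetical helper, scores are assumed to be "higher means more similar", and the pairs are assumed to be already split into genuine (same identity) and impostor (different identity) sets.

```python
def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr):
    """True positive rate at a threshold whose false positive rate is at most target_fpr."""
    impostors = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs that may be wrongly accepted; the small epsilon
    # guards against floating-point products such as 0.999999... instead of 1.
    allowed_false_accepts = int(target_fpr * len(impostors) + 1e-9)
    if allowed_false_accepts >= len(impostors):
        threshold = float("-inf")
    else:
        # A pair is accepted only if its score is strictly above the threshold,
        # so at most `allowed_false_accepts` impostors pass.
        threshold = impostors[allowed_false_accepts]
    accepted = sum(1 for score in genuine_scores if score > threshold)
    return accepted / len(genuine_scores)


genuine = [0.9, 0.8, 0.4, 0.3]     # toy same-identity scores
impostor = [0.7, 0.35, 0.2, 0.1]   # toy different-identity scores
print(tpr_at_fpr(genuine, impostor, target_fpr=0.25))
print(tpr_at_fpr(genuine, impostor, target_fpr=0.0))
```

At a target as strict as 1e-8, at least 10^8 impostor comparisons are needed before even one false accept is allowed, which is why a large distractor set matters for this metric.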
3.2 VarGFaceNet train from scratch
To validate the efficiency and effectiveness of VarGFaceNet, we first train our network from scratch and compare its performance with mobilefacenet (y2) [1, 2]. We employ the ArcFace loss as the classification objective during training. Table 2 presents the comparison between VarGFaceNet and y2. Under the 1G FLOPs limit, VarGFaceNet reaches better face recognition performance on the validation sets. Compared with y2, our verification results on AgeDB-30 and CFP-FP increase by 0.6% and 0.2%, respectively, and the testing result on deepglint-light (TPR@FPR=1e-8) increases by 5%. There are two likely reasons for the better performance: 1. Thanks to variable group convolution, our network can contain more parameters than y2 under the same FLOPs limit. The largest number of channels is 256 in y2, while ours is 320 before the last conv. 2. Our embedding setting can extract more essential information. y2 expands the number of channels from 256 to 512 and then uses depthwise convolution to obtain the feature tensor before the fc layer. We expand the number of channels from 320 to 1024 and then use variable group convolution and pointwise convolution, which have larger network capacity.
3.3 VarGFaceNet guided by ResNet
To achieve higher performance than training from scratch, we use bigger networks to perform knowledge distillation with the angular distillation loss. Moreover, we conduct experiments to investigate the effect of different teacher models on VarGFaceNet. We employ ResNet 100 with SE as our teacher model; it has 24G FLOPs and 108M parameters. The results are shown in Table 3. Two observations can be made: 1. even though the architectures of teacher and student are quite different, VarGFaceNet still approaches the performance of ResNet; 2. the performance of VarGFaceNet is highly correlated with that of the teacher model. The better the teacher model performs, the better the interpretation ability VarGFaceNet learns.
To validate the efficiency of our settings, we conduct comparison experiments between our network and VarGNet. Using the same teacher network, we change the head setting of VarGNet to our head setting for a fair comparison and use the same loss function as above. In Table 4, the plain VarGNet has lower accuracy on LFW, CFP-FP, and AgeDB-30. VarGNet has only an average pooling layer between the last conv and the fc layer. The results illustrate that our embedding setting is more suitable for face recognition because it can extract more essential information.
3.4 Recursive Knowledge Distillation
As discussed in Section 2.5, when there is a large discrepancy between the teacher and student models, one generation of knowledge distillation may not be enough for knowledge transfer. To validate this, we use the ResNet 100 model as our teacher and conduct recursive knowledge distillation on VarGFaceNet. Table 5 shows a performance improvement when we train the model for another generation. The verification results on LFW and CFP-FP increase by 0.1%, while the testing result on deepglint-light (TPR@FPR=1e-8) is 0.4% higher than in the previous generation. Furthermore, we believe that continuing training for more generations will lead to better performance.
In this paper, we propose an efficient lightweight network called VarGFaceNet for large-scale face recognition. Benefiting from variable group convolution, VarGFaceNet finds a better trade-off between efficiency and performance. The head setting and embedding setting specific to face recognition help preserve information while reducing parameters. Moreover, to improve the interpretation ability of the lightweight network, we employ an equivalent of the angular distillation loss as our objective function and present a recursive knowledge distillation strategy. The state-of-the-art performance on the LFR challenge demonstrates the superiority of our method.
Acknowledgments We would like to thank Xin Wang, Helong Zhou, Zhichao Li, Xiao Jiang, Yuxiang Tuo for their helpful discussion.
[1] (2018) Mobilefacenets: efficient cnns for accurate real-time face verification on mobile devices. In Chinese Conference on Biometric Recognition, pp. 428–438.
[2] Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
[3] (2019) Lightweight face recognition challenge. In Proceedings of the IEEE International Conference on Computer Vision.
[4] (2019) RetinaFace: single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641.
[5] (2019) ShrinkTeaNet: million-scale lightweight face recognition via shrinking teacher-student networks. arXiv preprint arXiv:1905.10620.
[6] (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102.
[7] (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
[8] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[9] Knowledge transfer via distillation of activation boundaries formed by hidden neurons. arXiv preprint arXiv:1811.03233.
[10] (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[11] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[12] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
[13] (2008) Labeled faces in the wild: a database for studying face recognition in unconstrained environments.
[14] (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360.
[15] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[16] (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 212–220.
[17] (2017) Agedb: the first manually collected, in-the-wild age database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59.
[18] (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550.
[19] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
[20] (2016) Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–9.
[21] Trillionpairs. http://trillionpairs.deepglint.com/overview. Accessed July 2019.
[22] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
[23] (2019) DNNVM: end-to-end compiler leveraging heterogeneous optimizations on fpga-based cnn accelerators. arXiv preprint arXiv:1902.07463.
[24] A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4133–4141.
[25] (2019) VarGNet: variable group convolutional neural network for efficient embedded computing. arXiv preprint arXiv:1907.05653.
[26] (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.