QuantFace: Towards Lightweight Face Recognition by Synthetic Data Low-bit Quantization

06/21/2022
by   Fadi Boutros, et al.
Fraunhofer

Deep learning-based face recognition models follow the common trend in deep neural networks by utilizing full-precision floating-point networks with high computational costs. Deploying such networks in use-cases constrained by computational requirements is often infeasible due to the large memory required by the full-precision model. Previous compact face recognition approaches proposed to design special compact architectures and train them from scratch using real training data, which may not be available in a real-world scenario due to privacy concerns. We present in this work the QuantFace solution based on low-bit precision format model quantization. QuantFace reduces the required computational cost of the existing face recognition models without the need for designing a particular architecture or accessing real training data. QuantFace introduces privacy-friendly synthetic face data to the quantization process to mitigate potential privacy concerns and issues related to the accessibility to real training data. Through extensive evaluation experiments on seven benchmarks and four network architectures, we demonstrate that QuantFace can successfully reduce the model size up to 5x while maintaining, to a large degree, the verification performance of the full-precision model without accessing real training datasets.


I Introduction

Recent high-performing deep neural networks (DNNs) rely on over-parameterized architectures with high computational cost [21, 23]. State-of-the-art (SOTA) face recognition (FR) models followed this common trend by relying on over-parameterized DNNs [16, 6, 10]. However, deploying such extremely large models, with hundreds of megabytes (MB) of memory requirements, on embedded devices and in other use-cases constrained by computational capabilities and high throughput requirements is still challenging [36, 17].

Enabling FR in domains with limited computational capabilities requires designing a special architecture or compressing the current solutions to meet the computational requirements of the deployment environments. Several efficient FR models have been proposed in the literature [11, 35, 9, 56]. The core idea of many of these works is to utilize efficient architectures designed for common computer vision tasks, such as MixNets [49], MobileNetV2 [45], ShuffleNet [34], and VarGNet [60], for FR [5, 11, 35, 56]. Very recently, a few works [9, 51] proposed architectures based on face-specific neural architecture search (NAS) for efficient FR. However, none of the previous efficient FR works [11, 35, 5, 56, 9, 51] explored the potential of using model quantization to reduce the computational cost of existing widely used FR architectures, e.g., ResNet [21, 16] or SE-ResNet [23, 10].

Model quantization approaches compress the DNN by reducing the number of bits required to represent each weight, e.g., using a lower precision format than full-precision (FP) floating-point, such as 4-bit [3, 57] or 8-bit [31, 27] signed integer. Such methods have shown great success in reducing the computational cost of DNNs and are supported by most deep learning frameworks and accelerators [40, 1]. Model quantization enables performance gains in different areas. First, it reduces the model size, which can be directly measured using the number of bits required to represent each parameter. For example, applying 8-bit model quantization to ResNet100 [21, 16] (65.2M parameters) reduces the model size from 261.2MB to 65.3MB. Second, deep learning frameworks such as PyTorch [40] and TensorFlow [1] can run a quantized model faster than an FP one. For example, PyTorch [40] can run a quantized model 2-4x faster than the FP model and reduce the required memory bandwidth by 2-4x [40, 31]. However, the exact inference speed and memory bandwidth are highly dependent on the underlying hardware and deep learning framework [40]. Once the model is quantized, the model weights and quantization parameters need to be tuned and calibrated. These processes commonly require access to the training data, either entirely or partially [14].
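As a rough sanity check of these numbers, the following minimal Python sketch (our own illustration, not part of the paper's code) estimates the model size implied by a given bit width; the parameter count is the one reported for ResNet100, and the small offset to the reported sizes presumably comes from layers and buffers kept at higher precision.

```python
def model_size_mb(num_params: float, bits_per_param: int) -> float:
    """Approximate model size in megabytes for a given per-parameter bit width."""
    return num_params * bits_per_param / 8 / 1e6

resnet100_params = 65.2e6  # parameter count reported for ResNet100
for bits in (32, 8, 6):
    print(f"ResNet100 @ {bits:2d}-bit: {model_size_mb(resnet100_params, bits):6.1f} MB")
# 32-bit: ~260.8 MB, 8-bit: ~65.2 MB, 6-bit: ~48.9 MB
# (close to the reported 261.22 MB, 65.31 MB, and 49.01 MB)
```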

Fig. 1:

An overview of the proposed QuantFace framework. Given a noise vector Z sampled from a normal distribution, the pretrained generator produces a synthetic (fake) face image, which is fed into the FP and the b-bit quantized models. The KD loss is then computed between the normalized feature embeddings of the FP model and the quantized model.

This need for data when quantization is applied to FR networks follows the general reliance of deep FR models on large-scale training datasets [11, 5, 16] such as MS1M [20], VGGFace2 [10] and CASIA-WebFace [58]. Existing efficient FR solutions are no different, as they also require face image databases, whether for conventional training and/or knowledge distillation (KD) from teacher networks [11, 35, 5, 56, 9, 51]. Most of the recent face datasets used in the literature have been collected from the web [43]. According to [43], around 45% of face datasets have been created after 2014, and around 78% of these datasets are derived from the web. However, it is not a trivial task, and it may not be feasible, to further collect face datasets for biometric processing from the web due to legal privacy issues in some countries [8, 42, 50, 15]. Privacy regulations, such as the GDPR [50], give individuals the right to withdraw their consent to store or use their private data [38, 50], a process that can practically be very challenging when a database is widely distributed, which puts the privacy rights of individuals in jeopardy. Following such concerns, datasets such as VGGFace2 [10] and CASIA-WebFace [58] are no longer publicly accessible in many countries. Companies like Facebook have announced that they will shut down their FR system due to such privacy concerns [41]. This motivated several recent works to explore the potential of using synthetically generated face data [8, 42, 44, 15].

This work presents contributions towards enhancing the compactness of FR models while maintaining high accuracy in a privacy-friendly process that does not require real face databases. Towards that, this work is, to the best of our knowledge, the first to regulate the FR computational cost by applying quantization-aware training. We additionally propose a training paradigm involving KD that uses synthetically generated face data to fine-tune the quantized model and adapt the quantization operator parameters. We empirically prove the success of the proposed approach in reducing the bit bandwidth of the evaluated models by up to 5x while maintaining, to a large degree, the model verification performance. Additionally, the use of synthetic data within the proposed training paradigm proved to be highly effective and produced quantized models that performed competitively with models quantized using real data, even outperforming them in many experimental setups. Our quantized models based on synthetic data also outperformed larger full-precision models with higher memory footprints, as will be demonstrated later in this work.

II Related works

There is a large body of work proposing efficient FR models based on compact convolutional building blocks [11, 35, 5, 56]. Such efficient FR models followed those of efficient deep image classification [49, 45, 34], which in turn evolved from the depthwise separable convolution [13]. Additionally, efficient FR models [11, 35, 5, 56] opted to replace the fully connected (FC) layer on top of the CNN with a global depthwise convolution to reduce the large number of parameters caused by the FC layer. MobileFaceNets [11] is a popular network architecture that has been widely adopted in different compact FR solutions [33, 17]. MobileFaceNets contains around one million trainable parameters with 443M FLOPs. The MobileFaceNets architecture is based on the residual bottlenecks proposed by MobileNetV2 [45] and depthwise separable convolution layers. VarGFaceNet [56] deployed the variable group convolutional network proposed by VarGNet [60] to design a compact FR model with 5M trainable parameters. ShuffleFaceNet [35] is a compact FR model based on ShuffleNetV2 [34]. MixFaceNets [5] proposed a family of efficient FR models by extending the MixConv block [49] with a channel shuffle operation [34], aiming at increasing the discriminative ability of MixFaceNets.

An alternative to previous handcrafted DNN designs is utilizing NAS, KD, or a combination of different model compression techniques. KD transfers the acquired knowledge learned by a larger network to a smaller one [22, 26]. KD has shown great success in improving the verification performance of compact FR models [9, 56, 7]. Furthermore, the combination of KD with other model compression techniques such as compact model design [56], or NAS [51, 9] demonstrated very promising accuracies in FR.

One must note that all the discussed FR models are built with FP single-precision floating-point weights, and none has adopted model quantization techniques. Additionally, all these works required the privacy-sensitive use of real face data, either during their conventional training or during the KD from a larger pre-trained network. This stresses the main contributions of this work, i.e., the use of unlabeled synthetic data in the proposed quantization paradigm.

III Methodology

This work presents a privacy-friendly framework to minimize the computational complexity of deep learning-based FR models. Towards that, given an FR model in FP floating-point format, we propose to quantize the weights and activations to a b-bit precision format through uniform quantization-aware training. The quantization process commonly requires fine-tuning/retraining the quantized model to adjust the quantization operator parameters and recover the model accuracy after quantization. Quantization usually requires the original model training data, or part of it, for this calibration or fine-tuning [28]. This original data may not be accessible after training the model due to restrictions related to privacy, security, data ownership, and data availability (e.g., a model shared by a data owner with a third party). To mitigate this challenge and promote privacy-aware solutions, our proposed framework utilizes synthetically generated face data to fine-tune and calibrate the quantized model. Moreover, to maintain the performance and enable the use of unlabeled synthetic data, the proposed framework combines the quantization process with KD during the fine-tuning phase. This step enables fine-tuning the quantized model without accessing the real training dataset or any prior assumption about the training dataset labels. Figure 1 presents an overview of the proposed framework.

This section first presents the quantization process applied to the FP floating-point FR model. Then, it presents the training paradigm utilized to fine-tune the quantized model.

III-A Model Quantization

Quantization involves two main operators: Quantize and Dequantize. Let $x$ be a real value and $b$ be the bit width of a low-precision format. $x_{min}$ and $x_{max}$ are the minimum and maximum values of $x$. A total of $2^b$ integer values can be represented using a b-bit format, and the value range of a b-bit signed integer precision format is $[-2^{b-1}, 2^{b-1}-1]$. A real value of an FP model in this work refers to a 32-bit single-precision floating-point value. Quantization maps a real value $x$ to lie within the value range of the low-precision b-bit format. The quantization operator consists of two processes: a value transformation and a clipping process. Formally, the transformation process that maps $x$ into $x_q$ can be defined as follows:

$x_q = \mathrm{round}(x/s) + z$   (1)

where $\mathrm{round}(\cdot)$ is the rounding method that defines the rounding step in which a value is rounded up or down to an integer. $z$ is a constant parameter (zero-point) of the same type as the quantized value; it represents the quantized value corresponding to the real zero value. $s$ is a real-valued scaling factor that divides the real value range into $2^b - 1$ fractions. In asymmetric quantization ($x_{min} \neq -x_{max}$), the scaling factor and zero-point are defined as follows [28]:

$s = \dfrac{x_{max} - x_{min}}{2^b - 1}$   (2)

$z = q_{min} - \mathrm{round}\left(\dfrac{x_{min}}{s}\right)$   (3)

The clipping process clips $x_q$ (the output of Equation 1) to lie within the range $[q_{min}, q_{max}]$. The clip operation can be defined as follows:

$\mathrm{clip}(x_q; q_{min}, q_{max}) = \min(\max(x_q, q_{min}), q_{max})$   (4)

where $q_{min} = -2^{b-1}$ and $q_{max} = 2^{b-1}-1$ are the minimum and maximum values of the quantization value range. The quantization of a real value $x$ to a b-bit precision format is thus given by:

$Q(x) = \mathrm{clip}\left(\mathrm{round}(x/s) + z;\; q_{min}, q_{max}\right)$   (5)

The dequantization operation that approximates the real value from the quantized one is defined as follows:

$\hat{x} = s \cdot (x_q - z)$   (6)
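The following is a minimal PyTorch sketch of the asymmetric quantize/dequantize pair described by Equations (1)-(6), as reconstructed above; the function names and the per-tensor usage are our own illustration, not the paper's implementation.

```python
import torch

def asymmetric_qparams(x_min: float, x_max: float, b: int = 8):
    """Scale s and zero-point z for asymmetric b-bit quantization (Eqs. 2-3)."""
    q_min, q_max = -2 ** (b - 1), 2 ** (b - 1) - 1   # signed b-bit integer range
    s = (x_max - x_min) / (2 ** b - 1)               # real range split into 2^b - 1 steps
    z = q_min - round(x_min / s)                     # integer mapped to the real zero
    return s, z, q_min, q_max

def quantize(x: torch.Tensor, s: float, z: int, q_min: int, q_max: int) -> torch.Tensor:
    """Transformation followed by clipping (Eqs. 1, 4, 5)."""
    return torch.clamp(torch.round(x / s) + z, q_min, q_max)

def dequantize(x_q: torch.Tensor, s: float, z: int) -> torch.Tensor:
    """Approximate the real value from the quantized one (Eq. 6)."""
    return s * (x_q - z)

x = torch.randn(5)
s, z, q_min, q_max = asymmetric_qparams(x.min().item(), x.max().item(), b=8)
x_q = quantize(x, s, z, q_min, q_max)
print(x, dequantize(x_q, s, z))   # the round trip should stay close to x
```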

Tensor Quantization Granularity

Tensor quantization granularity defines how the quantization operator parameters are calculated and shared among the model tensors [18]. Quantization granularity [18] can be categorized into three groups: per-layer [30], per-group of channels [48], and per-channel [25, 28, 30, 59]. In this work, we opt for the popular choice of per-channel granularity, as it provides a high quantization resolution and has repeatedly led to high accuracy [25, 28, 30, 59]. Per-channel granularity computes separate quantization parameters for each channel, independently of the other channels, as sketched below.
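A short sketch of per-channel parameter computation for a convolutional weight tensor, assuming an (out_channels, in_channels, kH, kW) layout; this is our illustration of the granularity choice, not the exact procedure used in QuantFace.

```python
import torch

def per_channel_qparams(weight: torch.Tensor, b: int = 8):
    """One (scale, zero-point) pair per output channel of a conv weight tensor."""
    w = weight.flatten(start_dim=1)                   # one row per output channel
    w_min, w_max = w.min(dim=1).values, w.max(dim=1).values
    q_min = -2 ** (b - 1)
    s = (w_max - w_min) / (2 ** b - 1)                # per-channel scale
    z = q_min - torch.round(w_min / s)                # per-channel zero-point
    return s, z

weight = torch.randn(64, 32, 3, 3)                    # example conv weight
s, z = per_channel_qparams(weight)
print(s.shape, z.shape)                               # torch.Size([64]) each
```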

| Model (params) | Quantization data | Bits | Size (MB) | LFW | CFP-FP | AgeDB-30 | CALFW | CPLFW | IJB-C | IJB-B |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet100 (65.2M) | - | FP32 | 261.22 | 99.83 | 98.40 | 98.33 | 96.13 | 93.22 | 96.50 | 95.25 |
| ResNet100 (65.2M) | Real data | w8a8 | 65.31 | 99.80 | 98.31 | 98.13 | 96.05 | 92.92 | 96.38 | 95.13 |
| ResNet100 (65.2M) | Synthetic data | w8a8 | 65.31 | 99.80 | 98.14 | 97.95 | 96.02 | 92.90 | 96.09 | 94.74 |
| ResNet100 (65.2M) | Real data | w6a6 | 49.01 | 99.55 | 89.14 | 95.85 | 95.42 | 85.63 | 85.80 | 84.08 |
| ResNet100 (65.2M) | Synthetic data | w6a6 | 49.01 | 99.45 | 91.00 | 96.43 | 95.58 | 86.60 | 87.00 | 85.06 |
| ResNet50 (43.6M) | - | FP32 | 174.68 | 99.80 | 98.01 | 98.08 | 96.10 | 92.43 | 95.74 | 94.19 |
| ResNet50 (43.6M) | Real data | w8a8 | 43.67 | 99.78 | 97.70 | 98.00 | 96.00 | 92.17 | 95.66 | 94.15 |
| ResNet50 (43.6M) | Synthetic data | w8a8 | 43.67 | 99.78 | 97.43 | 97.97 | 95.87 | 92.08 | 95.18 | 93.67 |
| ResNet50 (43.6M) | Real data | w6a6 | 32.77 | 99.70 | 95.00 | 97.17 | 95.87 | 90.17 | 91.74 | 90.07 |
| ResNet50 (43.6M) | Synthetic data | w6a6 | 32.77 | 99.68 | 95.17 | 97.43 | 95.70 | 90.38 | 90.72 | 89.44 |
| ResNet18 (24.0M) | - | FP32 | 96.22 | 99.67 | 94.47 | 97.13 | 95.70 | 89.73 | 93.56 | 91.64 |
| ResNet18 (24.0M) | Real data | w8a8 | 24.10 | 99.63 | 94.46 | 97.03 | 95.72 | 89.48 | 93.56 | 91.57 |
| ResNet18 (24.0M) | Synthetic data | w8a8 | 24.10 | 99.55 | 94.04 | 97.07 | 95.58 | 89.53 | 92.87 | 91.01 |
| ResNet18 (24.0M) | Real data | w6a6 | 18.10 | 99.52 | 93.23 | 96.55 | 95.58 | 88.37 | 93.03 | 91.08 |
| ResNet18 (24.0M) | Synthetic data | w6a6 | 18.10 | 99.55 | 93.34 | 96.62 | 95.32 | 89.05 | 92.36 | 90.38 |
| MobileFaceNet (1.1M) | - | FP32 | 4.21 | 99.47 | 91.59 | 95.62 | 95.15 | 87.98 | 90.88 | 88.54 |
| MobileFaceNet (1.1M) | Real data | w8a8 | 1.10 | 99.43 | 91.40 | 95.47 | 95.05 | 87.95 | 90.57 | 88.32 |
| MobileFaceNet (1.1M) | Synthetic data | w8a8 | 1.10 | 99.35 | 90.84 | 94.37 | 94.78 | 87.73 | 89.21 | 86.98 |
| MobileFaceNet (1.1M) | Real data | w6a6 | 0.79 | 98.87 | 87.69 | 93.03 | 93.30 | 84.57 | 83.13 | 80.53 |
| MobileFaceNet (1.1M) | Synthetic data | w6a6 | 0.79 | 99.08 | 87.64 | 91.77 | 93.48 | 84.85 | 82.94 | 80.58 |
TABLE I: Bit precision and quantization data vs. performance on LFW, CFP-FP, AgeDB-30, CALFW, CPLFW (accuracy %), and IJB-C and IJB-B (TAR at FAR=1e-4). Results are reported for the FP models (32-bit) and for the models quantized to 8-bit (w8a8) and 6-bit (w6a6) using real and synthetic data. All values are rounded to two decimal places. The top verification performances under the same quantization settings (network architecture and bit bandwidth) are in bold.

III-B Quantization-Aware Training (QAT)

QAT, utilized in QuantFace, is a common approach to adjust the quantized model parameters [53, 28]. This adjustment is often required because rounding the weights of a pre-trained model often results in lower accuracy, especially if the weights and the activation functions have a wide range of values. QAT inserts fake (simulated) quantization operations in the network to emulate inference-time quantization [28]. QAT requires training or fine-tuning the model. In QAT, the forward and backward passes are usually carried out in floating-point precision, and the model weights and activations are quantized after each gradient update [28]. After each training iteration, the derivatives of the quantized network weights need to be calculated to compute the loss gradients for backpropagation. However, the gradients of the fake quantization operations are predominantly zero [31], making standard backpropagation not applicable. QuantFace follows the common QAT approaches [31, 28] and addresses this issue using the Straight-Through Estimator (STE) [4], which approximates the gradient of the fake quantization operators using a threshold: the derivatives of the fake quantization operators are set to one for inputs within the clipping range, i.e., when $x_q$ is in the range $[q_{min}, q_{max}]$, and to zero otherwise.
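A minimal sketch of a fake-quantization operator with an STE backward pass is given below, assuming the asymmetric scheme of Section III-A; this is a generic illustration of the technique, not the exact QuantFace operator.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Simulated (fake) quantization with a straight-through estimator:
    the forward pass quantizes and dequantizes; the backward pass passes the
    gradient unchanged inside the clipping range and zeroes it outside."""

    @staticmethod
    def forward(ctx, x, s, z, q_min, q_max):
        x_q = torch.round(x / s) + z
        ctx.save_for_backward((x_q >= q_min) & (x_q <= q_max))   # mask of in-range values
        x_q = torch.clamp(x_q, q_min, q_max)
        return s * (x_q - z)                                     # dequantized value used downstream

    @staticmethod
    def backward(ctx, grad_output):
        (in_range,) = ctx.saved_tensors
        return grad_output * in_range.to(grad_output.dtype), None, None, None, None

x = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1, 0, -128, 127)
y.sum().backward()
print(x.grad)   # ones where round(x/s) + z fell inside [-128, 127], zeros elsewhere
```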

III-C Training paradigm

QAT requires access to the original labeled training dataset to fine-tune the quantized model [28], which may be infeasible due to privacy concerns [12]. Very recently, a number of works proposed to fine-tune a quantized model with data generated from conditional generative models [27, 55]. However, these works [27, 55] required generating data with labels that match the class labels used to train the FP models. Such knowledge of training class labels is often unobtainable, given only a pre-trained FR model. To solve this, we propose a solution that utilizes unlabeled synthetic data along with a KD-based training paradigm that waives the requirement of training data labels. Unlike conventional KD [22], which optimizes the classification output (and thus requires labeled data), the utilized KD-based training paradigm optimizes the feature representation needed for biometric verification, i.e., given a batch of unlabeled face images, the quantized model is fine-tuned to learn to produce feature representations similar to those learned by the full-precision model, as will be detailed in this section.

We propose to use synthetically generated data from a Generative Adversarial Network (GAN) [19, 29] to fine-tune the quantized model. We sample a noise vector $z$ from a Gaussian distribution $\mathcal{N}(0,1)$ and feed it into a pretrained generator $G$ to generate unlabeled synthetic data $x_{syn}$, as shown in Figure 1. Formally, the synthetically generated data is obtained by:

$x_{syn} = G(z), \quad z \sim \mathcal{N}(0,1)$   (7)

Then, the synthetically generated data is aligned and cropped (see Section IV). This data is unlabeled, as the random noise $z$ produces random identities unrelated to those used to train the FP model. Thus, the model cannot be directly fine-tuned with the generated data using a classification loss. Given this restriction, we propose in this work to fine-tune the quantized model using KD from the FP model. Specifically, the quantized model is trained to learn feature embeddings similar to the ones from the FP model in the normalized embedding space. During the fine-tuning phase, a batch of $N$ unlabeled synthetic images is sampled and fed into the quantized and full-precision models to obtain feature embeddings $f_q$ and $f_{fp}$, respectively. Then, $f_q$ and $f_{fp}$ are normalized ($\hat{f}_q$, $\hat{f}_{fp}$) and used to compute the loss based on the cosine distance between the normalized features. Finally, the gradient of the loss function is computed and used to update the weight parameters. Different from supervised classification losses, e.g., cross-entropy, or conventional KD losses [22] that require class labels for calculating the loss value, the loss is calculated on the feature embedding layers. Thus, it mitigates the need for identity labels for the input training images. Formally, the loss is defined as follows:

$\mathcal{L}_{KD} = \dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\left(1 - \hat{f}_{fp}^{\,i} \cdot \hat{f}_{q}^{\,i}\right)$   (8)
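A minimal sketch of this feature-level KD loss, following our reading of Equation (8) as the cosine distance between L2-normalized embeddings averaged over the batch; tensor names are our own.

```python
import torch
import torch.nn.functional as F

def feature_kd_loss(emb_q: torch.Tensor, emb_fp: torch.Tensor) -> torch.Tensor:
    """Cosine distance between L2-normalized embeddings of the quantized (student)
    and FP (teacher) models, averaged over the batch."""
    f_q = F.normalize(emb_q, dim=1)     # normalized embeddings of the quantized model
    f_fp = F.normalize(emb_fp, dim=1)   # normalized embeddings of the FP model
    return (1.0 - (f_q * f_fp).sum(dim=1)).mean()

emb_q, emb_fp = torch.randn(8, 512), torch.randn(8, 512)
print(feature_kd_loss(emb_q, emb_fp))   # identical embeddings would give 0
```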

Using synthetic data to train an FR model might lead to sub-optimal verification performance [42] due to a possible domain gap between the real and synthetic data [54, 46, 32], especially if the model is trained from scratch to learn identity representations [42], which might require conducting domain adaptation, e.g., adversarial domain adaptation [54, 42], to reduce such an effect. However, our goal in this work is not to learn identity representations from scratch by optimizing a classification loss, e.g., cross-entropy. Rather, we fine-tune the pretrained model with synthetic data to adjust the model weights and quantization parameters after applying the quantization process. Thus, by utilizing the proposed training paradigm, we restrict our use of the synthetic data to ensuring that the response of the pretrained quantized model to an input during the fine-tuning phase is similar to the response of the full-precision model to the same input. We demonstrate that this process is not largely affected by the potential domain gap between the real and synthetic data by (1) comparing the model responses to real and synthetic data, i.e., the activation function value ranges of two models fine-tuned with real and synthetic data, respectively (Figure 2), and by (2) comparing the evaluation results of these models on real data benchmarks (details in Section V-C).

IV Experimental setup

This section presents the baseline models with implementation details, model quantization implementation details, and evaluation benchmarks used in this work.

IV-A Baselines

The FP models in this work are trained with the ArcFace loss [16] on the MS1MV2 dataset [20, 16]. MS1MV2 is a refined version of MS-Celeb-1M [20] by [16] containing 5.8M images of 85K identities. The baseline backbones are ResNet100 [21, 16], ResNet50 [21, 16], ResNet18 [21, 16], and MobileFaceNet [11]. We follow the training settings of [16] and set the scale parameter to 64 and the margin to 0.5. We set the mini-batch size to 512 and trained the presented models on one Linux machine (Ubuntu 20.04.2 LTS) with an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 512GB RAM, and four Nvidia GeForce RTX-6000 GPUs. The FP models and the quantization operators in this paper are implemented using PyTorch [40]. All models are trained with the Stochastic Gradient Descent (SGD) optimizer. We used random horizontal flipping with a probability of 0.5 for data augmentation during training. We set the momentum to 0.9 and the weight decay to 5e-4. All the images in the evaluation and training datasets are aligned and cropped to 112x112 pixels, as described in [16]. The initial learning rate of the FP models is 0.1, and it is divided by 10 at 100K and 160K training iterations, following [16]. The training is stopped after 180K iterations.
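For illustration, the baseline optimizer and learning-rate schedule described above could be set up in PyTorch roughly as follows; the backbone is a placeholder module, since the actual architectures and the ArcFace head are not reproduced here.

```python
import torch

backbone = torch.nn.Linear(512, 512)   # placeholder standing in for the actual FR backbone
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# Learning rate divided by 10 at 100K and 160K iterations (scheduler.step() once per iteration);
# training stops after 180K iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100_000, 160_000], gamma=0.1)
```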

IV-B Quantization implementation details

We quantize the weights and activations of all the baseline FP models to two precision formats: 6-bit and 8-bit. We report the results of the quantized models under two settings. First, the quantized models are fine-tuned and calibrated with the original (real) training data, MS1MV2 [20, 16] (described in Section IV-A). Second, the quantized models are fine-tuned and calibrated with the synthetically generated data. We utilized the official open-source implementation of StyleGAN2-ADA (https://github.com/NVlabs/stylegan2-ada) to randomly generate 0.5M synthetic face images. These images are then cropped and aligned using the method described in Section IV-A. In both settings, the quantized models are fine-tuned for 11K iterations with a learning rate of 1e-4.
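Putting the pieces together, one synthetic-data fine-tuning step might look schematically as follows; the generator and the two recognition models are stand-in modules with assumed interfaces (the real generator is StyleGAN2-ADA and the backbones are those of Section IV-A), and feature_kd_loss refers to the loss sketch in Section III-C.

```python
import torch

# Stand-in modules with assumed shapes; not the actual generator or backbones.
generator = torch.nn.Linear(512, 3 * 112 * 112)
fp_model = torch.nn.Linear(3 * 112 * 112, 512)
quant_model = torch.nn.Linear(3 * 112 * 112, 512)

optimizer = torch.optim.SGD(quant_model.parameters(), lr=1e-4)

for step in range(11_000):                     # 11K fine-tuning iterations
    z = torch.randn(32, 512)                   # batch of Gaussian noise vectors (illustrative batch size)
    with torch.no_grad():
        x_syn = generator(z)                   # unlabeled synthetic face batch
        emb_fp = fp_model(x_syn)               # frozen FP (teacher) embeddings
    emb_q = quant_model(x_syn)                 # quantized (student) embeddings
    loss = feature_kd_loss(emb_q, emb_fp)      # cosine-distance KD loss (see Sec. III-C sketch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```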

Fig. 2: Correlation between the activation function (Act) value ranges of ResNet100 quantized using real data (solid orange) and synthetic data (dashed black), for (a) 6-bit and (b) 8-bit fixed-precision quantization. The y-axis represents the depth of the backbone activation function, e.g., depth 1 is the first activation function. Each line in the plot represents the value range of the activation function, where the start point is the minimum activation value and the end point is the maximum activation value. The high correlation indicates that the quantized model is able to capture sufficient data information from the synthetic data, in comparison to real data.

IV-C Evaluation benchmarks and metrics

The evaluation results of the FP and quantized models are reported on seven mainstream benchmarks: Labeled Faces in the Wild (LFW) [24], AgeDB-30 [39], Celebrities in Frontal-Profile in the Wild (CFP-FP) [47], Cross-Age LFW (CALFW) [62], Cross-Pose LFW (CPLFW) [61], and the IARPA Janus Benchmarks C and B (IJB-C) [37] and (IJB-B) [52]. We follow the evaluation metrics defined in the utilized benchmarks: LFW (accuracy), CALFW (accuracy), CPLFW (accuracy), CFP-FP (accuracy), AgeDB-30 (accuracy), and IJB-C and IJB-B (true acceptance rate at a false acceptance rate of 1e-4, noted as TAR at FAR=1e-4).

Fig. 3: Model size (in MB) vs. performance on (a) LFW (accuracy), (b) CFP-FP (accuracy), (c) AgeDB-30 (accuracy), (d) CALFW (accuracy), (e) CPLFW (accuracy), (f) IJB-C (TAR at FAR=1e-4), and (g) IJB-B (TAR at FAR=1e-4). The FP models are marked with crosses. The 8-bit and 6-bit models quantized using synthetic data are marked with full circles and stars, respectively. The models, after quantization, only lose marginal performance while their sizes are significantly reduced. Quantized models (e.g., ResNet100 (w8a8), blue circle) in most cases outperform larger full-precision models (e.g., ResNet50 (green cross) and ResNet18 (red cross)).

V Results

Table I presents the FR performance achieved by the FP models (ResNet100, ResNet50, ResNet18, and MobileFaceNet), along with that achieved by the models quantized to 8-bit weights and 8-bit activations (noted as w8a8) and to 6-bit weights and 6-bit activations (noted as w6a6) using real or synthetic quantization data. The results are grouped by network architecture. In each group of rows, the results are first presented for the FP model (baseline), followed by the quantized models. The size (in MB) of the FP models is approximately four times the number of parameters (in millions), i.e., each parameter requires 4 bytes. In both the real and synthetic data quantization settings, the reductions in bit bandwidth, and thus in model size, using w8a8 and w6a6 are around 4x and 5.3x, respectively. Model quantization also enables performance gains in inference speed and memory bandwidth. However, the exact inference speed and memory bandwidth depend on the underlying hardware and deep learning framework, as discussed in Section I. Therefore, the results presented in this section are discussed as a trade-off between FR performance and bit bandwidth, and thus model size. Figure 3 presents the trade-off between the model size (in MB) and the verification performance achieved by the FP floating-point 32 models (FP32) and their respective quantized models using synthetically generated data. The model with the best trade-off between verification performance and model size tends to be towards the top left of each plot in Figure 3. The following observations can be made based on the results in Table I:

V-A Impact of 8-bit bandwidth quantization

When the models are quantized to 8-bit (w8a8 setting), the achieved verification performances in all experimental settings are only slightly degraded, while the bit bandwidth is significantly reduced (around 4x). For example, the accuracy achieved by ResNet100 (261.22 MB) on AgeDB-30 is 98.33%. This accuracy slightly dropped to 98.13% and 97.95% when ResNet100 is quantized to 8-bit (65.31 MB) and fine-tuned with real and synthetic data, respectively. Similar observations can be made when all considered models are quantized to 8-bit. Another important observation can be drawn from the achieved results: when ResNet100 is quantized to 8-bit and fine-tuned with synthetic data (65.31 MB), it significantly outperformed the FP ResNet18 (96.22 MB) on all considered benchmarks while having around 30% less model size. Impressively, the ResNet100 quantized to 8-bit also outperformed the FP ResNet50 on most benchmarks while being more than 60% smaller. This observation can be visually seen by comparing the (x, y) positions of the blue circle (ResNet100 (w8a8)) and green cross (ResNet50) marks in the trade-off plots of Figure 3. On large-scale evaluation benchmarks, the TAR at FAR=1e-4 achieved on IJB-C by the models quantized using real and synthetic data is very competitive with the FP model. For example, the FP ResNet18 achieved 93.56% TAR at FAR=1e-4, and the models quantized with real and synthetic data achieved 93.56% and 92.87%, respectively.

V-B Impact of 6-bit bandwidth quantization

Using 6-bit bandwidth, the results achieved by the quantized models are, as expected, lower than those achieved with 8-bit quantization. However, the reduction in model size is also significantly larger than with 8-bit quantization, and 6-bit bandwidth can still achieve results competitive with the FP model for use-cases with extremely limited computational resources. Moving below 6-bit bandwidth, e.g., to 4-bit, our experiments showed that none of the considered models converged during the fine-tuning process.

V-C Impact of quantization data source

The quantized models fine-tuned with synthetic data achieved results very competitive with those fine-tuned with real data, and in many cases the models fine-tuned with synthetic data even outperformed the ones quantized with real data. This is especially true for the 6-bit bandwidth, as shown in Table I. For example, using 6-bit bandwidth, the accuracies achieved on LFW by ResNet100 quantized with real and synthetic data are 99.55% and 99.45%, respectively. To illustrate the correlation between the models quantized using real and synthetic data, Figure 2 presents the activation function value ranges of the models quantized using real and synthetic data. The high correlation between these value ranges can be noticed from their overlap. This indicates that the quantized model is able to capture sufficient data information from the synthetic data to match the FP model output.

VI Conclusion

This work is the first to explore the potential of regulating the computational cost of existing deep face recognition models using low-bit format model quantization in a privacy-friendly process. In particular, once the model is quantized, synthetically generated face data from an unconditional GAN is fed into the FP and quantized models. Then, the proposed training paradigm matches the feature embeddings of the FP and quantized models in a normalized embedding space. The reported results point out the effectiveness of the presented approach in regulating the computational cost of face recognition models without accessing the original training data or any prior knowledge about the actual data used to train the FP model.

Acknowledgment

This research work has been funded by the German Federal Ministry of Education and Research and the Hessen State Ministry for Higher Education, Research and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE. This work has been partially funded by the German Federal Ministry of Education and Research (BMBF) through the Software Campus Project.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283. Cited by: §I.
  • [2] (2019) Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada. External Links: Link Cited by: 3.
  • [3] R. Banner, Y. Nahshan, and D. Soudry (2019) Post training 4-bit quantization of convolutional networks for rapid-deployment. See 2, pp. 7948–7956. External Links: Link Cited by: §I.
  • [4] Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432. External Links: Link, 1308.3432 Cited by: §III-B.
  • [5] F. Boutros, N. Damer, M. Fang, F. Kirchbuchner, and A. Kuijper (2021) MixFaceNets: extremely efficient face recognition networks. In International IEEE Joint Conference on Biometrics, IJCB 2021, Shenzhen, China, August 4-7, 2021, pp. 1–8. External Links: Link, Document Cited by: §I, §I, §II.
  • [6] F. Boutros, N. Damer, F. Kirchbuchner, and A. Kuijper (2022-06) ElasticFace: elastic margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1578–1587. Cited by: §I.
  • [7] F. Boutros, N. Damer, K. Raja, F. Kirchbuchner, and A. Kuijper (2022) Template-driven knowledge distillation for compact and accurate periocular biometrics deep-learning models. Sensors 22 (5). External Links: Link, ISSN 1424-8220 Cited by: §II.
  • [8] F. Boutros, M. Huber, P. Siebke, T. Rieber, and N. Damer (2022) SFace: privacy-friendly and accurate face recognition using synthetic data. Cited by: §I.
  • [9] F. Boutros, P. Siebke, M. Klemt, N. Damer, F. Kirchbuchner, and A. Kuijper (2022) PocketNet: extreme lightweight face recognition network using neural architecture search and multistep knowledge distillation. IEEE Access 10 (), pp. 46823–46833. External Links: Document Cited by: §I, §I, §II.
  • [10] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) VGGFace2: A dataset for recognising faces across pose and age. In 13th IEEE International Conference on Automatic Face & Gesture Recognition, FG 2018, Xi’an, China, May 15-19, 2018, pp. 67–74. External Links: Link, Document Cited by: §I, §I, §I.
  • [11] S. Chen, Y. Liu, X. Gao, and Z. Han (2018) MobileFaceNets: efficient cnns for accurate real-time face verification on mobile devices. In CCBR 2018, Urumqi, China, August 11-12, 2018, Proceedings, Lecture Notes in Computer Science, Vol. 10996, pp. 428–438. External Links: Link, Document Cited by: §I, §I, §II, §IV-A.
  • [12] Y. Choi, J. P. Choi, M. El-Khamy, and J. Lee (2020) Data-free network quantization with adversarial knowledge distillation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020, pp. 3047–3057. External Links: Document Cited by: §III-C.
  • [13] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1800–1807. External Links: Document Cited by: §II.
  • [14] Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev (2019) Low-bit quantization of neural networks for efficient inference. In 2019 IEEE/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 3009–3018. External Links: Link, Document Cited by: §I.
  • [15] N. Damer, C. A. F. López, M. Fang, N. Spiller, M. V. Pham, and F. Boutros (2022-06) Privacy-friendly synthetic data for the development of face morphing attack detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 1606–1617. Cited by: §I.
  • [16] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 4690–4699. External Links: Document Cited by: §I, §I, §I, §I, §IV-A, §IV-B.
  • [17] J. Deng, J. Guo, D. Zhang, Y. Deng, X. Lu, and S. Shi (2019) Lightweight face recognition challenge. In 2019 IEEE/CVF ICCV, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 2638–2646. External Links: Document Cited by: §I, §II.
  • [18] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer (2021) A survey of quantization methods for efficient neural network inference. CoRR abs/2103.13630. External Links: Link, 2103.13630 Cited by: §III-A.
  • [19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 2672–2680. Cited by: §III-C.
  • [20] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) MS-celeb-1m: A dataset and benchmark for large-scale face recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, Lecture Notes in Computer Science, Vol. 9907, pp. 87–102. External Links: Document Cited by: §I, §IV-A, §IV-B.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. External Links: Link, Document Cited by: §I, §I, §I, §IV-A.
  • [22] G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. External Links: Link, 1503.02531 Cited by: §II, §III-C.
  • [23] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 7132–7141. External Links: Document Cited by: §I, §I.
  • [24] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller (2007-10) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §IV-C.
  • [25] Q. Huang, D. Wang, Z. Dong, Y. Gao, Y. Cai, T. Li, B. Wu, K. Keutzer, and J. Wawrzynek (2021) CoDeNet: efficient deployment of input-adaptive object detection on embedded fpgas. In FPGA ’21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28 - March 2, 2021, pp. 206–216. External Links: Document Cited by: §III-A.
  • [26] M. Huber, F. Boutros, F. Kirchbuchner, and N. Damer (2021) Mask-invariant face recognition through template-level knowledge distillation. In 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2021), Vol. , pp. 1–8. External Links: Document Cited by: §II.
  • [27] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713. Cited by: §I, §III-C.
  • [28] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. G. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 2704–2713. External Links: Document Cited by: §III-A, §III-A, §III-B, §III-C, §III.
  • [29] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Cited by: §III-C.
  • [30] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: A whitepaper. CoRR abs/1806.08342. External Links: Link, 1806.08342 Cited by: §III-A.
  • [31] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342. Cited by: §I, §III-B.
  • [32] S. Lee, E. Park, H. Yi, and S. H. Lee (2020) StRDAN: synthetic-to-real domain adaptation network for vehicle re-identification. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2020, Seattle, WA, USA, June 14-19, 2020, pp. 2590–2597. External Links: Document Cited by: §III-C.
  • [33] X. Li, F. Wang, Q. Hu, and C. Leng (2019) AirFace: lightweight and efficient model for face recognition. In 2019 IEEE/CVF ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 2678–2682. External Links: Link, Document Cited by: §II.
  • [34] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In ECCV 2018, Munich, Germany, September 8-14, 2018, Proceedings, Part XIV, Lecture Notes in Computer Science, Vol. 11218, pp. 122–138. External Links: Document Cited by: §I, §II.
  • [35] Y. Martínez-Díaz, L. S. Luevano, H. M. Vazquez, M. Nicolás-Díaz, L. Chang, and M. González-Mendoza (2019) ShuffleFaceNet: A lightweight face architecture for efficient and highly-accurate face recognition. In 2019 IEEE/CVF ICCVW, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 2721–2728. External Links: Document Cited by: §I, §I, §II.
  • [36] Y. Martínez-Díaz, M. Nicolás-Díaz, H. Méndez-Vázquez, L. S. Luevano, L. Chang, M. Gonzalez-Mendoza, and L. E. Sucar (2021) Benchmarking lightweight face architectures on specific face recognition scenarios. Artificial Intelligence Review, pp. 1–44. Cited by: §I.
  • [37] B. Maze, J. C. Adams, J. A. Duncan, N. D. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, and P. Grother (2018) IARPA janus benchmark - C: face dataset and protocol. In 2018 International Conference on Biometrics, ICB 2018, Gold Coast, Australia, February 20-23, 2018, pp. 158–165. External Links: Document Cited by: §IV-C.
  • [38] B. Meden, P. Rot, P. Terhörst, N. Damer, A. Kuijper, W. J. Scheirer, A. Ross, P. Peer, and V. Struc (2021) Privacy-enhancing face biometrics: A comprehensive survey. IEEE Trans. Inf. Forensics Secur. 16, pp. 4147–4183. External Links: Document Cited by: §I.
  • [39] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou (2017) AgeDB: the first manually collected, in-the-wild age database. In 2017 IEEE CVPRW, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1997–2005. External Links: Document Cited by: §IV-C.
  • [40] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: §I, §IV-A.
  • [41] J. Pesenti (2021)(Website) External Links: Link Cited by: §I.
  • [42] H. Qiu, B. Yu, D. Gong, Z. Li, W. Liu, and D. Tao (2021-10) SynFace: face recognition with synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10880–10890. Cited by: §I, §III-C.
  • [43] I. D. Raji and G. Fried (2021) About face: A survey of facial recognition evaluation. CoRR abs/2102.00813. External Links: Link, 2102.00813 Cited by: §I.
  • [44] C. Rong, X. Zhang, and Y. Lin (2020) Feature-improving generative adversarial network for face frontalization. IEEE Access 8, pp. 68842–68851. External Links: Link, Document Cited by: §I.
  • [45] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4510–4520. External Links: Document Cited by: §I, §II.
  • [46] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Lim, and R. Chellappa (2018) Learning from synthetic data: addressing domain shift for semantic segmentation. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 3752–3761. External Links: Document Cited by: §III-C.
  • [47] S. Sengupta, J. Chen, C. D. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs (2016) Frontal to profile face verification in the wild. In 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pp. 1–9. External Links: Document Cited by: §IV-C.
  • [48] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) Q-BERT: hessian based ultra low precision quantization of BERT. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 8815–8821. Cited by: §III-A.
  • [49] M. Tan and Q. V. Le (2019) MixConv: mixed depthwise convolutional kernels. In 30th British Machine Vision Conference 2019, BMVC 2019, Cardiff, UK, September 9-12, 2019, pp. 74. Cited by: §I, §II.
  • [50] P. Voigt and A. v. d. Bussche (2017) The eu general data protection regulation (gdpr): a practical guide. 1st edition, Springer. External Links: ISBN 3319579584, 9783319579580 Cited by: §I.
  • [51] X. Wang (2021) Teacher guided neural architecture search for face recognition. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 2817–2825. Cited by: §I, §I, §II.
  • [52] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. C. Adams, T. Miller, N. D. Kalka, A. K. Jain, J. A. Duncan, K. Allen, J. Cheney, and P. Grother (2017) IARPA janus benchmark-b face dataset. In 2017 IEEE CVPRW, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 592–600. External Links: Link, Document Cited by: §IV-C.
  • [53] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng (2016) Quantized convolutional neural networks for mobile devices. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 4820–4828. External Links: Document Cited by: §III-B.
  • [54] M. Xu, J. Zhang, B. Ni, T. Li, C. Wang, Q. Tian, and W. Zhang (2020) Adversarial domain adaptation with domain mixup. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 6502–6509. Cited by: §III-C.
  • [55] S. Xu, H. Li, B. Zhuang, J. Liu, J. Cao, C. Liang, and M. Tan (2020) Generative low-bitwidth data free quantization. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XII, pp. 1–17. External Links: Document Cited by: §III-C.
  • [56] M. Yan, M. Zhao, Z. Xu, Q. Zhang, G. Wang, and Z. Su (2019) VarGFaceNet: an efficient variable group convolutional neural network for lightweight face recognition. In 2019 IEEE/CVF ICCVW, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019, pp. 2647–2654. External Links: Link, Document Cited by: §I, §I, §II, §II.
  • [57] Z. Yao, Z. Dong, Z. Zheng, A. Gholami, J. Yu, E. Tan, L. Wang, Q. Huang, Y. Wang, M. W. Mahoney, and K. Keutzer (2021) HAWQ-V3: dyadic neural network quantization. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, pp. 11875–11886. External Links: Link Cited by: §I.
  • [58] D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014) Learning face representation from scratch. CoRR abs/1411.7923. External Links: Link, 1411.7923 Cited by: §I.
  • [59] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) LQ-nets: learned quantization for highly accurate and compact deep neural networks. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, pp. 373–390. External Links: Document Cited by: §III-A.
  • [60] Q. Zhang, J. Li, M. Yao, L. Song, H. Zhou, Z. Li, W. Meng, X. Zhang, and G. Wang (2019) VarGNet: variable group convolutional neural network for efficient embedded computing. CoRR abs/1907.05653. External Links: Link, 1907.05653 Cited by: §I, §II.
  • [61] T. Zheng and W. Deng (2018-02) Cross-pose lfw: a database for studying cross-pose face recognition in unconstrained environments. Technical report Technical Report 18-01, Beijing University of Posts and Telecommunications. Cited by: §IV-C.
  • [62] T. Zheng, W. Deng, and J. Hu (2017) Cross-age LFW: A database for studying cross-age face recognition in unconstrained environments. CoRR abs/1708.08197. External Links: Link, 1708.08197 Cited by: §IV-C.