Face recognition has recently matured and achieved high accuracy against millions of identities [Xu_TIP2015, Xu_IJCB2011]. A face recognition system is often designed in two main stages, i.e. feature extraction and feature comparison. Feature extraction plays the more important role since it directly determines the robustness of the engine. This operator defines an embedding process that maps input facial images into a higher-level latent space where the embedded features extracted from photos of the same subject are distributed within a small margin [Luu_IJCB2011]. Moreover, since most face recognition engines are deployed in a blackbox mode to protect their technologies [Le_JPR2015], there is no apparent technique to invert that embedding process and reconstruct the faces of a subject given only the features extracted by those engines.
| ||Ours||NBNet [mai2018reconstruction]||SynNormFace [Cole_2017_CVPR]||IFaceRec [Zhmoginov2016InvertingFE]||INVREP [conf/cvpr/MahendranV15]|
|Input||Feat||Feat||Feat||Feat + Img||Feat|
Some blackbox adversarial attack approaches [pmlr-v80-ilyas18a, ilyas2018prior, Thys_2019_CVPR_Workshops] have partially addressed this task by analyzing the gradients of the classifier’s outputs to generate adversarial examples that mislead the behaviour of that classifier. However, they only focus on a closed-set problem where the output classes are predefined. Moreover, their goal is to generate imperceptible perturbations that are added to the given input signal. Other methods [pmlr-v80-athalye18b, Cole_2017_CVPR, Dosovitskiy_2016_CVPR, NIPS2018_8052, Zhmoginov2016InvertingFE] have also been introduced in the literature but still require access to the classifier structure, i.e. a whitebox setting. Meanwhile, our goal is a more challenging reconstruction task with a blackbox face recognition engine. Firstly, this process reconstructs faces from scratch without any hint from input images. Secondly, in a blackbox setting, there is no information about the engine’s structure and, therefore, knowledge from the inverse mapping process (i.e. back-propagation) cannot be exploited directly. Thirdly, the embedded features from a face recognition engine are for an open-set problem where no label information is available. More importantly, the subjects to be reconstructed may never have been seen during the training process of the face recognition engine. In the scope of this work, we assume that the face recognition engines are primarily built on Convolutional Neural Networks (CNNs), which dominate recent state-of-the-art results in face recognition [deng2019arcface, duong2019shrinkteanet, duong2018mobiface, LiuNIPS18, liu2017sphereface, schroff2015facenet, wang2018cosface, wen2016discriminative, zhang2019adacos]. We also assume that there is no further post-processing after the CNN feature extraction step. We then develop a theory to guarantee the reconstruction robustness of the proposed method.
Contributions. This paper presents a novel generative structure, namely Bijective Generative Adversarial Networks in a Distillation framework (DiBiGAN), with Bijective Metric Learning for the image reconstruction task. The contributions of this work are four-fold. (1) Although many metric learning techniques have been introduced in the literature, they are mainly adopted for classification rather than for the reconstruction process. To address the limitations of classifier-based metrics for image reconstruction, we propose a novel Bijective Metric Learning with the bijection (one-to-one mapping) property so that the distances between latent features are equivalent to those between images (see Fig. 1). It therefore provides a more effective and natural metric learning approach to the image reconstruction task. (2) We exploit different aspects of the distillation process for the image reconstruction task in a blackbox mode. They include distilled knowledge from the blackbox face matcher and ID knowledge extracted from a real face structure. (3) We introduce a Feature-Conditional Generator Structure with an Exponential Weighting Strategy for a Generative Adversarial Network (GAN)-based framework to learn a more robust generator that synthesizes realistic faces with ID preservation. (4) Evaluations on benchmarks against various face recognition engines illustrate the improvements of DiBiGAN in both image realism and ID preservation. To the best of our knowledge, this is one of the first metric learning methods for image reconstruction (Table 1).
2 Related Work
Synthesizing images [Cole_2017_CVPR, Dosovitskiy_2016_CVPR, nhan2015beyond, duong2019deep, mai2018reconstruction, Zhmoginov2016InvertingFE] has attracted considerable interest from the community. We divide prior work into two groups, i.e. unrestricted and adversarial synthesis.
Unrestricted synthesis. These approaches focus on reconstructing an image from scratch given its high-level representation. Since the mapping is from a low-dimensional latent space to a highly nonlinear image space, several regularizations have to be applied, e.g. Gaussian Blur [yosinski2015understanding] for high-frequency samples or Total Variation [duong2019automatic, conf/cvpr/MahendranV15] for maintaining piece-wise constant patches. These optimization-based techniques are limited by high computational cost or unrealistic reconstructions. Later, Dosovitskiy et al. [Dosovitskiy_2016_CVPR] proposed to reconstruct the image from its shallow (i.e. HOG, SIFT) and deep features using a Neural Network (NN). Zhmoginov et al. [Zhmoginov2016InvertingFE] presented an iterative method to invert FaceNet [schroff2015facenet] features with a feed-forward NN. Cole et al. [Cole_2017_CVPR] proposed an autoencoder structure to map the features to a frontal neutral face of the subject. Yang et al. [Yang:2019:NNI:3319535.3354261] adopted an autoencoder for the model inversion task. Generally, to produce better synthesis quality, these approaches require full access to the deep structure to exploit the gradient information from the embedding process. Mai et al. [mai2018reconstruction] developed a neighborly deconvolutional network to support the blackbox mode. However, with only pixel and perceptual [johnson2016perceptual] losses, ID preservation is limited when synthesizing from different features of the same subject. In this work, we address this issue with Bijective Metric Learning and Distillation Knowledge for the reconstruction task.
Adversarial synthesis. Adversarial approaches aim at generating unnoticeable perturbations of input images to produce adversarial examples that mislead the behaviour of a deep structure. By either directly accessing or indirectly approximating gradients, adversarial examples are created by maximizing a corresponding loss so that they fool a classifier [pmlr-v80-athalye18b, Brunner2019CopyAP, Cheng2019ImprovingBA, pmlr-v80-ilyas18a, ilyas2018prior, Liu2016DelvingIT, MoosaviDezfooli2015DeepFoolAS, Shukla2019BlackboxAA, Thys_2019_CVPR_Workshops]. Ilyas et al. [ilyas2018prior] proposed bandit optimization to exploit prior information about the gradient of deep learning models. Later, Ilyas et al. [pmlr-v80-ilyas18a] introduced Natural Evolutionary Strategies to enable query-efficient generation of blackbox adversarial examples. Other knowledge from the blackbox classifier has also been exploited for this task [Thys_2019_CVPR_Workshops, pmlr-v80-athalye18b, NIPS2018_8052]. Generally, although the approaches in this direction try to extract gradient information from a blackbox classifier, their goal is mainly to mislead the behaviour of the classifier with respect to a predefined set of classes. Therefore, they are closed-set approaches. Meanwhile, the framework proposed in our work can reconstruct faces of subjects that have not been seen in the training process of the classifier.
3 Our Proposed Method
Let the face matcher be a function that maps an input image from the image domain to its high-level embedding feature in the latent domain. In addition, let a classifier take this embedding feature as its input and give the identity (ID) prediction of the subject in a prediction space where each dimension represents a predefined subject class.
Definition 1 (Model Inversion). Given a blackbox matcher and a blackbox classifier, and a prediction score vector extracted from an unknown image, the goal of model inversion is to recover the image from that score vector by optimizing an objective defined with some type of distance metric.
The approaches solving this problem usually exploit the relationship between an input image and its class label for the reconstruction process. Moreover, since the output score is fixed according to predefined classes, the reconstruction is limited to images of the training subject IDs.
Definition 2 (Feature Reconstruction). Given a blackbox matcher and the embedding feature of an unknown image, feature reconstruction is to recover the image from that feature by optimizing a reconstruction objective.
Compared to the model inversion problem, Feature Reconstruction is more challenging since the constraints on predefined classes are removed. Therefore, the solution to this problem operates in an open-set mode where it can reconstruct faces other than the ones used for learning the matcher, i.e. the face recognition engine. Moreover, since the parameters of the matcher are inaccessible due to its blackbox setting, directly recovering the image based on the matcher’s gradient is impossible. Therefore, the feature reconstruction task can be reformulated via a function (the generator) that acts as the reverse mapping of the matcher: the parameters of the generator are chosen to minimize the expected distance between each reconstructed image and its real image, where the expectation is taken over the probability density function of the images (Eqn. (1)).
In other words, this density indicates the distribution that the image belongs to (i.e. the distribution of the training data of the matcher). Intuitively, the generator can be seen as a function that maps from the embedding space back to the image space such that all reconstructed images are kept close to their real images with respect to the distance metric. To produce “good quality” synthesis (i.e. realistic images with ID preservation), different choices for the distance metric have been commonly exploited [isola2017image, johnson2016perceptual, mai2018reconstruction], such as the pixel difference in the image domain via a pixel-wise distance; the Probability Distribution Divergence (i.e. an Adversarial loss defined via an additional Discriminator) for realistic penalization; or the Perceptual distance that penalizes the image structure in a high-level latent space. Among these metrics, except for the pixel difference, which is computed directly in the image domain, the others are indirect metrics that require another mapping function (i.e. a classifier) from the image space to a latent space.
3.1 Limitations of Classifier-based Metrics
Although these indirect metrics have shown their advantages in several tasks, there are limitations when only the blackbox function and its embedded features are given.
Limitation 1. As shown in several adversarial attack works [engstrom2019learning, santurkar2019computer], since the matcher is not a one-to-one mapping from the image domain to the latent domain, it is straightforward to find two images with similar latent representations that are drastically different in image content. Therefore, with no prior knowledge about the subject ID of an image, reconstructing it from scratch may easily fall into the case where the reconstructed image is totally different from the original but has similar embedding features. The current Probability Distribution Divergence with Adversarial Loss or Perceptual Distance is limited in maintaining the constraint that “the reconstructions of features of the same subject ID should be similar”.
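Limitation 1 can already be seen with a toy embedding. The sketch below is purely illustrative: `embed` is a hypothetical many-to-one projection of a 4-pixel “image” to 2 features, not a real face matcher, and the numbers are made up. Because the feature has fewer dimensions than the image, two images that differ in every pixel can share exactly the same feature.

```python
# Toy illustration (hypothetical embedding, not a real face matcher):
# a many-to-one mapping lets two very different "images" share one feature.

def embed(image):
    """Project a 4-pixel image to 2 features by summing pixel pairs."""
    return (image[0] + image[1], image[2] + image[3])

def pixel_distance(a, b):
    """Squared Euclidean distance in the image domain."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

image_a = [1.0, 0.0, 0.0, 1.0]
image_b = [0.0, 1.0, 1.0, 0.0]

same_feature = embed(image_a) == embed(image_b)  # the two features coincide
image_gap = pixel_distance(image_a, image_b)     # yet every pixel differs
print(same_feature, image_gap)
```

A reconstruction driven only by feature similarity cannot tell `image_a` from `image_b`, which is the failure mode a one-to-one metric is meant to avoid.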
Limitation 2. Since access to the structure and intermediate features of the matcher is unavailable in the blackbox mode, the generator is unable to directly exploit valuable information from the matcher’s gradient and from the intermediate representations produced during embedding. As a result, the distance metrics defined via the matcher, i.e. perceptual distance, are less effective than in the whitebox setting. The next sections introduce two loss functions that tackle these problems to learn a robust generator.
3.2 Bijective Metrics for Image Reconstruction
Many metric learning proposals for face recognition [deng2019arcface, LiuNIPS18, liu2017sphereface, wang2018cosface, wen2016discriminative, zhang2019adacos] have been used to improve both intra-class compactness and inter-class separability with a large margin. However, for feature reconstruction, directly adopting these metrics, e.g. angular distance, so that reconstructed images cluster with images of the same ID is infeasible.
Therefore, we propose a bijection metric for the feature reconstruction task such that the mapping function from the image space to the latent space is one-to-one. The distance between latent features is then equivalent to the distance between images. In this way, these metrics are better aligned with the image domain and can be effectively adopted for the reconstruction task. Moreover, since two different images cannot be mapped to the same latent features, the metric learning process is more reliable. The optimization of the generator in Eqn. (1) is rewritten as Eqn. (2), in which the expectation is taken over a density function estimated from an alternative large-scale face dataset. Notice that although the true image density is not accessible, this approximation can be practically adopted thanks to the prior knowledge that the images drawn from it are facial images.
Let a bijection map the image domain to a latent variable domain. With the bijective property, the optimization in Eqn. (2) is equivalent to Eqn. (3), in which the image density is expressed through the latent prior density by the change of variable formula, using the Jacobian of the bijection with respect to the image, and the distance metric is defined in the latent space. Intuitively, Eqn. (3) indicates that instead of computing the distance and estimating the density directly in the image domain, the optimization process can be equivalently accomplished via the distance and density in the latent space, according to the bijective property of the mapping.
The prior distributions. In general, there are various choices for the prior distribution, and the ideal one should have two properties: (1) simplicity in density estimation, and (2) ease of sampling. Motivated by these properties, we choose a Gaussian distribution for the prior. Notice that other distribution types are still applicable in our framework.
The distance metric. With the choice of a Gaussian prior, the distance between images in the image domain is equivalent to the deviation between Gaussians in the latent space. Therefore, we can effectively define the latent distance as the squared Wasserstein coupling distance between two Gaussian distributions, which is computed from the means and covariances of the two distributions. The metric can then be extended with image labels to reduce the distance between images of the same ID and enhance the margin between different IDs, using a parameter that controls the margin between classes together with the subject IDs of the two images.
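As a concrete reading of this metric, the squared 2-Wasserstein distance between two Gaussians has a known closed form; for diagonal covariances it reduces to the squared distance between the means plus the squared distance between the standard deviations. The sketch below implements only that diagonal case in plain Python. The margin extension `id_aware_distance` is an assumed contrastive-style form (same-ID pairs are pulled together, different-ID pairs are pushed apart up to a margin `m`); it is an illustration, not necessarily the expression used in this work.

```python
import math

def w2_squared_diag(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between N(mu1, diag(var1)) and
    N(mu2, diag(var2)); this simple form holds because diagonal
    covariance matrices commute."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term

def id_aware_distance(d, same_id, m=1.0):
    """Assumed contrastive-style extension: keep d for same-ID pairs,
    penalise different-ID pairs only while they are closer than m."""
    return d if same_id else max(0.0, m - d)

d = w2_squared_diag([0.0, 0.0], [1.0, 4.0], [3.0, 4.0], [1.0, 1.0])
print(d)
print(id_aware_distance(d, same_id=False, m=30.0))
```

Minimising the returned value shrinks the latent distance for a same-ID pair, while a different-ID pair contributes nothing once it is at least `m` apart.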
Learning the Bijection. In order to effectively learn the bijection, we adopt the structure of the mapping function from [dinh2016density, Duong_2017_ICCV, duong2019learning] as the backbone for its tractable log-determinant computation, with the log-likelihood loss for the training process. Moreover, to further improve the discriminative property of the bijection in the latent space, we propose to exploit the ID labels in its training process. Particularly, given the classes (i.e. IDs) of the training set, we choose one Gaussian distribution per class, each with its own mean and covariance, and enforce the samples of each class to be distributed according to that class’s prior distribution. The bijection is then learned with the resulting log-likelihood loss function.
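The class-conditional log-likelihood can be illustrated with a one-dimensional affine bijection. By the change of variable formula, the log-density of an input equals the log-density of its latent code under the prior plus the log of the absolute Jacobian determinant. The sketch below assumes a hypothetical map `z = a*x + b` and unit-variance Gaussian priors with made-up per-class means; the bijection in this work is a deep invertible network, so this is only a minimal stand-in.

```python
import math

def gaussian_logpdf(z, mean, var):
    """Log-density of a 1-D Gaussian N(mean, var) evaluated at z."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)

def affine_bijection(x, a, b):
    """Invertible map z = a*x + b (requires a != 0) and its log|dz/dx|."""
    return a * x + b, math.log(abs(a))

def log_likelihood(x, label, a, b, class_means, var=1.0):
    """log p(x | label) = log p_Z(h(x) | label) + log|det Jacobian|."""
    z, log_det = affine_bijection(x, a, b)
    return gaussian_logpdf(z, class_means[label], var) + log_det

class_means = {0: -2.0, 1: 2.0}  # hypothetical per-ID prior means
ll_right = log_likelihood(1.0, 1, a=2.0, b=0.0, class_means=class_means)
ll_wrong = log_likelihood(1.0, 0, a=2.0, b=0.0, class_means=class_means)
print(ll_right > ll_wrong)
```

Maximising this quantity over the parameters of the map pushes each sample toward the prior of its own class, which is what makes the latent space discriminative.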
3.3 Reconstruction from Distillation Knowledge
In the simplest approach, the generator can still learn to reconstruct an image by adopting the Perceptual Distance, as in previous works, to compare the features of the real and reconstructed images. However, as mentioned in Sec. 3.1, due to the limited information that can be accessed from the blackbox matcher, the “key” information that makes the perceptual loss effective (i.e. the gradients of the matcher as well as its intermediate representations) is lacking. Therefore, we propose to first distill the knowledge from the blackbox matcher to a “student” function and then take advantage of this knowledge via the student when training the generator. On one hand, via the distillation process, the student can mimic the blackbox matcher by aligning its feature space to that of the matcher while keeping the semantics of the extracted features for reconstruction. On the other hand, with the student, the knowledge about the embedding process of the matcher (i.e. gradients and intermediate representations) becomes more transparent, which maximizes the information that can be exploited from the matcher. Particularly, let the student be the composition of a sequence of sub-components. The knowledge from the matcher can be distilled to the student by aligning their extracted features (Eqn. (6)).
Then the generator is enhanced via the distilled knowledge of both the final embedding features and the intermediate representations through a two-term distillation loss (Eqn. (7)), whose hyper-parameters control the balance between the terms. The first component of this loss penalizes the differences between the intermediate structures of the desired and reconstructed facial images, while the second component validates the similarity of their final features.
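The distillation step relies only on input/output queries to the blackbox. The sketch below shows that idea on a toy problem under stated assumptions: `blackbox_matcher` is a hypothetical hidden linear map standing in for the face matcher, the student is a linear map of the same size, and features are aligned with a squared-error loss optimised by plain stochastic gradient descent. None of these choices is claimed to match the networks or the alignment loss used in this work.

```python
import random

def blackbox_matcher(x):
    """Stand-in for the blackbox: only its outputs can be observed."""
    return [0.5 * x[0] - 1.0 * x[1], 2.0 * x[0] + 0.25 * x[1]]

def student(x, w):
    """Student with accessible parameters w (a 2x2 linear map)."""
    return [w[0][0] * x[0] + w[0][1] * x[1], w[1][0] * x[0] + w[1][1] * x[1]]

def distill(num_steps=2000, lr=0.05, seed=0):
    """Align student features to blackbox features with SGD on squared error."""
    rng = random.Random(seed)
    w = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_steps):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        target, pred = blackbox_matcher(x), student(x, w)
        for i in range(2):
            err = pred[i] - target[i]
            for j in range(2):
                w[i][j] -= lr * 2.0 * err * x[j]  # gradient of err**2 w.r.t. w[i][j]
    return w

w = distill()
probe = [0.3, -0.7]
gap = max(abs(s - t) for s, t in zip(student(probe, w), blackbox_matcher(probe)))
print(gap)
```

Once the student matches the blackbox on queried inputs, its parameters, gradients, and intermediate activations are all available to the generator's training loss, which is the point of the distillation.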
3.4 Learning the Generator
Fig. 2 illustrates our proposed framework with the Bijective Metric and the Distillation Process to learn the generator.
Network Architecture. Given an input image, the generator takes its embedding feature as input and aims to synthesize an image that is as similar to the input image as possible in terms of identity and appearance. We adopt a GAN-based generator structure and optimize it using several criteria, namely the bijective, distillation, adversarial, and reconstruction losses, each weighted by a parameter controlling its relative importance. The adversarial term relies on a discriminator that distinguishes real images from synthesized ones. There are three critical components in our framework: the bijection; the student matcher for ID preservation; and the discriminator for realistic penalization. The discriminator is updated with an objective function that involves the random interpolation distribution between real and generated images [gulrajani2017improved]. Then, the whole framework is trained following a GAN-based minimax strategy.
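The random interpolation distribution of [gulrajani2017improved] is the one on which a gradient penalty is evaluated: a point is sampled on the segment between a real and a generated sample, and the norm of the discriminator's input-gradient at that point is pushed toward one. The sketch below assumes that this is how the interpolation is used here, and it uses a toy critic with an analytic gradient so that it runs without an autodiff library; in practice the gradient would come from the training framework.

```python
import math
import random

def interpolate(real, fake, eps):
    """Point on the segment between a real and a generated sample."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]

def critic_grad(x, w):
    """Analytic input-gradient of the toy critic D(x) = 0.5 * sum(w_i * x_i**2)."""
    return [wi * xi for wi, xi in zip(w, x)]

def gradient_penalty(real, fake, w, rng):
    """(||grad D(x_hat)||_2 - 1)**2 at a random interpolate x_hat."""
    x_hat = interpolate(real, fake, rng.random())
    norm = math.sqrt(sum(g * g for g in critic_grad(x_hat, w)))
    return (norm - 1.0) ** 2

rng = random.Random(0)
penalty = gradient_penalty([1.0, 0.0], [0.0, 1.0], w=[1.0, 1.0], rng=rng)
print(penalty)
```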
Learning Strategies. Besides the losses, we introduce a Feature-Conditional Structure for the generator and an Exponential Weighting Strategy to adaptively schedule the importance factors between loss terms during the training process.
Feature-conditional Structure. A natural design for the generator is to directly use the embedding feature as its input. However, this structure limits the learning capability of the generator. Particularly, besides ID information, the feature may include other “background” conditions such as poses, illuminations, and expressions. Therefore, setting the feature as the only input implicitly enforces the generator to “strictly” model these factors as well. This makes the training process of the generator less effective. To relax this constraint, we introduce a Feature-Conditional structure (i.e. the generator structure in Fig. 2) where a random variable is adopted as an additional input so that these background factors can be modeled through it. Moreover, we propose to use the random variable as the direct input to the generator and to inject the information from the embedding feature throughout the structure. In this way, the embedding feature can act as the conditional ID-related information for all reconstruction scales and gives better synthesis.
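Feature injection of this kind is commonly implemented with adaptive instance normalisation (AdaIN), which the experimental section names as the injection operator. The sketch below shows the operator on a single feature channel in plain Python. How the `scale` and `bias` are derived from the embedding feature is an assumption here (they are simply passed in as numbers); in a full model they would be predicted by the fully connected conditioning branch.

```python
import math

def adain(content, scale, bias, eps=1e-5):
    """Adaptive instance normalisation of one feature channel:
    normalise `content` to zero mean and unit variance, then apply the
    scale and bias supplied by the conditioning feature."""
    n = len(content)
    mean = sum(content) / n
    var = sum((c - mean) ** 2 for c in content) / n
    std = math.sqrt(var + eps)
    return [scale * (c - mean) / std + bias for c in content]

channel = [0.0, 2.0, 4.0, 6.0]  # activations driven by the random input
styled = adain(channel, scale=3.0, bias=1.0)
out_mean = sum(styled) / len(styled)
out_std = math.sqrt(sum((s - out_mean) ** 2 for s in styled) / len(styled))
print(round(out_mean, 6), round(out_std, 3))
```

The spatial pattern of the channel comes from the random input, while its statistics are overwritten by the conditioning values, which is why the same identity feature can steer every scale of the generator.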
Exponential Weighting Strategy. The progressive growing training strategy [karras2017progressive] initializes its learning process by synthesizing low-resolution images and then gradually increases their level of detail, which is quite effective for enhancing the details of generated images in general. However, this strategy has limited capability in preserving the subject ID. In particular, in the early stages at low scales with blurry synthesis, it is difficult to control the subject ID of the faces to be synthesized, while in the later stages at higher scales, when the generator becomes more mature and learns to add more details, the IDs of those faces have already been constructed and become hard to change. Therefore, we propose to adopt an exponential weighting scheme for (1) emphasizing ID preservation in the early stages; and (2) enhancing the realism in the later stages. Particularly, the loss weights are set as functions of the current scale of the training stage and of the maximum scale to be learned by the generator.
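One way such a schedule could look is sketched below. The functional form, the rate `alpha`, and the two weight names are all hypothetical and chosen only to show the intended behaviour (ID-related terms dominate at the first scales, realism terms dominate at the last ones); they should not be read as the weights used in this work.

```python
import math

def loss_weights(scale, max_scale, alpha=1.0):
    """Hypothetical exponential schedule over training scales 1..max_scale
    (max_scale must be greater than 1): the ID-related weight decays and
    the realism weight grows as the scale increases."""
    progress = (scale - 1) / (max_scale - 1)      # 0 at first scale, 1 at last
    w_id = math.exp(-alpha * progress)            # emphasised early
    w_real = math.exp(-alpha * (1.0 - progress))  # emphasised late
    return {"id": w_id, "realism": w_real}

schedule = [loss_weights(s, max_scale=5) for s in range(1, 6)]
print(schedule[0], schedule[-1])
```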
4 Experimental Results
We qualitatively and quantitatively validate our proposed method in both reconstructed quality and ID preservation in several in-the-wild modes such as poses, expressions, and occlusions. Both image-quality and face-matching datasets are used for evaluations. Different face recognition engines are adopted to demonstrate the robustness of our model.
Data Setting. Our training data includes the publicly available Casia-WebFace [yi2014learning] with 490K labeled facial images of over 10K subjects. Subjects duplicated between the training and testing sets are removed to ensure that there is no overlap between them. For validation, we adopt the testing split of 10K images from CelebA [liu2015faceattributes], which is commonly used for attribute learning and image quality evaluation, to validate the reconstruction quality. For ID preservation, we explore LFW [huang2008labeled], AgeDB [moschoglou2017agedb], and CFP-FP [sengupta2016frontal], which provide face verification protocols against different in-the-wild face variations. Since each face matcher engine requires a different preprocessing procedure, the training and testing data are aligned to the required template accordingly.
Network Architectures. We exploit the Generator structure of PO_GAN [karras2017progressive] with 5 convolutional blocks for the generator, while the Feature-Conditional branch consists of 8 fully connected layers. The discriminator includes five consecutive blocks of two convolution operators and one downsampling operator. In the last block of the discriminator, the minibatch-stddev operator followed by convolution and fully connected layers is also adopted. The AdaIN operator [huang2017arbitrary] is applied at each feature injection node. For the bijection, we set a configuration of 5 sub-mapping functions, each of which is represented by two 32-feature-map residual blocks. This structure is trained using the log-likelihood objective function on Casia-WebFace. ResNet-50 [he2016deep] is adopted for the student matcher.
Our framework is implemented in TensorFlow and all the models are trained on a machine with four NVIDIA P6000 GPUs. The batch size is set based on the resolution of the output images: it starts from an initial value at the very first output resolution and is divided by two whenever the image resolution is doubled. We use the Adam optimizer, and the loss weights are set experimentally to maintain balanced values between loss terms.
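The batch-size rule can be written as a small helper. The starting resolution and starting batch size used below (4x4 images with a batch of 128) are hypothetical values picked for illustration only; the rule itself (halve the batch each time the resolution doubles) is the one described above.

```python
def batch_size_for(resolution, base_resolution, base_batch_size):
    """Halve the batch size each time the output resolution doubles."""
    batch, res = base_batch_size, base_resolution
    while res < resolution:
        res *= 2
        batch = max(1, batch // 2)  # never drop below one sample per batch
    return batch

# Hypothetical starting point: 4x4 images with a batch of 128.
schedule = {r: batch_size_for(r, 4, 128) for r in (4, 8, 16, 32, 64)}
print(schedule)
```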
Ablation Study. To study the effectiveness of the proposed bijective metric for the image reconstruction task, we conduct an ablation study on MNIST [lecun1998mnist] with LeNet [lecun1998gradient] as the matcher. We also use the whitebox mode, where the matcher is used directly in the distillation loss, to remove the effects of other factors. Then 50K training images from MNIST and their feature vectors are used to train the generator. Notice that, given the small MNIST image size, the network structures are configured with three convolutional blocks. The resulting distributions of synthesized testing images of all classes without and with the bijective metric are plotted in Fig. 3. Compared to the generator learned with only classifier-based metrics (Fig. 3(a)), the one with bijective metric learning (Fig. 3(b)) is supervised with a more direct metric learning mechanism in the image domain and, therefore, shows the advantage of enhanced intra- and inter-class distributions.
4.1 Face Reconstruction Results
This section demonstrates the capability of our proposed method to effectively synthesize faces from a subject’s features. To train DiBiGAN, we adopt ArcFace-ResNet100 [deng2019arcface], trained on 5.8M images of 85K subjects, as the matcher and extract the feature vectors for all training images. These features together with the training images are then used to train the whole framework. We divide the experiments into two settings, i.e. whitebox and blackbox, where the main difference is the visibility of the matcher structure during the training process. In the whitebox mode, the matcher is directly used in Eqn. (7) to evaluate the distillation loss, while in the blackbox mode, the student matcher is learned from the matcher through a distillation process as in Eqn. (6) and used for that loss. The first row of Table 2 shows the matching accuracy of the matcher and of the student matcher using real faces on benchmarking datasets.
Face Reconstruction from features of frontal faces. After training, given only the deep features extracted by the matcher from the testing images, the generator is applied to synthesize the subjects’ faces. Qualitative examples of our synthesized faces in comparison with other methods are illustrated in Fig. 4. As can be seen, our generator is able to reconstruct realistic faces even when their embedding features are extracted from faces with a wide range of in-the-wild variations. More importantly, in both whitebox and blackbox settings, our proposed method successfully preserves the ID features of these subjects. In the whitebox setting, since the structure of the matcher is accessible, the learning process can effectively exploit different aspects of its embedding process and produce a generator that better depicts the facial features of the real faces. For example, together with ID information, poses, glasses, or hair styles from the real faces can also be recovered. On the other hand, although the accessible information is very limited in the blackbox setting, the learned generator can still benefit from the distilled knowledge of the student matcher and effectively close the knowledge gap to the whitebox setting. Compared to different configurations of NBNet [mai2018reconstruction], our proposed model obtains better faces in terms of image quality and ID preservation.
| ||White-box Reconstruction||Black-box Reconstruction|
|Real Faces (we report the accuracy of the original matcher for the whitebox setting and of the student matcher for the blackbox setting)||0.305||3.008||99.78%||98.40%||97.10%||0.305||3.008||99.70%||96.80%||93.10%|
|(A) PO_GAN [karras2017progressive]||0.331||2.226||68.20%||63.42%||68.89%||0.315||2.227||66.63%||62.37%||65.59%|
Effect of expressions and occluded regions. Fig. 5 illustrates our synthesis from features of faces that contain both expressions and occlusions. Similar to the previous experiment, our model robustly depicts realistic faces with ID features similar to those of the real faces. The quality of these reconstructed faces again exceeds that of NBNet in terms of both realism and ID preservation. Notice that the success in robustly handling these challenging factors comes from two properties: (1) the matcher was trained to ignore those facial variations in its embedding features; and (2) both bijective metric learning and the distillation process can effectively exploit the necessary knowledge from the matcher as well as from real face distributions in the image domain for the synthesis process.
Effect of different features from the same subject. Fig. 6 illustrates the advantages of our method in synthesizing faces given different feature representations of the same subject. These results further show the advantages of the proposed bijective metric in enhancing the boundary between classes and constraining the similarity between reconstructed faces of the same subject in the image domain. As a result, reconstructed faces from features of the same subject ID not only keep the features of that subject (i.e. are similar to the real faces) but also share similar features with each other.
Effect of the random variable. As mentioned in Sec. 3.4, the random variable is incorporated to model background factors so that the conditioning on the input feature can focus on modeling ID features. Therefore, by fixing the input feature and varying the values of this variable, different conditions of that face can be synthesized as shown in Fig. 7. These results further illustrate the capability of our model structure to capture various factors for the reconstruction process.
4.2 Face Quality and Verification Accuracy
In order to quantitatively validate the realism of our reconstructed images and how well they can preserve the ID of the subjects, three metrics are adopted: (1) Multi-scale Structural similarity (MS-SSIM) [odena2017conditional]; (2) Inception Score (IS) [salimans2016improved]; and (3) face verification accuracy.
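For reference, the Inception Score is the exponential of the average KL divergence between each image's predicted class distribution and the marginal class distribution over all images. The sketch below computes it from a list of probability vectors in plain Python; the two small inputs are made-up examples, and a real evaluation would use the predictions of a pretrained Inception network on the synthesized images.

```python
import math

def inception_score(probs):
    """exp(mean_x KL(p(y|x) || p(y))) from per-image class probabilities."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with zero probability contribute nothing to the KL divergence.
        kl_sum += sum(pj * math.log(pj / marginal[j])
                      for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(kl_sum / n)

confident = [[1.0, 0.0], [0.0, 1.0]]  # sharp and diverse predictions
uniform = [[0.5, 0.5], [0.5, 0.5]]    # uninformative predictions
print(inception_score(confident), inception_score(uniform))
```

The score grows when individual predictions are confident and the set of predictions is diverse, and it falls to one when every prediction equals the marginal.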
Image quality. To quantify the realism of the reconstructed faces, we synthesize the testing images of CelebA in several training configurations as shown in Table 2, where each loss function of the generator objective is cumulatively enabled on top of the previous configuration. Then the MS-SSIM and IS metrics are applied to measure their image quality. We also compare our model in both whitebox and blackbox settings with other baselines including the PO_GAN structure [karras2017progressive] and NBNet [mai2018reconstruction]. Notice that we only adopt the adversarial and reconstruction losses for configs (A) and (D). For all configs (A), (B), and (C), the PO_GAN baseline takes only the embedding features as its input. These results show that in all configurations, our method maintains reconstruction quality comparable to that of PO_GAN and very close to that of real faces. Moreover, our synthesis consistently outperforms NBNet in both metrics.
ID Preservation. Our model is evaluated on LFW, AgeDB, and CFP-FP, where one image in each positive pair is substituted by the reconstructed one while the remaining image of that pair is kept as the reference real face. The matching accuracy is reported in Table 2. These results further demonstrate the advantages and contributions of each component in our framework. Compared to the PO_GAN structure, our Feature-Conditional Structure gives more flexibility in modeling ID features and achieves better matching accuracy. In combination with the distilled knowledge from the matcher, the Generator produces a big jump in accuracy and closes the gap to real faces to only 2.02% and 2.72% on LFW in the whitebox and blackbox settings, respectively. By further incorporating the bijective metric, these gaps are reduced to only 0.6% and 0.65% for the two settings.
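The substitution protocol can be sketched as a small verification routine. Everything concrete below is an assumption made for illustration: the two-dimensional features, the cosine-similarity decision rule, and the threshold of 0.8 are hypothetical, and the official benchmark protocols define their own pair lists and threshold selection.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def verification_accuracy(pairs, threshold):
    """pairs: (feature_a, feature_b, is_same_subject) triples.
    A pair is predicted 'same' when its cosine similarity >= threshold."""
    correct = sum((cosine(a, b) >= threshold) == same for a, b, same in pairs)
    return correct / len(pairs)

# Hypothetical features: in each positive pair the first entry stands for the
# matcher's feature of a reconstructed face, the second for the real reference.
pairs = [
    ([0.9, 0.1], [1.0, 0.0], True),
    ([0.2, 0.8], [0.0, 1.0], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([0.6, 0.6], [1.0, 0.0], True),  # poorly preserved identity
]
acc = verification_accuracy(pairs, threshold=0.8)
print(acc)
```

A reconstruction that loses the identity, like the last positive pair, falls below the threshold and lowers the accuracy, so the gap to the real-face accuracy directly measures ID preservation.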
Reconstructions against different Face Recognition Engines. To illustrate the accuracy of our proposed structure, we validate its performance against different face recognition engines as shown in Table 3. All Generators are set to the blackbox mode and only the final extracted features are accessible. Our reconstructed faces are able to maintain the ID information and achieve accuracy competitive with that of real faces. This performance again emphasizes the accuracy of our model in capturing the behaviours of the feature extraction functions and in providing high-quality reconstructions.
This work has presented a novel generative structure with Bijective Metric Learning for the feature reconstruction problem to unveil the subjects’ faces given their deep blackboxed features. Thanks to the introduced Bijective Metric and Distillation Knowledge, our DiBiGAN effectively maximizes the information that can be exploited from a given blackbox face matcher. Experiments on a wide range of in-the-wild face variations against different face matching engines demonstrate the advantages of our method in synthesizing realistic faces that preserve the subject’s visual identity.