Two-dimensional face recognition has become extremely popular as it can be ubiquitously deployed and large datasets are available. In the past several years, tremendous progress has been achieved in making 2D approaches more robust and useful in real-world applications. Though 2D face recognition has surpassed human performance in certain conditions, challenges remain to make it robust to facial poses, uncontrolled ambient illumination, aging, low-light conditions, and spoofing attacks [Ding2016, kemelmacher2016megaface, Nech2017, taigman2014deepface]. In the present work we address some of these issues by enhancing the captured RGB facial image with 3D information as illustrated in Figure 1.
High-resolution cameras have become ubiquitous, yet 2D face recognition only needs a facial image of moderate or low resolution. For example, the front cameras of recent phones have a very high resolution (e.g., pixels), while the resolution of the input to most face recognition systems is limited to pixels [arcface, parkhi2015deep, schroff2015facenet, taigman2014deepface, Zulqarnain2018]. This means that, in the context of face recognition, we are discarding most of the resolution of the captured images. We propose to use the discarded portion of the spectrum and extract real 3D information by projecting a high-frequency light pattern. A low-resolution version of the RGB image then remains approximately invariant, allowing the use of standard 2D approaches, while 3D information is extracted efficiently from the local deformation of the projected pattern.
The proposed solution to extract 3D facial features has key differences from the two common approaches presented in the existing literature: 3D hallucination [eigen2014depth, Huber2015, Liu2015, Pini2018] and 3D reconstruction [Zafeiriou2013, Zou2005]. We discuss these differences in detail in the following section. Figure 2 illustrates the main limitation of 3D hallucination in the context of face recognition: a standard RGB input image lacks real 3D information. We demonstrate that it is possible to extract actual 3D facial features, bypassing the ill-posed problem of explicit depth estimation. Our contributions are summarized as follows:
By analyzing the spectral content of thousands of facial images, we design a high-frequency light pattern that simultaneously allows us to retrieve a standard low-resolution 2D facial image and a 3D gradient facial representation.
We propose an effective and modular solution that achieves 2D and 3D information decomposition and facial feature extraction in a data-driven fashion (bypassing a 3D facial reconstruction).
We show that by defining an adequate distance function in the space of the feature embedding, we can leverage the advantages of both 2D and 3D features. We can transparently exploit existing state-of-the-art 2D methods and improve their robustness, e.g., to spoofing attacks.
2 Related Work
Recognizing or validating the identity of a subject from a 2D color photograph is a longstanding problem in computer vision that has been studied for over forty years [kaya1972basic, zhao2003face]. Prominent examples of 2D methods include ArcFace [arcface], VGG-Face [parkhi2015deep], DeepFace [taigman2014deepface], and FaceNet [schroff2015facenet].
In spite of this, spoofing attacks and variations in pose, expression, and illumination are still active challenges, and significant efforts are being made to address them [Cao2018, Hayat2017, He2018, Kumar2018, Lezama2017, Liu2018, Tran2017, Yu2017, Zhao2018, Zou2005]. For example, Deng et al. [Deng2018] attempt to handle large pose discrepancies between samples. To that end, they propose an adversarial facial UV map completion GAN. Complementing previous approaches that seek robust feature representations, several works propose more robust loss and metric functions [Liu2017, Wang2018].
3D hallucination from single RGB.
A common trend to enhance 2D approaches is to hallucinate, from an input RGB image, a 3D representation that is then used to extract 3D features [blanz2003face, Dou2017, eigen2014depth, Huber2015, Liu2015, Pini2018]. For example, Cui et al. [Cui2018] introduce a cascade of networks that simultaneously recover depth from an RGB input while seeking separability of individual subjects. The estimated depth information is then used as a complementary modality to RGB.
3D face recognition.
The approaches described previously share an important practical advantage that is at the same time their weakness: they extract all the information from a standard (RGB) 2D photograph of the face. As depicted in Figure 2, a single image does not contain actual 3D information. To overcome this intrinsic limitation, different ideas have been proposed, and datasets with 3D facial information are becoming more popular [Zulqarnain2018]. For example, Zafeiriou et al. [Zafeiriou2013] propose a four-light source photometric stereo (PS). A similar idea is elaborated by Zou et al. [Zou2005], who propose to use active near-infrared illumination and combine a pair of input images to extract an illumination invariant face representation.
Despite the previously mentioned techniques, performing a 3D facial reconstruction is still a challenging and complicated task. Many strategies have been proposed to tackle this problem, including time-delay-based [marks1992system], image-cue-based [eigen2015predicting, eigen2014depth, laina2016deeper, prados2006shape, saxena2009make3d], and triangulation-based methods [ayubi2010pulse, DiMartino2015one, li2014some, Rosman2016, zhang2006high]. Although there has been great recent development, the available technology for 3D scanning is still too complicated to be ubiquitously deployed [di2018one, hartley2003multiple, zhang2010recent, zhang2013handbook].
The proposed solution has two key features that make it, to the best of our knowledge, different from existing alternatives. (a) Because the projected pattern is of a high spatial frequency, we can recover a standard (low resolution) RGB facial image that can be fed into state-of-the-art 2D face recognition methods. (b) We avoid the complicated task of 3D facial reconstruction and instead, extract local 3D features from the local deformation of the projected pattern. In that sense our ideas can be implemented exploiting existing and future 2D solutions. In addition, our approach is different from those that hallucinate 3D information. As discussed before and illustrated in Figure 2 this task requires a strong prior of the scene which is ineffective, for example, if a spoofing attack is presented (see the example provided in Figure 15 in the supplementary material).
3 Proposed Approach
Let denote the space of images with pixels and color channels, and a space of n-dimensional column vectors (in the context of this work associated to a facial feature embedding). denotes the set of RGB images (), while is used to denote the space of two-channel images () associated to the gradient of a single-channel image . (The first/second channel represents the partial derivative with respect to the first/second coordinate.)
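As a minimal illustration of the two-channel gradient representation described above, the following sketch builds such an image from a single-channel depth map with NumPy. The function name `depth_gradient` and the toy plane are hypothetical and only serve the example; finite differences stand in for the partial derivatives.

```python
import numpy as np

def depth_gradient(depth: np.ndarray) -> np.ndarray:
    """Return an (H, W, 2) array holding the partial derivatives of `depth`.

    Channel 0 is the derivative along the first array axis and channel 1
    the derivative along the second axis (finite-difference approximation).
    """
    d_first, d_second = np.gradient(depth.astype(np.float64))
    return np.stack([d_first, d_second], axis=-1)

# Toy example (hypothetical data): the plane z = 2*i + 3*j has the
# constant gradient (2, 3) everywhere.
i, j = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
plane = 2.0 * i + 3.0 * j
grad = depth_gradient(plane)
```

Any other discrete derivative (e.g., forward differences or Sobel filters) would fit the same two-channel layout.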
Combining depth and RGB information.
The proposed approach consists of three main modules as illustrated in Figure 3: performs a decomposition of the input image into texture and depth information, , and extract facial features associated to the facial texture and depth respectively. These three components are illustrated in Figure 3 in blue, yellow, and green, respectively. (We decided to have three modules instead of a single end-to-end design for several reasons that will be discussed below.)
We denote the facial feature extraction from the input image as , where with . The subscript represents the parameters of the mapping , which can be decomposed into three groups , associated with the image decomposition, RGB feature extraction, and depth feature extraction, respectively. In the following we discuss how these parameters are optimized for each specific task, which is one of the advantages of formulating the problem in a modular fashion.
Once texture and depth facial information is extracted into a suitable vector representation (as illustrated in Algorithm 1), we can select a distance measure to compare facial samples and estimate whether they have a high likelihood of belonging to the same subject or not. It is worth noticing that faces are embedded into a space in which the first half of the dimensions are associated to information extracted from the RGB representation while the other half codes depth information. These two sources of information may have associated different confidence levels (depending on the conditions at deployment). We address this in detail in Section 3.3 and propose an anisotropic distance adapted to our solution, and capable of leveraging the good performance of 2D solutions in certain conditions, while improving robustness and handling spoofing attacks in a continuous and unified fashion.
3.1 Pattern design.
When a pattern of light is projected over a surface with a height map , it is perceived by a camera located along the -axis with a deformation given by (). A detailed description of active stereo geometry is provided in the supplementary material Section B. Let us denote the image we would acquire under homogeneous illumination, and the intensity profile of the projected light. Without loss of generality we assume the system baseline is parallel to the axis. The image acquired by the camera when the projected light is modulated with a profile is
We restrict ourselves to periodic modulation patterns, let denote the pattern spatial period, and define . To simplify the system design and analysis, let us also restrict ourselves to periodic patterns that are invariant to the coordinate. Under these conditions we can express where represent the coefficients of the Fourier series of . (Note that because of the invariance with respect to the coordinate, the coefficients are constant instead of a function of .) Equation (1) can be expressed as
Defining , Equation (2) can be expressed as [takeda1983fourier]
Applying the 2D Fourier Transform (FT) to both sides of Equation (3) and using standard properties of the FT [distributions], we obtain
We denote as the FT of and use to represent the 2D frequency domain associated to and axis respectively.
Equation (4) shows that the FT of the acquired image can be decomposed into the components centered at . In the context of this section, we refer to a function being smooth if
Assuming and are smooth (we empirically validate this hypothesis below), the components can be isolated as illustrated in Figure 4. The central component is of particular interest: it captures the facial texture information and can be recovered from if is large enough (we provide a more precise quantitative analysis in what follows). On the other hand, relative (gradient) 3D information can be retrieved from the components , as we show in Proposition 1.
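For orientation, the classical Fourier fringe-analysis model of [takeda1983fourier] can be sketched as follows. The symbols used here ($I_0$ for the image under homogeneous illumination, $a_n$ for the Fourier coefficients of the pattern, $f_0$ for its fundamental frequency, $\phi$ for the phase shift induced by the surface height, and the hat for the FT) are notation assumed for this illustration and may differ from the one used in Equations (1)-(4).

```latex
% Assumed notation (hypothetical): I_0 texture under homogeneous light,
% f_0 fundamental frequency of the fringes, \phi(x,y) height-induced phase.
\begin{align}
I(x,y) &= I_0(x,y) \sum_{n=-\infty}^{\infty} a_n \,
          e^{\, i n \left( 2\pi f_0 x + \phi(x,y) \right)} \\
       &= \sum_{n=-\infty}^{\infty} q_n(x,y) \, e^{\, i 2\pi n f_0 x},
       \qquad q_n(x,y) = a_n \, I_0(x,y) \, e^{\, i n \phi(x,y)}, \\
\hat{I}(f_x, f_y) &= \sum_{n=-\infty}^{\infty} \hat{q}_n(f_x - n f_0,\, f_y).
\end{align}
```

In this form the term $n = 0$ carries only the texture $a_0 I_0$, while the terms $n = \pm 1$ carry the phase $\phi$ and are shifted to $\pm f_0$; when $I_0$ and $\phi$ vary slowly relative to $f_0$, the lobes do not overlap and can be separated.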
Proposition 1. Gradient depth information is encoded in the components .
Proof. We define the wrapping function . This function wraps the real set into the interval [ghiglia1998unwrapping]. This definition can be extended to vector inputs by wrapping the modulus of the vector field while keeping its direction unchanged, i.e., if and if . From and we can compute (Footnote 1: We assume images are extended in an even fashion outside the image domain, to guarantee that and avoid an additional offset term.)
where denotes the wrapped version of . Moreover, with (wrapping introduces shifts whose magnitude is a multiple of ). Computing the gradient on both sides leads to where . Assuming the magnitude of the gradient of is bounded by and considering that , we can apply the wrapping function to both sides of the previous equality to obtain which proves (recall Equation (6)) that the gradient of can be extracted from the components and . To conclude the proof, we use the linearity of the gradient operation and the fact that is proportional to the depth map of the scene (see Equation (12) and Section B in the supplementary material). ∎
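The analytic route of this section can be checked numerically on synthetic data. The sketch below is an illustration under assumed settings, not the implementation used in this work: the image size `N`, the fringe frequency `k0`, the smooth stand-ins `texture` and `phase`, and the ideal rectangular filters are all hypothetical choices. It projects a cosine fringe, recovers the texture from the central spectral lobe, and recovers the phase gradient from the first-order lobe by wrapping finite differences of the wrapped phase.

```python
import numpy as np

def wrap(a):
    """Wrap values into the interval [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

N, k0 = 128, 32                      # image size and fringe frequency in FFT bins (assumed)
rows, cols = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
u = 2 * np.pi * cols / N             # x coordinate (along columns), scaled to [0, 2*pi)
v = 2 * np.pi * rows / N             # y coordinate (along rows)

texture = 1.0 + 0.3 * np.cos(u) * np.cos(v)   # smooth stand-in for the facial texture
phase = 4.0 * np.sin(u) * np.cos(v)           # smooth stand-in for the height-induced phase
image = texture * (1.0 + np.cos(k0 * u + phase))  # fringes along x, deformed by the surface

spectrum = np.fft.fft2(image)
kx = np.fft.fftfreq(N, d=1.0 / N)[None, :]    # integer frequency bins along x

# Central lobe: texture. First-order lobe: 0.5 * texture * exp(i * (k0*u + phase)).
texture_hat = np.fft.ifft2(spectrum * (np.abs(kx) < k0 / 2)).real
lobe = np.fft.ifft2(spectrum * (np.abs(kx - k0) < k0 / 2))
phase_wrapped = np.angle(lobe * np.exp(-1j * k0 * u))   # phase modulo 2*pi

# Wrapped finite differences give the true gradient while it stays below pi.
grad_x_hat = wrap(np.diff(phase_wrapped, axis=1))
grad_y_hat = wrap(np.diff(phase_wrapped, axis=0))
```

Note that `phase` exceeds pi in magnitude, so `phase_wrapped` itself is ambiguous, yet its wrapped differences still match the differences of the true phase. The ideal filters work here only because the synthetic signals are periodic and band-limited; on real images the filter design is considerably harder.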
Analytic versus data-driven texture and gradient depth extraction
The previous analysis shows that closed-form expressions can be obtained to extract texture and depth gradient information. However, computing these expressions requires isolating the different spectral components . To that end, filters need to be carefully designed. The design of these filters is challenging: for example, one needs to balance over-smoothing against introducing ringing artifacts, which are drastically amplified by a subsequent gradient computation [DiMartino2015one, zhang2006high]. To overcome these challenges, we chose to perform the depth (gradient) and texture decomposition in a data-driven fashion, which, as we show in Section 4, provides an efficient and effective solution.
Bounds on and optimal spectral orientation.
As discussed above, the projected pattern should have a large fundamental frequency . In addition, the orientation of the fringes and the system baseline can be optimized if faces present a narrower spectral content in a particular direction. We study the texture and depth spectrum of the facial images of the ND-2006 dataset (this dataset provides ground truth facial texture and depth information). We observed (see Figure 5) that for facial images sampled at a spatial resolution, most of the energy is concentrated in a third of the discrete spectral domain (observe the extracted one-dimensional profiles of the spectrum shown at the left side of Figure 5). In addition, we observe that the spectral content of facial images is approximately isotropic. See, for example, Figure 5 and observe how the envelope of the 2D spectra is almost constant for 1-dimensional sections across different orientations. We conclude that the orientation of the fringes does not play a significant role in the context of facial analysis. In addition, we conclude that the fringe width should be smaller than mm (distance measured over the face). (Footnote 2: This numerical result is obtained by approximating the bounding box of the face as a region, sampled with pixels, which corresponds to a pixel length of ; a third of the spectral band corresponds to a signal with a period of pixels, which leads to a binary fringe of at least wide.)
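The kind of spectral measurement described above can be sketched as follows. This is an illustrative check on synthetic images, not the analysis performed on ND-2006: the function `energy_fraction`, the convention of removing the mean before measuring energy, and the two test images are assumptions made for the example.

```python
import numpy as np

def energy_fraction(image: np.ndarray, band_fraction: float) -> float:
    """Fraction of (mean-removed) spectral energy whose radial frequency is
    at most `band_fraction` times the Nyquist frequency (0.5 cycles/pixel)."""
    centered = image - image.mean()
    power = np.abs(np.fft.fft2(centered)) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]   # cycles per pixel
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.hypot(fx, fy)
    return float(power[radius <= band_fraction * 0.5].sum() / power.sum())

# Hypothetical inputs: a smooth blob versus white noise.
yy, xx = np.mgrid[-32:32, -32:32]
smooth = np.exp(-(xx ** 2 + yy ** 2) / (2 * 8.0 ** 2))
noise = np.random.default_rng(0).standard_normal((64, 64))

smooth_low = energy_fraction(smooth, 1 / 3)   # close to 1 for smooth content
noise_low = energy_fraction(noise, 1 / 3)     # small for broadband content
```

Restricting the mask to a wedge of orientations instead of a disc would give a simple test of the isotropy observation.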
3.2 Network training and the advantages of modularity.
As described previously, the parameters of the proposed solution can be split in three groups . This is an important practical property and we designed the proposed solution to meet this condition (in contrast to an end-to-end approach).
Let us define , , and three datasets containing ground truth depth information, ground truth identity for RGB facial images, and ground truth identity for depth facial images, respectively. More precisely, , , and , where denotes a (facial or generic) RGB image acquired under the projection of the designed pattern, represents (facial or generic) standard RGB images, denotes a gray image representing the depth of the scene, and a scalar integer representing the subject id.
We denote as the RGB and gradient depth components estimated by the decomposition operation . We partitioned the parameters of into two sets of dedicated kernels , the first group focuses on retrieving the texture component while the second group retrieves the depth gradient. These parameters can be optimized as
(We also evaluated a shared set of kernels trained with a unified loss. This alternative is harder to train in practice, due to the natural difference between the dynamic range and sparsity of gradient images compared with texture images.)
For texture and depth facial feature extraction, we tested models inspired by the Xception architecture [chollet2017xception]. Additional details are provided in the supplementary material Section D. To train these models we add an auxiliary fully connected layer on top of the facial embedding (with as many neurons as identities in the train set) and minimize the cross-entropy between the ground truth and the predicted labels. More precisely, let us denote the output of the fully connected layer associated to the embedding where denotes the probability associated to the id,
where denotes the indicator function. (Of course one can choose other alternative losses to train these modules, see e.g., [arcface, Liu2017, Wang2018, Zheng2018].)
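The training objective described above, a softmax over the auxiliary fully connected layer followed by cross-entropy against the identity label, can be sketched in NumPy as follows. The function name and the toy inputs are hypothetical; this shows the loss only, under the assumption of a standard softmax, and not the networks or the optimizer.

```python
import numpy as np

def softmax_cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean cross-entropy between softmax(logits) and integer identity labels.

    logits: (batch, num_identities) outputs of the auxiliary layer.
    labels: (batch,) ground truth identity indices.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy check (hypothetical values): uniform logits over 4 identities give log(4).
uniform_loss = softmax_cross_entropy(np.zeros((2, 4)), np.array([0, 3]))

# A confident and correct prediction gives a loss close to zero.
confident = np.array([[20.0, 0.0, 0.0, 0.0]])
confident_loss = softmax_cross_entropy(confident, np.array([0]))
```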
As described above, the proposed design allows us to leverage information from three types of datasets (, , ). This has an important practical advantage, as 2D facial and 3D generic datasets are more abundant, and the pattern-dependent set can be of modest size as .
3.3 Distance design.
Once different modules are set we can compute the facial embedding of test images following the procedure described in Algorithm 1. Let us define and the feature embedding of two facial images and respectively. Recall that the first elements of are associated to features extracted from (a recovered) RGB facial image while the remaining elements are associated to depth information, i.e., .
We define the distance between two feature representations , and as
denotes the cosine distance, sets the relative weight of RGB and depth features, and define a non-linear response for the distance between depth features. As we will describe in the following, this provides robustness against common cases of spoofing attacks.
Intuitively, allows us to set the relative confidence associated to RGB and depth features. For example, gives the same weight to RGB and depth features, while () ignores the distance between samples in the depth (RGB) embedding space. This is important in practice, as it is common to have substantially more data to train RGB models than depth ones (). This suggests that in good test conditions (e.g., good lighting) one may trust RGB features more than depth features (). As we will empirically show in the following section, when two facial candidates are compared, is an effective distance choice. However, it does not robustly handle common cases of spoofing attacks. The most common spoofing attacks imitate the facial texture more accurately than the facial depth [OULU_NPU_2017, Liu2018spoofing, Zhang2016]; therefore, the global distance between two samples should be large when the distance between the depth features is large (i.e., above a certain threshold). To that end, we introduce an additional non-linear term controlled by parameters and : for the standard cosine distance dominates, while for large values the distance is amplified in a non-linear fashion.
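To make the behavior described above concrete, the sketch below implements one possible distance with these properties. The functional form is an assumption made for illustration and is not claimed to be Equation (11): a convex combination of the RGB and depth cosine distances plus a power-law penalty on the depth distance. The parameter names `weight`, `threshold`, and `power` are hypothetical stand-ins for the symbols of this section.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity, clipped at 0 to absorb rounding errors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, 1.0 - float(cos))

def combined_distance(x1, x2, weight=0.5, threshold=np.inf, power=1.0):
    """Assumed form: (1 - weight) * d_rgb + weight * d_depth
    + (d_depth / threshold) ** power.

    The first half of each embedding holds RGB features, the second half
    depth features. With threshold = inf the penalty vanishes.
    """
    n = x1.shape[0] // 2
    d_rgb = cosine_distance(x1[:n], x2[:n])
    d_depth = cosine_distance(x1[n:], x2[n:])
    linear = (1.0 - weight) * d_rgb + weight * d_depth
    penalty = (d_depth / threshold) ** power
    return linear + penalty

# Hypothetical embeddings: the spoof copies the RGB half but not the depth half.
genuine = np.array([1.0, 0.0, 1.0, 0.0])
spoof = np.array([1.0, 0.0, 0.0, 1.0])

d_same = combined_distance(genuine, genuine, weight=0.5, threshold=0.5, power=4.0)
d_spoof = combined_distance(genuine, spoof, weight=0.5, threshold=0.5, power=4.0)
d_rgb_only = combined_distance(genuine, spoof, weight=0.0)
```

With only the RGB term (`weight=0.0`, no penalty) the spoof is indistinguishable from the genuine sample, whereas the penalty makes its distance grow sharply once the depth distance exceeds the threshold.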
4 Experiments and Discussion
Three public datasets are used for experimental validation: FaceScrub [FaceScrub], CASIA Anti-spoofing [zhang2012face], and ND-2006 [nd2006]. FaceScrub contains RGB (2D) facial images of different subjects, and is used to train the texture-based facial embedding. The CASIA dataset contains genuine videos (recording a person) and videos of different types of spoofing attacks; the data was collected for subjects. We use this dataset to simulate and imitate the texture properties of images of spoofing attacks. ND-2006 is one of the largest publicly available datasets with 2D and 3D facial information; it contains images of subjects. We used this set to demonstrate that differential 3D features can be extracted from a single RGB input, to compare RGB features with 3D features extracted from the differential 3D input, and to show that when 2D and 3D information is properly combined, the best properties of each can be obtained.
Texture and differential 3D decomposition.
In Section 3.1 we discussed how real 3D information and texture information can be coded and later extracted using a single RGB image. In addition, we argue that this decomposition can be learned efficiently and effectively in a data-driven fashion. To that end, we tested simple network architectures composed of standard convolutional layers (a full description of these architectures and the training protocols are provided as supplementary material). Using ground truth texture and depth facial information, we simulated the projection of the designed pattern over the subjects provided in ND-2006 dataset. Illustrative results are presented in Figure 6 and in the supplementary material. The 3D geometrical model and a detailed description of the simulation process is provided in Section D.1. Though the simulation of the deformation of a projected pattern can be computed in a relatively simple manner (if the depth information is known), the inverse problem is analytically hard [DiMartino2015one, Rosman2016, zhang2013handbook].
Despite this, we observed that a stack of convolutional layers can efficiently learn to infer, from the image with the projected pattern, both the depth gradient information and the standard (2D) facial image. Figure 7 illustrates some results for subjects in the test set. The first column corresponds to the input to the network, the second column to the ground truth texture information, and the third column to the retrieved texture information. The architecture of the network and the training protocol are described in detail in the supplementary material Section D. As we can see in the examples illustrated in Figure 7, an accurate low-resolution texture representation of the face can be achieved in general, and visible artifacts are observed only in the regions where the depth is discontinuous (see, for example, the regions illustrated at the bottom of Figure 7).
Figure 8 illustrates the ground truth and the retrieved depth gradient (again, for random samples from the test set). To estimate the 3D information, we feed the gray version of the input image to a different branch of convolutional layers. These layers are fully described in the supplementary material Table 5. A gray input image is considered instead of a color one because the projected pattern is achromatic, and therefore no 3D information is encoded in the colors of the image. In addition, we crop the input image to exclude the edges of the face. (Facial registration and cropping are performed automatically using dlib [king2009dlib] facial landmarks.) As discussed in Section 3, and in particular in the proof of Proposition 1, the deformation of the projected fringes only provides local gradient information if the norm of the gradient of the depth is bounded. In other words, where the scene presents depth discontinuities, no local depth information can be extracted by our proposed approach. This is one of the main reasons why differential 3D information can be exploited for face recognition, while bypassing the more complicated task of a 3D facial reconstruction.
One of the advantages of the proposed approach is that it extracts local depth information, and therefore, the existence of depth discontinuities does not affect the estimation on the smooth portion of the face. This is illustrated in Figure 9 (a)-(b), where a larger facial patch is fed into the network. The decomposition module is composed exclusively of convolutional layers, and therefore, images of arbitrary size can be evaluated. Figure 9-(a) shows the input to the network, and Figure 9-(b) the first channel of the output (for compactness we display only the x-partial derivative). As we can see, the existence of depth discontinuities does not affect the prediction in the interior of the face (we consider the prediction outside this region as noise and we replace it by for visualization).
Several algorithms have been proposed to hallucinate 3D information from a 2D facial image [eigen2014depth, Huber2015, Liu2015, Pini2018]. In order to verify that our decomposition network is extracting real depth information (in lieu of hallucinating it from texture cues), we simulated an image where the pattern is projected over a surface with identical texture but with a planar 3D shape (as in the example illustrated in Figure 2). Figure 9 (a) shows the image acquired when the fringes are projected over the ground truth facial depth, and (c) when instead the depth is set to (without modifying the texture information). The first component of the output (x-partial derivative) is shown in (b) and (d), as we can see, the network is actually extracting true depth information (from the deformation of the fringes) and not hallucinating 3D information from texture cues. (As we will see next, this property is particularly useful for joint face recognition and spoofing prevention.)
2D and 3D face recognition.
Once the input image is decomposed into a (standard) texture image and depth gradient information, we can proceed to extract 2D and 3D facial features from each component. To this end, state-of-the-art network architectures are evaluated. Our method is agnostic to the RGB and depth feature extractors. Moreover, as the retrieved texture image is close to a standard RGB facial image (in the sense of the L2-norm), any pre-trained 2D feature extractor can be used (e.g., [arcface, parkhi2015deep, schroff2015facenet, taigman2014deepface, Zulqarnain2018]). In the experiments presented in this section we tested a network based on the Xception architecture [chollet2017xception] (details are provided as supplementary material). For the extraction of texture features, the network is trained using the FaceScrub [FaceScrub] dataset (as previously described, this is a public dataset of 2D facial images). The module that extracts 3D facial features is trained using of the subjects of the ND-2006 dataset, leaving the remaining subjects exclusively for testing. The output of each module is a 512-dimensional feature vector (see, e.g., Figure 3), hence the concatenation of 2D+3D features leads to a 1024-dimensional feature vector. Figure 10 illustrates a 2D embedding of the texture features, the depth features, and the combination of both. The 2D mapping is learned by optimizing t-SNE [maaten2008visualizing] over the train partition, then a random subset of test subjects is mapped for visualization. As we can see, 3D features favor the compactness of, and increase the distance between, clusters associated to different subjects.
To test the recognition performance, the images of the test subjects are partitioned into two sets: gallery and probe. For all the images in both sets, the 2D and 3D feature embedding is computed (using the pre-trained networks described before). Then, for each image in the probe set, the n nearest neighbors in the gallery set are selected. The distance between each pair of samples (in the embedding space) is measured using the distance defined in Section 3, Equation (11). For each sample in the probe set, we consider the classification accurate if at least one of the n nearest neighbors is a sample from the same subject. The Rank-n accuracy is the percentage of samples in the probe set accurately classified.
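The Rank-n protocol described above can be sketched as follows, assuming a precomputed probe-to-gallery distance matrix. The function name and the toy distances are hypothetical and only illustrate the metric.

```python
import numpy as np

def rank_n_accuracy(dist, probe_ids, gallery_ids, n):
    """Percentage of probes with at least one same-subject sample among
    their n nearest gallery neighbors.

    dist[i, j] is the distance between probe i and gallery sample j.
    """
    nearest = np.argsort(dist, axis=1)[:, :n]            # indices of the n closest gallery samples
    hits = (gallery_ids[nearest] == probe_ids[:, None]).any(axis=1)
    return 100.0 * hits.mean()

# Toy example (hypothetical values): 3 probes, 3 gallery samples.
dist = np.array([[0.1, 0.5, 0.9],
                 [0.8, 0.2, 0.3],
                 [0.4, 0.6, 0.7]])
gallery_ids = np.array([0, 1, 2])
probe_ids = np.array([0, 2, 2])

rank1 = rank_n_accuracy(dist, probe_ids, gallery_ids, 1)
rank2 = rank_n_accuracy(dist, probe_ids, gallery_ids, 2)
rank3 = rank_n_accuracy(dist, probe_ids, gallery_ids, 3)
```

In this toy case only the first probe is matched at rank 1, the second is recovered at rank 2, and the third only at rank 3.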
Figure 11 and Table 1 show the Rank-n accuracy when only 2D features (), only 3D features (), or a combination of both () is considered. As explained in Section 3.3, the value of can be used to balance the weight of texture and depth features. As we can see, in all cases a combination of texture and depth information outperforms each of them individually. This is an expected result, as classification tends to improve when independent sources of information are combined [Kuncheva2004]. The weight is a hyper-parameter that should be set depending on the conditions at deployment. In our particular experiments the best results are obtained for , which suggests that RGB features are slightly more reliable than depth features. This is also expected, as the module that extracts RGB features is typically trained on much larger datasets (2D facial images have become ubiquitous). We believe this may change if, for example, testing is performed under low-light conditions [Lezama2017]. Testing this hypothesis is a potential path for future research. In the experiments discussed so far, we ignored the role of and (i.e., we set and ). As we discuss in the following, these parameters become relevant to jointly achieve face recognition and spoofing prevention.
[Table 1: Rank-n accuracy for the RGB baseline (), the depth baseline (), and the combination of both.]
Robustness to spoofing attacks.
Spoofing attacks are simulated to test how robust face recognition models are under (unseen) spoofing attacks. As in the present work we focus on the combination of texture-based and depth-based features, the simulation of spoofing attacks must account for realistic texture and depth models. The models for the synthesis of spoofing attacks are described in detail in the supplementary material Section D.3.
Figure 12 illustrates spoofing samples (first four rows) and genuine samples (bottom five rows). The first two columns correspond to the ground truth texture and depth information, the third column illustrates the input to our system, and the last three columns correspond to the outputs of the decomposition network. These last three images are fed into the feature extraction modules for the extraction of texture-based and depth-based features respectively, as illustrated in Figure 3. It is important to highlight that spoofing samples are included exclusively at testing time. In other words, during the whole training process the entire framework is agnostic to the existence of spoofing examples. If the proposed framework is capable of extracting real 3D facial features, it should be inherently robust to the most common types of spoofing attacks.
As discussed before, the combination of texture-based and depth-based features improves recognition accuracy. On the other hand, when spoofing attacks are included, we observe that texture-based features are more vulnerable to them (see, for example, Figures 12 and 14). To simultaneously exploit the best of each feature component, we design a non-linear distance as described in Equation (11). Figure 13 illustrates the properties of the defined distance for different values of and . As can be observed, for genuine samples (relative distances lower than ) the non-linear component can be ignored and the distance behaves as the Euclidean distance with a relative modulation set by . On the other hand, if the distance between the depth components is above the threshold , it dominates the overall distance, achieving a more robust response to spoofing attacks.
To quantitatively evaluate the robustness against spoofing attacks, spoofing samples are generated for all the subjects in the test set. As before, the test set is separated into a gallery and a probe set, and the generated spoofing samples are added to the probe set. For each image in the probe set, the distance to a sample of the same subject in the gallery set is evaluated. If this distance is below a certain threshold , the image is labeled as genuine; otherwise, the image is labeled as spoofing. Comparing the classification label with the ground truth label, we obtain the number of true positives (genuine classified as genuine), false positives (spoofing classified as genuine), true negatives (spoofing classified as spoofing), and false negatives (genuine classified as spoofing). Changing the value of the threshold , we can control the number of false positives versus the number of false negatives, as illustrated in Figure 14.
Figure 14 shows the ratio of false positives and false negatives for . As before, the distance between the samples is computed using the definition provided in Equation (11). The RGB and depth baselines are shown in blue and red, respectively; the other set of curves (displayed in green tones) corresponds to a combination of texture and depth features with and different values of and . In Table 2 the ratio of true positives is reported for a fixed ratio of false positives. The ACER measure (last column) corresponds to the average between the ratios of spoofing and genuine samples misclassified.
[Table 2: TPR (%) at two fixed FPR values and ACER (%) for the RGB baseline (), the depth baseline (), and the combination of both.]
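The evaluation measures used in this experiment can be sketched as follows, following the convention stated above (a probe is accepted as genuine when its distance is below the threshold, and "positive" means genuine). The function names and the toy distances are hypothetical; the threshold search in `tpr_at_fpr` is one simple way to read the TPR at a fixed FPR and is an assumption of this example.

```python
import numpy as np

def rates(dist, is_genuine, threshold):
    """True positive rate and false positive rate at a given threshold."""
    accepted = dist < threshold
    tpr = accepted[is_genuine].mean()       # genuine samples accepted
    fpr = accepted[~is_genuine].mean()      # spoofing samples accepted
    return tpr, fpr

def acer(dist, is_genuine, threshold):
    """Average of the misclassification ratios of spoofing and genuine samples."""
    tpr, fpr = rates(dist, is_genuine, threshold)
    return 0.5 * (fpr + (1.0 - tpr))

def tpr_at_fpr(dist, is_genuine, target_fpr):
    """Best TPR over all thresholds whose FPR does not exceed target_fpr."""
    best = 0.0
    for t in np.append(np.unique(dist), np.inf):
        tpr, fpr = rates(dist, is_genuine, t)
        if fpr <= target_fpr:
            best = max(best, tpr)
    return best

# Toy example (hypothetical distances): 4 genuine and 4 spoofing probes.
dist = np.array([0.1, 0.2, 0.3, 0.9, 0.25, 0.8, 1.0, 1.2])
is_genuine = np.array([True, True, True, True, False, False, False, False])

tpr_strict = tpr_at_fpr(dist, is_genuine, 0.0)
tpr_loose = tpr_at_fpr(dist, is_genuine, 0.25)
acer_mid = acer(dist, is_genuine, 0.5)
```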
Testing variations on the ambient illumination.
To test the impact of variations in lighting conditions, we simulated test samples under different ambient illumination; implementation details are described in the supplementary material Section D.4. Table 3 compares the rank-5 accuracy of 2D features and 2D+3D features as the power of the ambient illumination increases. As described in the supplementary material, the ambient illumination is modeled with random orientation; therefore, the more powerful the illumination is, the more diversity is introduced between the test and the gallery samples.
|RGB baseline ()|
In the present experiments, we assumed that both the projected pattern and the ambient illumination have similar spectral content. In practice, one can project the pattern, e.g., in the infrared band. This would make the system invisible to the user and reduce the sensitivity of 3D features to variations in the ambient illumination. We provide a hardware implementation feasibility study and illustrate how the proposed ideas can be deployed in practice in the supplementary material, Section E.
Improving state of the art 2D face recognition.
To test how the proposed ideas can impact the performance of state-of-the-art 2D face recognition systems, we evaluated our features in combination with texture-based features obtained with ArcFace [arcface]. ArcFace is a powerful method pre-trained on very large datasets; on ND-2006 examples it achieves perfect recognition accuracy (100% rank-1 accuracy). When ArcFace is combined with the proposed 3D features, the accuracy remains excellent (100% rank-1 accuracy), i.e., adding the proposed 3D features does not negatively affect robust 2D solutions. On the other hand, 3D features improve ArcFace under challenging conditions, as we discuss in the following. Interesting results are observed when ArcFace is tested under spoofing attacks: as we show in Table 4, ArcFace fails to detect spoofing attacks. ArcFace becomes more robust when it is combined with 3D features, improving from nearly TPR@FPR() to . In summary, as 2D methods improve and become more accurate, our 3D features do not affect them negatively when they work well, while improving their robustness in challenging situations.
|TPR% @FPR=||TPR% @FPR=||ACER %|
|ArcFace + 3D|
We proposed an effective and modular alternative to enhance 2D face recognition methods with actual 3D information. A high frequency pattern is designed to exploit the high resolution cameras ubiquitous in modern smartphones and personal devices. Depth gradient information is encoded in the high frequency spectrum of the captured image, while a standard texture facial image can be recovered to exploit state-of-the-art 2D face recognition methods. We show that the proposed method can be used to simultaneously leverage 3D information and texture information. This allows us to enhance state-of-the-art 2D methods, improving their accuracy and making them robust, e.g., to spoofing attacks.
Work partially supported by ARO, ONR, NSF, and NGA.
Appendix A Limitations of 3D hallucination in the context of face recognition.
As discussed in Section 2, 3D hallucination methods have intrinsic limitations in the context of face recognition. To complement the example illustrated in Figure 2, here we show 5 3D facial models obtained by hallucinating 3D from a single RGB input image. To that end, we apply the 3D morphable model (extremely popular in facial applications). As we can see, even in the case of a planar spoofing attack, a face-like 3D shape is retrieved, even though it is far from the actual 3D shape of the scene. This is an expected result: as we discuss in Section 2, the problem of 3D hallucination is ill-posed, and therefore priors need to be enforced in order to obtain feasible implementations.
Appendix B Review of Active Stereo Geometry
When a structured pattern of light is projected over a surface, it is perceived with a certain deformation that depends on the shape of the surface. For example, the top left image in Figure 4 is acquired while horizontal stripes are projected over the subject's face. As we can see, the fringes are no longer perceived as horizontal and parallel to each other; this deformation encodes rich information about the geometry of the scene (which will be exploited to enhance 2D face recognition methods). Figure 16 sketches the geometry of this situation as if we were looking at the face from the side (vertical section). On the left, a light source projects a ray of light through the points (red line); when this ray is reflected by a reference plane at the point , it is viewed by the camera at the pixel location represented by . If, instead, light is projected over an arbitrary (non-planar) surface, the reflection is produced at point and viewed by the camera at the shifted position . From now on, we denote the length of the shift as the disparity .
Similarity between triangles - and - leads to and . Defining (baseline between the lighting source and the camera sensor), (camera focal length), (surface local height), (distance of the subject to the camera), and assuming that we obtain,
Equation (12) quantitatively relates the perceived shift (disparity) of the projected pattern and the surface 3D shape . As the goal of the present work is to provide local 3D features instead of performing an actual 3D reconstruction, the particular value of , and are irrelevant as we shall see. Moreover, we exploit the fact that the gradient of the local disparity codes information of the depth gradient. This allows us to extract local geometrical features bypassing the more challenging steps of 3D reconstruction: global matching and gradient field integration [zhang2013handbook, di2018one, ghiglia1998unwrapping, hartley2003multiple].
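As a hedged illustration of the kind of relation described above, the standard fringe-projection geometry can be written out as follows. The symbol names are assumptions introduced here ($b$ baseline, $f$ focal length, $h$ local surface height, $D$ subject-to-camera distance, $d$ disparity, $\Delta$ shift measured on the reference plane), the projector and the camera are assumed to lie at the same distance from the reference plane, and the exact form of Equation (12) may differ from this sketch.

```latex
% Assumed notation: b baseline, f focal length, h local surface height,
% D subject-to-camera distance, d disparity, \Delta shift on the reference plane.
\begin{align}
  \frac{\Delta}{b} &= \frac{h}{D-h}
    && \text{(similar triangles with apex at the surface point)} \\
  \frac{d}{f} &= \frac{\Delta}{D}
    && \text{(pinhole projection onto the sensor)} \\
  d &= \frac{f\,b\,h}{D\,(D-h)} \;\approx\; \frac{f\,b}{D^{2}}\,h
    && \text{(assuming } h \ll D\text{)} \\
  \nabla d &\approx \frac{f\,b}{D^{2}}\,\nabla h
    && \text{(the setup only contributes a global scale)}
\end{align}
```

Under these assumptions the gradient of the disparity is proportional to the gradient of the surface height, which is why the setup-dependent constants only rescale the local features.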
Appendix C Gradient Information is Easy to Compute, Absolute Depth is Hard.
One of the key ideas of the presented approach is to estimate local depth descriptors instead of absolute depth information. The former is an easier task, and absolute information is irrelevant for the sake of feature extraction. After all, a robust feature representation should be scale and translation invariant.
More precisely, we can define the problem of integrating the absolute depth from an empirical estimation of its gradient map in a 2D domain as the optimization problem:
The solution of Equation (13) is nontrivial. Noise, shadows, and facial discontinuities produce empirical gradient estimations with rotational (non-conservative) components, i.e., for some pixels the curl of the field is different from zero. Therefore, the space of target functions and the minimization norm must be carefully set in order to achieve a meaningful solution to Equation (13). This is a complex mathematical problem that has been extensively studied in the literature [agrawal2005algebraic, agrawal2006range, du2007robust, reddy2009enforcing, tumblin2005want].
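The non-integrability issue can be illustrated numerically. The sketch below is only an illustration under assumed conventions (forward differences, a synthetic Gaussian bump as depth, additive Gaussian noise); it is not part of the proposed pipeline. It shows that the discrete curl of an exact gradient field vanishes, while a noisy estimate of the same field has a nonzero curl and therefore cannot be the gradient of any depth map.

```python
import numpy as np

def forward_gradient(z):
    """Forward differences along x (columns) and y (rows)."""
    gx = z[:, 1:] - z[:, :-1]    # shape (H, W-1)
    gy = z[1:, :] - z[:-1, :]    # shape (H-1, W)
    return gx, gy

def discrete_curl(gx, gy):
    """Discrete curl d(gy)/dx - d(gx)/dy, shape (H-1, W-1)."""
    return (gy[:, 1:] - gy[:, :-1]) - (gx[1:, :] - gx[:-1, :])

rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]
z = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / 200.0)   # smooth synthetic depth

gx, gy = forward_gradient(z)
curl_clean = discrete_curl(gx, gy)                        # zero up to rounding
curl_noisy = discrete_curl(gx + 0.01 * rng.standard_normal(gx.shape),
                           gy + 0.01 * rng.standard_normal(gy.shape))
```

Integration schemes must decide how to handle the nonzero curl in `curl_noisy`, which is one reason the problem is harder than estimating the gradient itself.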
Appendix D Implementation details
D.1 Light projection
From ground truth depth and texture facial information, images under the projection of different high frequency patterns can be simulated, as illustrated in Figure 18. We model the physical configuration of the system as described in Section B, and different parameters for the baseline and the fringe width were tested. We assumed a fixed focal length in all our experiments. The pseudo-code for the generation of samples under active illumination is summarized in Algorithm 2. It is important to highlight that, though the problem of simulating the pattern deformation from the depth is easy, the inverse problem is very hard. This is why we design a DNN-based approach that estimates, from the deformation of the pattern, the gradient of the depth rather than the depth itself (see Section 3).
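The forward simulation can be sketched as below. This is a minimal illustration and not a transcription of Algorithm 2: the sinusoidal stripe profile, the small-height disparity approximation, the function name, and all parameter names and default values are assumptions made for the example.

```python
import numpy as np

def simulate_fringes(texture, height, period=8.0, baseline=1.0,
                     focal=1.0, distance=10.0):
    """Render a texture lit by horizontal stripes shifted by the local disparity.

    texture: (H, W) or (H, W, 3) array in [0, 1]; height: (H, W) surface height.
    All parameter names and defaults are illustrative assumptions.
    """
    texture = np.asarray(texture, dtype=float)
    height = np.asarray(height, dtype=float)
    rows = np.arange(height.shape[0], dtype=float)[:, None]
    # Disparity (in pixels) under the assumed small-height approximation.
    disparity = focal * baseline * height / distance ** 2
    # Horizontal sinusoidal stripes, displaced vertically by the disparity.
    pattern = 0.5 * (1.0 + np.cos(2.0 * np.pi * (rows + disparity) / period))
    if texture.ndim == 3:
        pattern = pattern[..., None]   # broadcast over color channels
    return texture * pattern
```

With a flat surface the stripes stay horizontal; any height variation bends them, which is the deformation the network later decodes.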
D.2 Networks architecture
Table 5 illustrates the architecture of the network that performs texture and depth information decomposition (as described in Section 3). Table 6 illustrates the architecture of the layers trained for facial feature extraction (illustrated as yellow/green blocks in Figure 3). Each module of the proposed framework is implemented using standard TensorFlow layers (version 1.13).
|Input:||image with proj. fringes.|
|(A) RGB - branch|
|Layer A.1:||Conv - kernels , BN, LeakyRelu.|
|Layer A.2:||Conv - kernels , BN, LeakyRelu.|
|Layer A.3:||Conv - kernels , BN, LeakyRelu.|
|Layer A.4:||Conv - kernels , BN, LeakyRelu.|
|Output A:||recovered texture.|
|(B) Depth - branch|
|Layer B.1:||Average(axis=3) (convert to gray).|
|Layer B.2:||Conv - kernels , BN, LeakyRelu.|
|Layer B.3:||Conv - kernels , BN, LeakyRelu.|
|Layer B.4:||Conv - kernels , BN, LeakyRelu.|
|Layer B.5:||Conv - kernels , BN, LeakyRelu.|
|Output B:||recovered .|
Conv denotes a convolution layer and BN batch normalization. Standard TensorFlow (v1.13) layers are used.
|Input:||( for texture, for depth image).|
|Layer 1.1:||Conv - kernels stride , IN, Relu.|
|Layer 1.2:||Conv - kernels stride , IN, Relu.|
|Layer A1.1:||Conv - kernels stride 2, IN.|
|Layer B1.1:||SepConv - kernels , IN, Relu.|
|Layer B1.2:||SepConv - kernels , IN, Relu.|
|Layer B1.3:||MaxPooling .|
|Layer 2:||Path A1 + Path B1 (output B1.3 + output A1.1)|
|Layer A2.1:||Conv - kernels stride 2, IN.|
|Layer B2.1:||SepConv - kernels , IN, Relu.|
|Layer B2.2:||SepConv - kernels , IN, Relu.|
|Layer B2.3:||MaxPooling .|
|Layer 3:||x = Path A2 + Path B2 (output B2.3 + output A2.1)|
|Middle flow:||x: (Repeat 2 times)|
|Layer 4.1:||SepConv - kernels , IN, Relu.|
|Layer 4.2:||SepConv - kernels , IN, Relu.|
|Layer 4.3:||SepConv - kernels , IN, Relu.|
|Layer 4.4:||Add(x, output Layer )|
|Layer A5.1:||Conv - kernels stride 2, IN.|
|Layer B5.1:||SepConv - kernels , IN, Relu.|
|Layer B5.2:||SepConv - kernels , IN, Relu.|
|Layer B5.3:||MaxPooling .|
|Layer 6:||Path A5 + Path B5 (output B5.3 + output A5.1)|
|Layer 7:||SepConv - kernels , IN, Relu.|
|Layer 8:||SepConv - kernels , IN, Relu.|
|Layer 9:||Global Average Pooling 2D.|
The training procedure consists of three phases: first we perform 10 epochs using stochastic gradient descent (SGD), then we iterate 20 additional epochs using the Adam optimizer, and finally we perform 10 epochs using SGD. During these phases the learning rate is set to. These three steps are commonly referred to in the literature as “network warmup”, “training”, and “fine-tuning”. We observed that training each framework module following this protocol leads to stable and satisfactory results (as reported in Section 4). However, we did not focus in the present work on the optimization of the network architectures or the training protocols.
D.3 Simulation of spoofing attacks
Spoofing attacks are simulated to test face recognition models, in particular, how robust these frameworks are under (unseen) spoofing attacks. Since in the present work we focus on the combination of texture and depth based features, the simulation of spoofing attacks must account for realistic texture and depth models. Algorithm 3 summarizes the main steps for the simulation of spoofing attacks (which are detailed next).
To simulate realistic texture conditions, the CASIA dataset is analyzed [zhang2012face]. This dataset contains samples of videos collected under diverse spoofing attacks, e.g., video and photo attacks. 50 different subjects participated in the data collection, and 600 video clips were collected. We extracted a subset of random frames from these videos (examples are illustrated in Figure 19), and used them to identify certain texture properties that characterize spoofing attacks, as we explain next.
Several works have been published in recent years supporting that certain texture patterns are characteristic of spoofing photographs [Atoum2017, Li2018, li2004live, Li2017spoofing, Liu2018spoofing, yeh2017face]. In particular, differences in the Fourier domain have been reported [li2004live], which provide cues for certain classes of spoofing attacks [yeh2017face]. Following these ideas, we propose a simple model for the synthesis of spoofing attacks from genuine samples extracted from the ND-2006 dataset. We assume a generic linear model , where represents the texture of the simulated spoofing attack for the genuine sample and an arbitrary kernel that will be set.
Let us denote as () a set of facial images associated with spoofing attacks (here extracted from the CASIA dataset), and () a set of genuine facial images (for example, from the CASIA or ND-2006 datasets). As in Section 3, we denote the 2D discrete Fourier transform of . Given ground truth examples of genuine and spoofing facial images ( and ), we define the kernel as
where and are defined as follows
We set for numerical stability. Observe that and account only for the absolute value of the Fourier transform of real and spoofing samples, while the phase information is discarded. In principle, an arbitrary phase factor can be included; we constrain the solution to be a symmetric kernel, which is equivalent to enforcing a null phase. To perform the average defined in Equation (15), the 2D coordinates associated with the frequency domain must be referred to a common coordinate frame; this allows us to aggregate the frequency information of images of heterogeneous resolution.
Figure 20-(a) shows the kernel obtained using 600 spoofing samples from the CASIA dataset and 600 genuine samples from ND-2006. As we can see, the resulting kernel is composed of predominantly positive values, which suggests that applying it will essentially produce a blurred version of the original image. This empirical result is in accordance with the findings reported in [li2004live, yeh2017face]. Figure 20-(b) shows an example of an image of a genuine face (left), the result of applying the estimated kernel (center), and the difference between them (right).
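One plausible reading of this construction is sketched below. It is an assumption-laden illustration rather than the exact definition of the kernel: the frequency response is taken as the ratio between the average Fourier magnitudes of spoofing and genuine samples, the stabilizing constant `eps` is a hypothetical value, and all images are assumed to be grayscale and already resampled to a common resolution (which sidesteps the common-coordinate-frame step).

```python
import numpy as np

def mean_fourier_magnitude(images):
    """Average |DFT| over a stack of same-sized grayscale images (phase discarded)."""
    stack = np.asarray(images, dtype=float)
    return np.abs(np.fft.fft2(stack, axes=(-2, -1))).mean(axis=0)

def estimate_spoof_response(spoof_images, genuine_images, eps=1e-6):
    """Zero-phase frequency response: ratio of average spectra (assumed form)."""
    s = mean_fourier_magnitude(spoof_images)
    g = mean_fourier_magnitude(genuine_images)
    return s / (g + eps)             # eps is a hypothetical stabilizer

def apply_spoof_response(image, response):
    """Filter a genuine image in the Fourier domain; equivalent to a symmetric kernel."""
    spectrum = np.fft.fft2(np.asarray(image, dtype=float))
    return np.real(np.fft.ifft2(spectrum * response))
```

Because the response is real and built from magnitudes only, the equivalent spatial kernel is symmetric, which matches the null-phase constraint described above.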
Finally, the depth profile of different types of spoofing attacks is simulated. To this end, different kinds of polynomial forms were evaluated. We simulated planar attacks, which take place when the attack is deployed using a phone or a tablet, and non-planar attacks, which are common when the attacker uses a curved printed portrait. We modeled the latter as random parabolic 3D shapes. The principal axis of each surface is randomly oriented, presenting an angle with respect to the vertical (as illustrated in Figure 21). The depth profile z(x,y) is given by
where , is a constant representing the width of the spatial domain (here 480 pixels),, and (also a random variable) sets the offset of the principal axis and is sampled from the distribution . We also tested arbitrary polynomial forms with random coefficients, e.g.,
where the coefficients are sampled from a uniform distribution, and is a normalization factor. Figure 21 (right side) illustrates some examples of the generated spoofing depth profiles.
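The planar and parabolic profiles can be sketched as follows. This is an illustration only: the parabolic-cylinder form, the curvature parameter, and the sampling ranges for the angle and the offset are assumptions chosen for the example, not the distributions used in the experiments.

```python
import numpy as np

def planar_profile(size=480, depth=0.0):
    """Constant depth, as produced by a phone or tablet screen."""
    return np.full((size, size), float(depth))

def parabolic_profile(size=480, theta=0.0, offset=0.0, curvature=1.0):
    """Parabolic cylinder whose principal axis makes an angle theta (radians)
    with the vertical; offset shifts the axis away from the image center."""
    coords = np.arange(size, dtype=float) - (size - 1) / 2.0
    x, y = np.meshgrid(coords, coords)     # x along columns, y along rows
    # Signed distance to the principal axis, normalized by the domain width.
    u = (x * np.cos(theta) - y * np.sin(theta) - offset) / size
    return curvature * u ** 2

def random_parabolic_profile(rng, size=480):
    """Draw a random orientation and offset (both ranges are assumed)."""
    theta = rng.uniform(-np.pi / 4, np.pi / 4)
    offset = rng.normal(0.0, size / 8.0)
    return parabolic_profile(size, theta, offset)
```

For example, `random_parabolic_profile(np.random.default_rng(0))` returns one bent-portrait depth map on a 480 by 480 grid.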
D.4 Facial appearance under different illumination
Using the original texture and depth facial information, the appearance of the face under novel illumination conditions can be simulated. One of the simplest and most widely accepted models treats the facial surface as a Lambertian surface [basri2003lambertian, shashua1997, Zhou2007]. Hence, the amount of light reflected can be estimated as proportional to the cosine of the angle between the local surface normal and the direction from which the light approaches the surface.
where denotes the intensity of the light reflected, the intensity of the incident light, represents the surface albedo, is a unit vector normal to the surface, and a unit vector indicating the direction in which the light rays approach the surface at . If the depth profile is known, the normal vector to the surface at can be computed as
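The Lambertian model described above can be sketched numerically as follows. The function names are hypothetical, the normals are obtained with finite differences via `numpy.gradient`, the light vector is assumed to point from the surface toward the source, and clamping negative cosines to zero (attached shadows) is a common addition that the text does not state explicitly.

```python
import numpy as np

def surface_normals(z):
    """Unit normals of the surface (x, y, z(x, y)) from finite differences."""
    zy, zx = np.gradient(np.asarray(z, dtype=float))   # d/drow, d/dcolumn
    n = np.stack([-zx, -zy, np.ones_like(zx)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def lambertian_render(albedo, z, light_dir, intensity=1.0):
    """Reflected intensity: intensity * albedo * max(0, <normal, light>)."""
    light = np.asarray(light_dir, dtype=float)
    light = light / np.linalg.norm(light)              # unit vector toward the light
    shading = np.clip(surface_normals(z) @ light, 0.0, None)
    return intensity * np.asarray(albedo, dtype=float) * shading
```

A flat surface lit head-on returns the albedo scaled by the incident intensity, and tilting the light by 45 degrees scales the result by the cosine of that angle.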
Appendix E Hardware implementation: a feasibility study
We tested a potential hardware implementation of the proposed ideas using commercially available hardware. To project the fringe pattern we used an EPSON 3LCD projector with native resolution , and to capture the images we used a Logitech C615 webcam with native resolution . The proposed ideas are independent of the features of this particular hardware setup. One could choose, for example, to project infrared light, or design a specific setup to meet specific deployment conditions.
Figure 23 shows some empirical results obtained by photographing a mannequin at different relative positions and under different illumination. We qualitatively tested how the width of the projected pattern affects the estimation performance. In addition, we tested (see Figure 24) how the distance of the face to the camera-projector system affects the prediction of the depth derivative. These two experiments are related, as changing the distance to the camera changes the fringe width perceived by the camera.
It is important to take into account that in these experiments we tested our pre-trained models (completely agnostic to this particular hardware implementation), i.e., no fine-tuning or calibration was performed prior to capturing the images presented in Figures 23 and 24. In production settings, in contrast, one would fix a specific hardware setup, collect new ground truth data, and fine-tune the models to optimize the setup at hand.