Differential 3D Facial Recognition: Adding 3D to Your State-of-the-Art 2D Method

04/03/2020 ∙ by J. Matias Di Martino, et al.

Active illumination is a prominent complement to enhance 2D face recognition and make it more robust, e.g., to spoofing attacks and low-light conditions. In the present work we show that it is possible to adopt active illumination to enhance state-of-the-art 2D face recognition approaches with 3D features, while bypassing the complicated task of 3D reconstruction. The key idea is to project over the test face a high spatial frequency pattern, which allows us to simultaneously recover real 3D information plus a standard 2D facial image. Therefore, state-of-the-art 2D face recognition solutions can be transparently applied, while complementary 3D facial features are extracted from the high-frequency components of the input image. Experimental results on the ND-2006 dataset show that the proposed ideas can significantly boost face recognition performance and dramatically improve the robustness to spoofing attacks.


1 Introduction

Two-dimensional face recognition has become extremely popular as it can be ubiquitously deployed and large datasets are available. In the past several years, tremendous progress has been achieved in making 2D approaches more robust and useful in real-world applications. Though 2D face recognition has surpassed human performance in certain conditions, challenges remain to make it robust to facial poses, uncontrolled ambient illumination, aging, low-light conditions, and spoofing attacks [Ding2016, kemelmacher2016megaface, Nech2017, taigman2014deepface]. In the present work we address some of these issues by enhancing the captured RGB facial image with 3D information as illustrated in Figure 1.

High-resolution cameras have become ubiquitous, yet 2D face recognition only needs a facial image of moderate or low resolution. For example, the front cameras of the latest phones have very high resolution, while the input to most face recognition systems is limited to a much lower resolution [arcface, parkhi2015deep, schroff2015facenet, taigman2014deepface, Zulqarnain2018]. This means that, in the context of face recognition, we are drastically underutilizing the resolution of the captured images. We propose to use the discarded portion of the spectrum to extract real 3D information by projecting a high-frequency light pattern. Hence, a low-resolution version of the RGB image remains approximately invariant, allowing the use of standard 2D approaches, while 3D information is extracted efficiently from the local deformation of the projected pattern.

Figure 1: Real 3D face recognition is possible from a single RGB image if a high frequency pattern is projected. The low frequency components of the captured image can be fed into a state-of-the-art 2D face recognition method, while the high frequency components encode local depth information that can be used to extract 3D facial features. It is important to highlight that, in contrast with most existing 3D alternatives, the proposed approach provides real 3D information, not a 3D hallucination from the RGB input. As a result, state-of-the-art 2D face recognition methods can be enhanced with real 3D information.

The proposed solution to extract 3D facial features has key differences with the two common approaches presented in the existing literature: 3D hallucination [eigen2014depth, Huber2015, Liu2015, Pini2018] and 3D reconstruction [Zafeiriou2013, Zou2005]. We discuss these differences in detail in the following section. We illustrate the main limitation of 3D hallucination in the context of face recognition in Figure 2, which emphasizes the lack of real 3D information in a standard RGB input image. We demonstrate that it is possible to extract actual 3D facial features while bypassing the ill-posed problem of explicit depth estimation. Our contributions are summarized as follows:

  • Analyzing the spectral content of thousands of facial images, we design a high frequency light pattern that simultaneously allows us to retrieve a standard low-resolution 2D facial image plus a 3D gradient facial representation.

  • We propose an effective and modular solution that achieves 2D and 3D information decomposition and facial feature extraction in a data-driven fashion (bypassing a 3D facial reconstruction).

  • We show that by defining an adequate distance function in the space of the feature embedding, we can leverage the advantages of both 2D and 3D features. We can transparently exploit existing state-of-the-art 2D methods and improve their robustness, e.g., to spoofing attacks.

2 Related Work

Recognizing or validating the identity of a subject from a 2D color photograph is a longstanding problem in computer vision and has been studied extensively for over forty years [kaya1972basic, zhao2003face]. Recent advances in machine learning, and in particular the success of deep neural networks, reshaped the field and yielded more efficient, accurate, and reliable 2D methods such as ArcFace [arcface], VGG-Face [parkhi2015deep], DeepFace [taigman2014deepface], and FaceNet [schroff2015facenet].

In spite of this, spoofing attacks and variations in pose, expression, and illumination are still active challenges, and significant efforts are being made to address them [Cao2018, Hayat2017, He2018, Kumar2018, Lezama2017, Liu2018, Tran2017, Yu2017, Zhao2018, Zou2005]. For example, Deng et al. [Deng2018] attempt to handle large pose discrepancies between samples; to that end, they propose an adversarial facial UV-map completion GAN. Complementing approaches that seek robust feature representations, several works propose more robust loss and metric functions [Liu2017, Wang2018].

3D hallucination from single RGB.

To enhance 2D approaches, a common trend is to hallucinate a 3D representation from the input RGB image, which is then used to extract 3D features [blanz2003face, Dou2017, eigen2014depth, Huber2015, Liu2015, Pini2018]. For example, Cui et al. [Cui2018] introduce a cascade of networks that simultaneously recover depth from an RGB input while seeking separability of individual subjects. The estimated depth is then used as a modality complementary to RGB.

Figure 2: Illustration of three different 3D surfaces that look identical from a monocular view (a single RGB image). On top, three surfaces (a), (b) and (c) are simulated; (a) and (c) are flat, while (b) is the 3D shape of a test subject. We use classic projective geometry [hartley2003multiple] to simulate the image obtained when photographing (a), (b) and (c), respectively. The resulting images are shown at the bottom. As this simple example illustrates, the relation between images and 3D scenes is not bijective, and the problem of 3D hallucination is ill-posed. To overcome this, 3D hallucination methods enforce strong priors on the geometry of the scene. This is why we argue that these methods do not add actual 3D information to the face recognition task. (A complementary example is presented in Figure 15 in the supplementary material.)

3D face recognition.

The approaches described above share an important practical advantage that is at the same time their weakness: they extract all the information from a standard 2D (RGB) photograph of the face. As depicted in Figure 2, a single image does not contain actual 3D information. To overcome this intrinsic limitation, different ideas have been proposed, and datasets with 3D facial information are becoming more popular [Zulqarnain2018]. For example, Zafeiriou et al. [Zafeiriou2013] propose a four-light-source photometric stereo (PS) setup. A similar idea is elaborated by Zou et al. [Zou2005], who propose to use active near-infrared illumination and combine a pair of input images to extract an illumination-invariant face representation.

Despite the previously mentioned techniques, performing a 3D facial reconstruction is still a challenging and complicated task. Many strategies have been proposed to tackle this problem, including time-delay-based [marks1992system], image-cue-based [eigen2015predicting, eigen2014depth, laina2016deeper, prados2006shape, saxena2009make3d], and triangulation-based methods [ayubi2010pulse, DiMartino2015one, li2014some, Rosman2016, zhang2006high]. Despite significant recent progress, available technology for 3D scanning is still too complicated to be ubiquitously deployed [di2018one, hartley2003multiple, zhang2010recent, zhang2013handbook].

The proposed solution has two key features that make it, to the best of our knowledge, different from existing alternatives. (a) Because the projected pattern has a high spatial frequency, we can recover a standard (low-resolution) RGB facial image that can be fed into state-of-the-art 2D face recognition methods. (b) We avoid the complicated task of 3D facial reconstruction and instead extract local 3D features from the local deformation of the projected pattern. In that sense, our ideas can be implemented on top of existing and future 2D solutions. In addition, our approach is different from those that hallucinate 3D information. As discussed before and illustrated in Figure 2, hallucination requires a strong prior on the scene, which is ineffective, for example, when a spoofing attack is presented (see the example provided in Figure 15 in the supplementary material).

3 Proposed Approach

Notation.

We denote the space of images with a given number of pixels and color channels, and the space of n-dimensional column vectors (in the context of this work, associated with a facial feature embedding). We distinguish the set of RGB images (three channels) from the space of two-channel images associated with the gradient of a single-channel image. (The first/second channel represents the partial derivative with respect to the first/second coordinate.)

Figure 3: Architecture overview. First, a network (illustrated in blue) decomposes the input image, which contains overlapped high frequency fringes, into a lower-resolution (standard) facial texture image and depth gradient information. The former is used as the input of a state-of-the-art 2D face recognition DNN (yellow blocks). The depth information is fed to another network (green blocks) trained to extract discriminative (depth-based) facial features. Different network architectures are tested; implementation details are provided in Section D of the supplementary material.

Combining depth and RGB information.

The proposed approach consists of three main modules, as illustrated in Figure 3: the first decomposes the input image into texture and depth information, while the other two extract facial features associated with the facial texture and the depth, respectively. These three components are illustrated in Figure 3 in blue, yellow, and green, respectively. (We decided to have three modules instead of a single end-to-end design for several reasons that are discussed below.)

We denote the facial feature extraction from the input image as a parametric mapping from the image to the embedding space. The parameters of this mapping can be decomposed into three groups, associated with the image decomposition, the RGB feature extraction, and the depth feature extraction, respectively. In the following we discuss how these parameters are optimized for each specific task, which is one of the advantages of formulating the problem in a modular fashion.

Once texture and depth facial information is extracted into a suitable vector representation (as illustrated in Algorithm 1), we can select a distance measure to compare facial samples and estimate whether they have a high likelihood of belonging to the same subject. It is worth noticing that faces are embedded into a space in which the first half of the dimensions is associated with information extracted from the RGB representation, while the other half encodes depth information. These two sources of information may carry different confidence levels (depending on the conditions at deployment). We address this in detail in Section 3.3, where we propose an anisotropic distance adapted to our solution, capable of leveraging the good performance of 2D solutions in favorable conditions while improving robustness and handling spoofing attacks in a continuous and unified fashion.

1:procedure FacialEmbedding()
2:Decompose the input image into texture and depth gradient information.
3:     
4:Extract facial information from each component.
5:     
6:     
7:Combine texture and depth information.
8:      Concatenate
9:     return Facial embedding
10:end procedure
Algorithm 1 Compute 2D facial features enhanced with 3D information.
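To make Algorithm 1 concrete, here is a minimal sketch in Python/NumPy. The callables decompose, embed_texture, and embed_depth stand in for the blue, yellow, and green modules of Figure 3; these names (and the dimensionalities in the comments) are placeholders, not part of any released code.

```python
import numpy as np

def facial_embedding(image, decompose, embed_texture, embed_depth):
    """Sketch of Algorithm 1: 2D facial features enhanced with 3D information.

    `decompose`, `embed_texture`, and `embed_depth` are callables standing in
    for the decomposition network and the two feature-extraction networks.
    """
    # Decompose the input image into texture and depth-gradient information.
    texture, depth_grad = decompose(image)

    # Extract facial information from each component.
    f_rgb = embed_texture(texture)      # e.g., 512-dimensional vector
    f_depth = embed_depth(depth_grad)   # e.g., 512-dimensional vector

    # Combine texture and depth information into a single embedding.
    return np.concatenate([f_rgb, f_depth])  # e.g., 1024-dimensional
```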

3.1 Pattern design.

When a pattern of light is projected over a surface with a given height map, it is perceived by the camera with a deformation determined by the local depth. A detailed description of active stereo geometry is provided in Section B of the supplementary material. Let us denote the image we would acquire under homogeneous illumination and the intensity profile of the projected light. Without loss of generality, we assume the system baseline is parallel to one of the image axes. The image acquired by the camera when the projected light is modulated with this profile is

(1)

We restrict ourselves to periodic modulation patterns; the pattern spatial period defines the corresponding fundamental frequency. To simplify the system design and analysis, let us also restrict to periodic patterns that are invariant along one of the image coordinates. Under these conditions, the pattern can be expanded as a Fourier series whose coefficients are constant rather than functions of the invariant coordinate. Equation (1) can then be expressed as

(2)

Defining , Equation (2) can be expressed as [takeda1983fourier]

(3)

Applying the 2D Fourier Transform (FT) to both sides of Equation (3) and using standard properties of the FT [distributions], we obtain

(4)

We denote the FT of the acquired image accordingly, and use the corresponding frequency variables to represent the 2D frequency domain associated with the two image axes.

Figure 4: 2D plus real 3D in a single RGB image. The first column illustrates the RGB image acquired by a (standard) camera when horizontal stripes are projected over the face. The second column isolates the low frequency components of the input image, and the third column corresponds to the residual high frequency components. (In all cases the absolute value of the Fourier transform is shown in logarithmic scale.) As can be seen, high frequency patterns can be used to extract 3D information of the face (third column) while preserving a lower resolution version of the facial texture (middle column).

Equation (4) shows that the FT of the acquired image can be decomposed into components centered at integer multiples of the pattern's fundamental frequency. In the context of this section, we say a function is smooth if

(5)

Assuming the texture and deformation terms are smooth (we empirically validate this hypothesis below), these components can be isolated as illustrated in Figure 4. The central component is of particular interest: it captures the facial texture information and can be recovered provided the fundamental frequency of the pattern is large enough (a more precise quantitative analysis follows). On the other hand, relative (gradient) 3D information can be retrieved from the side components, as we show in Proposition 1.
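As a concrete illustration of this spectral separation, the sketch below builds a fringe-modulated image and isolates its low-frequency (texture) and high-frequency (side) components with an FFT. The sinusoidal pattern, the Gaussian low-pass mask, and the cutoff value are illustrative assumptions; the paper learns this decomposition in a data-driven fashion instead of filtering analytically.

```python
import numpy as np

def fringe_image(texture, period=8):
    """Simulate a grayscale texture modulated by horizontal fringes of a given period (pixels)."""
    h, w = texture.shape
    y = np.arange(h)[:, None]
    pattern = 0.5 * (1.0 + np.cos(2 * np.pi * y / period))  # sinusoidal fringes
    return texture * pattern

def split_spectrum(image, cutoff=0.1):
    """Split an image into low- and high-frequency components via the FFT.

    `cutoff` is the normalized radius of an illustrative Gaussian low-pass filter.
    """
    F = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    lowpass = np.exp(-(fx**2 + fy**2) / (2 * cutoff**2))
    low = np.real(np.fft.ifft2(np.fft.ifftshift(F * lowpass)))
    high = image - low
    return low, high  # low ~ texture component, high ~ fringe (depth-encoding) component
```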

Proposition 1.

Gradient depth information is encoded in the high-frequency side components.

Proof.

We define the wrapping function, which wraps the real line into the interval (-π, π] [ghiglia1998unwrapping]. This definition can be extended to vector inputs by wrapping the modulus of the vector field while keeping its direction unchanged. From the isolated spectral components we can compute¹ (¹We assume images are extended in an even fashion outside the image domain, to guarantee the required symmetry and avoid an additional offset term.)

(6)

where the right-hand side denotes the wrapped version of the deformation term; the wrapped and unwrapped quantities differ by shifts whose magnitude is a multiple of 2π. Computing the gradient on both sides leads to an equality up to the gradient of these shifts. Assuming the magnitude of the gradient of the deformation term is bounded, we can apply the wrapping function to both sides of the previous equality to obtain, recalling Equation (6), that the gradient of the deformation can be extracted from the side components. To conclude the proof, we use the linearity of the gradient operation and the fact that the deformation is proportional to the depth map of the scene (see Equation (12) and Section B in the supplementary material). ∎
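A minimal NumPy sketch of the wrapping operator and of the wrapped-difference trick used in the proof is given below. Wrapping into (-π, π] via the complex exponential is the standard formulation [ghiglia1998unwrapping]; the forward finite differences are only an illustrative stand-in for the continuous gradient, and the function names are ours.

```python
import numpy as np

def wrap(phi):
    """Wrap values (or arrays) of phase into the interval (-pi, pi]."""
    return np.angle(np.exp(1j * phi))

def wrapped_gradient(phi_wrapped):
    """Estimate the gradient of the underlying (unwrapped) quantity.

    Valid where the true gradient magnitude stays below pi per pixel,
    which mirrors the boundedness assumption used in Proposition 1.
    """
    dy = wrap(np.diff(phi_wrapped, axis=0))  # forward differences along rows
    dx = wrap(np.diff(phi_wrapped, axis=1))  # forward differences along columns
    return dy, dx  # re-wrapping removes the 2*pi jumps introduced by wrapping
```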

Figure 5: Average spectral content of faces. The first column illustrates the mean luminance and depth map for the faces in the ND-2006 dataset. The second column shows the mean Fourier transform of the faces' luminance and depth, respectively. The third column shows the profile across different sections of the 2D Fourier domain. Columns two and three represent the absolute value of the Fourier transform in logarithmic scale. Faces are registered using the eye landmarks and normalized to a common size in pixels.

Analytic versus data-driven texture and gradient depth extraction.

The previous analysis shows that closed forms can be obtained to extract texture and depth gradient information. However, computing these expressions requires isolating the different spectral components. To that end, filters need to be carefully designed, which is challenging: for example, one needs to balance over-smoothing against introducing ringing artifacts, which are drastically amplified by a subsequent gradient computation [DiMartino2015one, zhang2006high]. To overcome these challenges, we chose to perform the depth (gradient) and texture decomposition in a data-driven fashion, which, as we show in Section 4, provides an efficient and effective solution.

Bounds on and optimal spectral orientation.

As discussed above, the projected pattern should have a large fundamental frequency. In addition, the orientation of the fringes and of the system baseline could be optimized if faces presented narrower spectral content in a particular direction. We study the texture and depth spectra of the facial images of the ND-2006 dataset (which provides ground truth facial texture and depth information). We observed (see Figure 5) that most of the energy is concentrated in a third of the discrete spectral domain (observe the one-dimensional profiles of the spectrum extracted in Figure 5). In addition, we observe that the spectral content of facial images is approximately isotropic: for one-dimensional sections across different orientations, the envelope of the 2D spectrum is almost constant. We conclude that the orientation of the fringes does not play a significant role in the context of facial analysis, and that the fringe width should be no larger than a few millimeters measured over the face.² (²This numerical result is obtained by approximating the bounding box of the face as a region sampled with a fixed number of pixels, which fixes the pixel length; a third of the spectral band corresponds to a signal with a period of a few pixels, which in turn gives the minimum binary fringe width.)
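The spectral study behind Figure 5 can be reproduced with a few lines of NumPy, sketched below under the assumption that the faces are already registered by the eye landmarks and resized to a common resolution; the function names and the choice of sampled orientations are ours.

```python
import numpy as np

def mean_log_spectrum(face_stack):
    """Average log-magnitude spectrum of a stack of registered grayscale faces.

    `face_stack` is an (N, H, W) array of faces registered by the eye landmarks
    and resized to a common resolution.
    """
    spectra = np.abs(np.fft.fftshift(np.fft.fft2(face_stack), axes=(-2, -1)))
    return np.log(spectra + 1e-8).mean(axis=0)

def radial_profiles(spectrum, angles_deg=(0, 45, 90)):
    """Extract 1D sections of the 2D spectrum along a few orientations."""
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    r = np.arange(min(cy, cx))
    profiles = {}
    for a in angles_deg:
        t = np.deg2rad(a)
        ys = np.clip((cy + r * np.sin(t)).astype(int), 0, h - 1)
        xs = np.clip((cx + r * np.cos(t)).astype(int), 0, w - 1)
        profiles[a] = spectrum[ys, xs]
    return profiles
```

If the profiles returned for different angles have a similar envelope, the spectrum is approximately isotropic, which is the observation that lets the fringe orientation be chosen freely.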

3.2 Network training and the advantages of modularity.

As described previously, the parameters of the proposed solution can be split into three groups. This is an important practical property, and we designed the proposed solution to meet this condition (in contrast to an end-to-end approach).

Let us define three datasets containing, respectively, ground truth depth information, ground truth identities for RGB facial images, and ground truth identities for depth facial images. More precisely, the first pairs (facial or generic) RGB images acquired under the projection of the designed pattern with gray images representing the depth of the scene, while the second and third pair standard RGB facial images and depth facial images, respectively, with a scalar integer representing the subject id.

We denote accordingly the RGB and gradient depth components estimated by the decomposition operation. We partition the parameters of the decomposition module into two sets of dedicated kernels: the first group focuses on retrieving the texture component, while the second retrieves the depth gradient. These parameters can be optimized as

(7)
(8)

(We also evaluated training a shared set of kernels with a unified loss; this alternative is harder to train in practice due to the natural difference in dynamic range and sparsity between gradient images and texture images.)
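Equations (7) and (8) optimize the two groups of kernels separately. A plausible instantiation, assuming simple pixel-wise L1 reconstruction losses against the ground truth texture and depth gradient (the exact loss is not restated here), is sketched below in TensorFlow 1.x style; variable names are ours.

```python
import tensorflow as tf  # sketched against TF 1.x semantics

def decomposition_losses(pred_texture, gt_texture, pred_grad, gt_grad):
    """Separate losses for the texture branch (Eq. 7) and the gradient branch (Eq. 8).

    Pixel-wise L1 penalties are an assumption; the paper only specifies that the
    two branches are optimized against their respective ground truth components.
    """
    loss_texture = tf.reduce_mean(tf.abs(pred_texture - gt_texture))
    loss_grad = tf.reduce_mean(tf.abs(pred_grad - gt_grad))
    return loss_texture, loss_grad

# Each loss drives only its own set of kernels, e.g. (texture_vars / grad_vars
# being the two groups of dedicated kernels):
# opt_t = tf.train.GradientDescentOptimizer(1e-3).minimize(loss_texture, var_list=texture_vars)
# opt_g = tf.train.GradientDescentOptimizer(1e-3).minimize(loss_grad, var_list=grad_vars)
```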

For texture and depth facial feature extraction, we tested models inspired by the Xception architecture [chollet2017xception]. Additional details are provided in Section D of the supplementary material. To train these models, we add an auxiliary fully connected layer on top of the facial embedding (with as many neurons as identities in the training set) and minimize the cross-entropy between the ground truth and the predicted labels. More precisely, letting the output of the fully connected layer associated with the embedding denote the predicted probability of each identity,

(9)
(10)

where the indicator function selects the ground truth identity. (Of course, one can choose alternative losses to train these modules; see, e.g., [arcface, Liu2017, Wang2018, Zheng2018].)
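A compact sketch of the auxiliary classification head and cross-entropy objective described above, in TF 1.x style. The 512-dimensional embedding and the auxiliary fully connected layer follow the description in the text and Figure 3; the variable names and the use of the sparse softmax helper are ours.

```python
import tensorflow as tf  # TF 1.x style, matching the framework described in Section D

def classification_head(embedding, num_identities):
    """Auxiliary fully connected layer on top of the facial embedding."""
    logits = tf.layers.dense(embedding, num_identities, name="aux_fc")
    return logits

def identity_loss(logits, labels):
    """Cross-entropy between ground truth identities and predicted labels (Eqs. 9-10)."""
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```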

As described above, the proposed design allows us to leverage information from the three types of datasets. This has an important practical advantage, as 2D facial datasets and generic 3D datasets are more abundant, while the pattern-dependent set can be of modest size.

3.3 Distance design.

Once the different modules are set, we can compute the facial embedding of test images following the procedure described in Algorithm 1. Consider the feature embeddings of two facial images; recall that the first half of the elements of each embedding is associated with features extracted from the (recovered) RGB facial image, while the remaining elements are associated with depth information.

We define the distance between two feature representations as

(11)

The first term is a cosine distance, a weighting parameter sets the relative contribution of RGB and depth features, and two additional parameters define a non-linear response for the distance between depth features. As we describe in the following, this provides robustness against common cases of spoofing attacks.

Intuitively, the weighting parameter allows us to set the relative confidence associated with RGB and depth features: an intermediate value gives the same weight to both, while the extreme values ignore the distance between samples in either the depth or the RGB embedding space. This is important in practice, as it is common to have substantially more data for training RGB models than depth ones, which suggests that in good test conditions (e.g., good lighting) one may trust RGB features more than depth features. As we empirically show in the following section, a weighted combination of the two cosine distances is an effective choice when two facial candidates are compared. However, it does not robustly handle common cases of spoofing attacks. The most common spoofing attacks imitate the facial texture more accurately than the facial depth [OULU_NPU_2017, Liu2018spoofing, Zhang2016]; therefore, the global distance between two samples should be large when the distance of the depth features is large (i.e., above a certain threshold). To that end, we introduce an additional non-linear term controlled by two parameters: below the threshold the standard cosine distance dominates, while for large depth distances the overall distance is amplified in a non-linear fashion.
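The following sketch illustrates the behavior described above: a weighted combination of two cosine distances plus a non-linear term that activates once the depth distance exceeds a threshold. The hinge-squared form of the amplification and the default parameter values are assumptions, not the exact expression of Equation (11).

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def combined_distance(f1, f2, lam=0.5, tau=0.5, gain=10.0):
    """Illustrative version of the anisotropic distance of Section 3.3.

    f1, f2: concatenated embeddings (first half RGB, second half depth).
    lam:    relative weight of RGB vs. depth features.
    tau, gain: threshold and gain of the non-linear depth term (assumed form).
    """
    n = f1.shape[0] // 2
    d_rgb = cosine_distance(f1[:n], f2[:n])
    d_depth = cosine_distance(f1[n:], f2[n:])
    # Below tau the weighted cosine distance dominates; above it, the depth
    # distance is amplified non-linearly to flag likely spoofing attempts.
    penalty = gain * max(0.0, d_depth - tau) ** 2
    return lam * d_rgb + (1.0 - lam) * d_depth + penalty
```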

4 Experiments and Discussion

Data.

Three public datasets are used for experimental validation: FaceScrub [FaceScrub], CASIA Anti-Spoofing [zhang2012face], and ND-2006 [nd2006]. FaceScrub contains RGB (2D) facial images of multiple subjects and is used to train the texture-based facial embedding. The CASIA dataset contains genuine videos (recording a person) and videos of different types of spoofing attacks; we use it to imitate the texture properties of images of spoofing attacks. ND-2006 is one of the largest publicly available datasets with 2D and 3D facial information. We use this set to demonstrate that differential 3D features can be extracted from a single RGB input, to compare RGB features with 3D features extracted from the differential 3D input, and to show that when 2D and 3D information is properly combined, the best properties of each can be obtained.

Texture and differential 3D decomposition.

In Section 3.1 we discussed how real 3D information and texture information can be encoded in, and later extracted from, a single RGB image. In addition, we argued that this decomposition can be learned efficiently and effectively in a data-driven fashion. To that end, we tested simple network architectures composed of standard convolutional layers (a full description of these architectures and of the training protocols is provided in the supplementary material). Using ground truth texture and depth facial information, we simulated the projection of the designed pattern over the subjects of the ND-2006 dataset. Illustrative results are presented in Figure 6 and in the supplementary material. The 3D geometrical model and a detailed description of the simulation process are provided in Section D.1. Though simulating the deformation of a projected pattern is relatively simple (when the depth is known), the inverse problem is analytically hard [DiMartino2015one, Rosman2016, zhang2013handbook].

Figure 6: Active light projection. From left to right: the ground truth RGB facial image, the 3D facial scan, and finally the image we would acquire if the designed high frequency pattern were projected over the face. Two random samples from ND-2006 are illustrated.

Despite this, we observed that a stack of convolutional layers can efficiently learn to infer, from the image with the projected pattern, both the depth gradient information and the standard (2D) facial image. Figure 7 illustrates results for subjects in the test set: the first column corresponds to the input to the network, the second column to the ground truth texture, and the third column to the retrieved texture. The architecture of the network and the training protocol are described in detail in Section D of the supplementary material. As the examples in Figure 7 show, an accurate low-resolution texture representation of the face is achieved in general, and visible artifacts are observed only in regions where the depth is discontinuous (see, for example, the regions highlighted at the bottom of Figure 7).

Figure 7: Examples of the facial texture recovered from the image with the projected pattern. The first column shows the input image (the input of Algorithm 1). The second column shows the ground truth, and the third column the texture recovered by the decomposition network. These examples are from the test set; the images associated with these subjects were never seen during the training phase.

Figure 8 illustrates the ground truth and the retrieved depth gradient (again, for random samples from the test set). To estimate the 3D information, we feed the gray version of the input image to a different branch of convolutional layers; these layers are fully described in Table 5 of the supplementary material. A gray input image is used instead of a color one because the projected pattern is achromatic and, therefore, no 3D information is encoded in the colors of the image. In addition, we crop the input image to exclude the edges of the face. (Facial registration and cropping are performed automatically using dlib [king2009dlib] facial landmarks.) As discussed in Section 3, and in particular in the proof of Proposition 1, the deformation of the projected fringes only provides local gradient information where the norm of the gradient of the depth is bounded. In other words, where the scene presents depth discontinuities, no local depth information can be extracted by our approach. This is one of the main reasons why differential 3D information can be exploited for face recognition while bypassing the more complicated task of a full 3D facial reconstruction.

Figure 8: Differential depth information extracted from the image with the projected pattern. The first row illustrates the input image (depth information can be extracted from a gray version of the input, as the designed pattern is achromatic). The second and third rows show the ground truth and the retrieved partial derivatives of the depth, respectively.

One of the advantages of the proposed approach is that it extracts local depth information; therefore, the existence of depth discontinuities does not affect the estimation on the smooth portion of the face. This is illustrated in Figure 9 (a)-(b), where a larger facial patch is fed into the network. The decomposition module is composed exclusively of convolutional layers, so images of arbitrary size can be evaluated. Figure 9-(a) shows the input to the network, and Figure 9-(b) the first channel of the output (for compactness we display only the x-partial derivative). As we can see, the existence of depth discontinuities does not affect the prediction in the interior of the face (we consider the prediction outside this region as noise and replace it with a constant value for visualization).

Figure 9: Is the network really extracting depth information? In this figure we show the output of the network for two inputs generated using identical facial texture but different ground truth depth data. (a) Image obtained when the pattern is projected over the face with the real texture and the real 3D profile. (b) Output of the network for input (a) (only the x-partial derivative is displayed for compactness). (c) Image obtained when the pattern is projected over a flat surface with the texture of the real face. (d) Output of the network when the input is (c). None of these images were seen during training.

Several algorithms have been proposed to hallucinate 3D information from a 2D facial image [eigen2014depth, Huber2015, Liu2015, Pini2018]. In order to verify that our decomposition network is extracting real depth information (rather than hallucinating it from texture cues), we simulated an image where the pattern is projected over a surface with identical texture but with a planar 3D shape (as in the example illustrated in Figure 2). Figure 9 (a) shows the image acquired when the fringes are projected over the ground truth facial depth, and (c) the image acquired when instead the depth is set to a constant (without modifying the texture information). The first component of the output (x-partial derivative) is shown in (b) and (d); as we can see, the network is actually extracting true depth information (from the deformation of the fringes) and not hallucinating 3D information from texture cues. (As we show next, this property is particularly useful for joint face recognition and spoofing prevention.)

2D and 3D face recognition.

Once the input image is decomposed into a (standard) texture image and depth gradient information, we can extract 2D and 3D facial features from each component. To this end, state-of-the-art network architectures are evaluated. Our method is agnostic to the RGB and depth feature extractors; moreover, as the retrieved texture image is close to a standard RGB facial image (in the sense of the L2 norm), any pre-trained 2D feature extractor can be used (e.g., [arcface, parkhi2015deep, schroff2015facenet, taigman2014deepface, Zulqarnain2018]). In the experiments presented in this section, we tested a network based on the Xception architecture [chollet2017xception] (details are provided in the supplementary material). For the extraction of texture features, the network is trained on the FaceScrub [FaceScrub] dataset (as previously described, a public dataset of 2D facial images). The module that extracts 3D facial features is trained using a subset of the subjects of the ND-2006 dataset, leaving the remaining subjects exclusively for testing. The output of each module is a 512-dimensional feature vector (see, e.g., Figure 3); hence, the concatenation of 2D+3D features leads to a 1024-dimensional feature vector. Figure 10 illustrates a 2D embedding of the texture features, the depth features, and the combination of both. The 2D mapping is learned by optimizing t-SNE [maaten2008visualizing] over the train partition; then a random subset of test subjects is mapped for visualization. As we can see, 3D features favor compactness and increase the distance between clusters associated with different subjects.

Figure 10: Low-dimensional embedding of facial features (for visualization purposes only). We illustrate texture-based and depth-based features in a low-dimensional embedding space. A random set of subjects from the test set is shown. From left to right: the embedding of depth features, of texture-based features, and finally of the combination of texture and depth features. The t-SNE [maaten2008visualizing] algorithm is used for the low-dimensional embedding.

To test the recognition performance, the images of the test subjects are partitioned into two sets: gallery and probe. For all the images in both sets, the 2D and 3D feature embeddings are computed (using the pre-trained networks described above). Then, for each image in the probe set, the nearest neighbors in the gallery set are selected. The distance between samples (in the embedding space) is measured using the distance defined in Section 3, Equation (11). For each sample in the probe set, we consider the classification accurate if at least one of the n nearest neighbors is a sample from the same subject. The Rank-n accuracy is the percentage of samples in the probe set accurately classified.
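The evaluation protocol just described reduces to a nearest-neighbor search in the embedding space. A minimal sketch, assuming precomputed embeddings and an arbitrary pairwise distance function (for example, the combined distance sketched in Section 3.3), is:

```python
import numpy as np

def rank_n_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids, n, dist_fn):
    """Fraction of probe samples whose n nearest gallery neighbors contain the correct subject."""
    gallery_ids = np.asarray(gallery_ids)
    hits = 0
    for feat, pid in zip(probe_feats, probe_ids):
        dists = np.array([dist_fn(feat, g) for g in gallery_feats])
        nearest = np.argsort(dists)[:n]          # indices of the n closest gallery samples
        if np.any(gallery_ids[nearest] == pid):  # correct subject among the neighbors
            hits += 1
    return hits / len(probe_ids)
```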

Figure 11: Rank-n accuracy for 2D, 3D, and 2D+3D face recognition. As discussed in Section 3, the weighting parameter can be set to balance texture and depth information in the classification decision. The extreme cases, texture only and depth only, are illustrated in yellow and blue respectively, while intermediate weightings are presented in tones of green.

Figure 11 and Table 1 show the Rank-n accuracy when only 2D features, only 3D features, or a combination of both is considered. As explained in Section 3.3, the weighting parameter can be used to balance texture and depth features. In all cases, a combination of texture and depth information outperforms each of them individually. This is an expected result, as classification tends to improve when independent sources of information are combined [Kuncheva2004]. The weight is a hyper-parameter that should be set depending on the conditions at deployment. In our experiments, the best results are obtained when RGB features are weighted slightly more than depth features, which suggests that they are slightly more reliable. This is expected, as the module that extracts RGB features is typically trained on much larger datasets (2D facial images have become ubiquitous). We believe this may change if, for example, testing is performed under low-light conditions [Lezama2017]; testing this hypothesis is a potential path for future research. In the experiments discussed so far, we ignored the role of the non-linear parameters (i.e., the non-linear term was disabled). As we discuss in the following, these parameters become relevant to jointly achieve face recognition and spoofing prevention.

Rank-n Accuracy 1 2 5 10
RGB baseline ()
Depth baseline ()
(our)
(our)
(our)
Table 1: Rank-n accuracy for 2D, 3D, and 2D+3D face recognition. As discussed in Section 3, the weighting parameter can be set to balance the impact of texture and depth information; the extreme cases are texture only and depth only.

Robustness to spoofing attacks.

Spoofing attacks are simulated to test face recognition models, in particular how robust these frameworks are to (unseen) spoofing attacks. As the present work focuses on the combination of texture- and depth-based features, the simulation of spoofing attacks must account for realistic texture and depth models. The models for the synthesis of spoofing attacks are described in detail in Section D.3 of the supplementary material.

Figure 12: Examples of samples from live subjects and spoofing attacks. From left to right: (1) the ground truth texture, (2) the ground truth depth, (3) the input to our system (image with the projected pattern), (4) the recovered texture component (one of the outputs of the decomposition network), and (5)/(6) the recovered partial derivatives of the depth. The first four rows correspond to spoofing samples (as explained in Section D.3), and the bottom five rows to genuine samples from live subjects.

Figure 12 illustrates spoofing samples (first four rows) and genuine samples (bottom five rows). The first two columns correspond to the ground truth texture and depth information, the third column illustrates the input to our system, and the last three columns correspond to the outputs of the decomposition network. These last three images are fed into the feature extraction modules for the extraction of texture- and depth-based features, as illustrated in Figure 3. It is extremely important to highlight that spoofing samples are included exclusively at testing time. In other words, during the entire training process the framework is agnostic to the existence of spoofing examples. If the proposed framework is capable of extracting real 3D facial features, it should be inherently robust to the most common types of spoofing attacks.

As discussed before, the combination of texture- and depth-based features improves recognition accuracy. On the other hand, when spoofing attacks are included, we observe that texture-based features are more vulnerable (see, for example, Figures 12 and 14). To simultaneously exploit the best of each feature component, we design a non-linear distance as described in Equation (11). Figure 13 illustrates the properties of this distance for different parameter values. As can be observed, for genuine samples (depth distances below the threshold) the non-linear component can be ignored and the distance behaves like the standard distance with a relative modulation set by the weighting parameter. On the other hand, if the distance between the depth components is above the threshold, it dominates the overall distance, achieving a more robust response to spoofing attacks.

Figure 13: Illustration of the properties of the distance function defined in (11). On the left side we illustrate the role of the non-linear parameters, and on the right we compare the proposed distance with the standard Euclidean distance. As can be observed, both measures are numerically equivalent below the threshold, but the proposed measure gives a higher penalty to vectors whose depth coordinate exceeds the threshold value.

To quantitatively evaluate the robustness against spoofing attacks, spoofing samples are generated for all the subjects in the test set. As before, the test set is separated into a gallery and a probe set, and the generated spoofing samples are added to the probe set. For each image in the probe set, the distance to a sample of the same subject in the gallery set is evaluated; if this distance is below a certain threshold, the image is labeled as genuine, otherwise it is labeled as spoofing. Comparing the classification label with the ground truth label, we obtain the numbers of true positives (genuine classified as genuine), false positives (spoofing classified as genuine), true negatives (spoofing classified as spoofing), and false negatives (genuine classified as spoofing). Changing the value of the threshold controls the trade-off between false positives and false negatives, as illustrated in Figure 14.
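The genuine-versus-spoofing decision is a simple threshold test on the distance. The sketch below sweeps that threshold to trace the false acceptance/false rejection trade-off of Figure 14; the threshold grid and function names are arbitrary choices.

```python
import numpy as np

def far_frr_curve(genuine_dists, spoof_dists, num_thresholds=200):
    """False acceptance and false rejection rates as the decision threshold varies.

    genuine_dists: distances between probe and gallery samples of the same subject.
    spoof_dists:   distances between spoofing probes and the attacked subject's gallery sample.
    """
    genuine_dists = np.asarray(genuine_dists)
    spoof_dists = np.asarray(spoof_dists)
    thresholds = np.linspace(0.0, max(genuine_dists.max(), spoof_dists.max()), num_thresholds)
    far = np.array([(spoof_dists < t).mean() for t in thresholds])    # spoof accepted as genuine
    frr = np.array([(genuine_dists >= t).mean() for t in thresholds]) # genuine rejected
    return thresholds, far, frr
```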

Figure 14: False acceptance rate and false rejection rate under the presence of spoofing attacks. In blue we illustrate the RGB baseline; at the other extreme, the red curve illustrates the performance when only depth features are considered. The combination of RGB and depth features is illustrated in tones of green for different values of the non-linear parameters (the weighting parameter is kept fixed in this experiment).

Figure 14 shows the false positive and false negative rates. As before, the distance between samples is computed using the definition provided in (11); the blue/red curves illustrate the RGB/depth baselines, while the remaining curves (displayed in green tones) correspond to combinations of texture and depth features with different values of the non-linear parameters. In Table 2, the true positive rate is reported for fixed false positive rates. The ACER measure (last column) corresponds to the average of the misclassification rates of spoofing and genuine samples.

TPR% @FPR= TPR% @FPR= ACER %
RGB baseline ()
Depth baseline ()
(our)
(our)
(our)
(our)
(our)
(our)
(our)
(our)
(our)
Table 2: Spoofing detection results. The true positive rate for fixed false positive rates and the ACER measure are reported. Texture and depth facial features are combined using the distance defined in (11). As we can see, the distance parameters can be set to obtain both good face recognition performance and robustness against spoofing attacks.

Testing variations on the ambient illumination.

To test the impact of variations in lighting conditions, we simulated test samples under different ambient illumination; implementation details are described in Section D.4 of the supplementary material. Table 3 compares the rank-5 accuracy of 2D features and 2D+3D features as the power of the ambient illumination increases. As described in the supplementary material, the ambient illumination is modeled with random orientation; therefore, the more powerful the illumination, the more diversity is introduced between the test and the gallery samples.

Rank-5 Accuracy power=100% power=150% power=200%
RGB baseline ()
(our)
Table 3: Recognition accuracy under different ambient illumination conditions. The power of the additional ambient light is provided relative to the power of the projected light, i.e., power=200% means that the added ambient illumination is twice as bright as the projected pattern.

In the present experiments, we assumed that both the projected pattern and the ambient illumination have similar spectral content. In practice, one can project the pattern, e.g., in the infrared band. This would make the system invisible to the user and reduce the sensitivity of 3D features to variations in ambient illumination. We provide a hardware-implementation feasibility study and illustrate how the proposed ideas can be deployed in practice in Section E of the supplementary material.

Improving state of the art 2D face recognition.

To test how the proposed ideas can impact the performance of state-of-the-art 2D face recognition systems, we evaluated our features in combination with texture-based features obtained with ArcFace [arcface]. ArcFace is a powerful method pre-trained on very large datasets; on ND-2006 examples it achieves perfect recognition accuracy (100% rank-1). When ArcFace is combined with the proposed 3D features, the accuracy remains excellent (100% rank-1), i.e., adding the proposed 3D features does not negatively affect robust 2D solutions. On the other hand, 3D features improve ArcFace in challenging conditions, as we discuss in the following. Interesting results are observed when ArcFace is tested under spoofing attacks: as we show in Table 4, ArcFace fails to detect spoofing attacks, but it becomes considerably more robust when combined with 3D features, substantially improving the true positive rate at a fixed false positive rate. In summary, as 2D methods improve and become more accurate, our 3D features do not affect them negatively when they work well, while improving their robustness in challenging situations.

TPR% @FPR= TPR% @FPR= ACER %
ArcFace ()
ArcFace + 3D
Table 4: Spoofing detection results for ArcFace and ArcFace enhanced with 3D features. As in Table 2, the true positive rate for fixed false positive rates and the ACER measure are reported.

5 Conclusions

We proposed an effective and modular alternative to enhance 2D face recognition methods with actual 3D information. A high frequency pattern is designed to exploit the high-resolution cameras ubiquitous in modern smartphones and personal devices. Depth gradient information is encoded in the high-frequency spectrum of the captured image, while a standard facial texture image can be recovered to exploit state-of-the-art 2D face recognition methods. We show that the proposed method can simultaneously leverage 3D and texture information. This allows us to enhance state-of-the-art 2D methods, improving their accuracy and making them more robust, e.g., to spoofing attacks.

Acknowledgments

Work partially supported by ARO, ONR, NSF, and NGA.

References

Appendix A Limitations of 3D hallucination in the context of face recognition.

As discussed in Section 2, 3D hallucination methods have intrinsic limitations in the context of face recognition. To complement the example illustrated in Figure 2, here we show five 3D facial models obtained by hallucinating 3D from a single RGB input image. To that end, we apply a 3D morphable model (extremely popular in facial applications). As we can see, even in the case of a planar spoofing attack, a face-like 3D shape is retrieved, despite being far from the actual 3D shape of the scene. This is an expected result: as discussed in Section 2, the problem of 3D hallucination is ill-posed, and therefore priors need to be enforced in order to obtain feasible implementations.

Figure 15: Example of 3D hallucination [blanz2003face] from photographs of live faces versus portraits. The second row illustrates the 3D morphable model fits for images of live subjects (first and fourth columns) and for photographs of portraits of the same subjects (second, third, and fifth columns). Results obtained using the code from [Huber2015].

Appendix B Review of Active Stereo Geometry

When a structured pattern of light is projected over a surface, it is perceived with a certain deformation depending on the shape of the surface. For example, the top left image in Figure 4 is acquired while horizontal stripes are projected over the subject's face. As we can see, the fringes are no longer perceived as horizontal and parallel to each other; this deformation encodes rich information about the geometry of the scene (which we exploit to enhance 2D face recognition methods). Figure 16 sketches the geometry of this situation as if we were looking at the face from the side (vertical section). On the left, a light source projects a ray of light (red line); when this ray is reflected by a reference plane, it is viewed by the camera at a certain pixel location. If instead the light is projected over an arbitrary (non-planar) surface, the reflection is produced at a different point and viewed by the camera at a shifted position. From now on, we denote the length of this shift as the disparity.

Similarity between the corresponding triangles relates the disparity to the local surface height. Defining the baseline between the lighting source and the camera sensor, the camera focal length, the local surface height, and the distance of the subject to the camera, and assuming the height is small compared to that distance, we obtain

(12)
Figure 16: Active stereo geometry. The left sensor illustrates the optical image plane of the device projecting light, and the right sensor illustrates the camera sensor. A ray of light projected over a reference plane is perceived by the camera at a given pixel position; when the same ray of light is projected over an arbitrary surface, it is perceived by the camera at a shifted location. The disparity is proportional to the height of the surface.

Equation (12) quantitatively relates the perceived shift (disparity) of the projected pattern to the 3D shape of the surface. As the goal of the present work is to provide local 3D features instead of performing an actual 3D reconstruction, the particular values of the system constants are irrelevant, as we shall see. Moreover, we exploit the fact that the gradient of the local disparity encodes information about the depth gradient. This allows us to extract local geometrical features while bypassing the most challenging steps of 3D reconstruction: global matching and gradient field integration [zhang2013handbook, di2018one, ghiglia1998unwrapping, hartley2003multiple].

Appendix C Gradient Information is Easy to Compute, Absolute Depth is Hard.

One of the key ideas of the presented approach is to estimate local depth descriptors instead of absolute depth information. The former is an easier task, and absolute information is irrelevant for the sake of feature extraction; after all, a robust feature representation should be scale and translation invariant.

More precisely, we can define the problem of integrating the absolute depth from an empirical estimation of its gradient map in a 2D domain as the optimization problem:

(13)

The solution of Equation (13) is non-trivial. Noise, shadows, and facial discontinuities produce empirical gradient estimates that are not irrotational, i.e., for some pixels the curl of the field is different from zero. Therefore, the space of target functions and the minimization norm must be carefully chosen in order to achieve a meaningful solution to Equation (13). This is a complex mathematical problem and has been extensively studied in the literature [agrawal2005algebraic, agrawal2006range, du2007robust, reddy2009enforcing, tumblin2005want].

Appendix D Implementation details

Figure 17: Samples from the ND-2006 database. The left column shows the RGB images, while the right column shows their corresponding depth images.

D.1 Light projection

Figure 18: Texture and shape information in a single RGB image. On the left side we show the ground truth depth (top) and texture (bottom) facial information for a sample of the ND-2006 dataset. With the geometric model described in Section B, we simulated images acquired under the projection of a periodic fringe pattern. The absolute value of the Fourier transform (in logarithmic scale) is illustrated at the right side of each example. From left to right, we show how the baseline (distance between the light source and the camera) impacts the simulation; recall the role of the baseline in Equation (12). From top to bottom, we show the effect of the width of the fringes, which defines the fundamental frequency of the pattern (see Section 3.1). Note that since high frequency patterns are displayed, the reader may observe aliasing artifacts due to limited PDF resolution (zooming into the image is recommended).

From ground truth depth and texture facial information, images under the projection of different high frequency patterns can be simulated, as illustrated in Figure 18. We model the physical configuration of the system as described in Section B, and different parameters for the baseline and the fringe width were tested. We assumed a fixed focal length in all our experiments. The pseudo-code for the generation of samples under active illumination is summarized in Algorithm 2. It is important to highlight that although simulating the pattern deformation from the depth is easy, the inverse problem is very hard. This is why we designed a DNN-based approach that estimates, from the deformation of the pattern, the gradient of the depth rather than the depth itself (see Section 3).

1:procedure LightProjection()
2:Read the light pattern to be projected (pre-designed).
3:     
4:Simulate the local disparity (see (12)).
5:     
6:Compute local pattern deformation.
7:     
8:Account for the texture.
9:     
10:     return Image with active illumination.
11:end procedure
Algorithm 2 Active light projection. Models the resulting RGB image when a pattern of structured light is projected over a surface with a given depth profile and texture.
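A NumPy sketch of Algorithm 2 under simplifying assumptions: the disparity is taken to be proportional to the local depth (a linearized reading of Equation (12), with the baseline, focal length, and subject distance folded into a single constant), and the pattern is shifted along the vertical axis only. All names and the default constant are illustrative.

```python
import numpy as np

def light_projection(pattern, depth, texture, k=0.2):
    """Simulate the RGB image seen when a structured-light pattern hits a surface.

    pattern: (H, W) projected intensity profile (e.g., binary horizontal fringes).
    depth:   (H, W) height map of the surface.
    texture: (H, W) or (H, W, 3) reflectance of the surface.
    k:       assumed disparity-per-unit-depth factor (folds in baseline, focal
             length, and subject distance; see Eq. (12)).
    """
    h, w = depth.shape
    disparity = k * depth                      # local shift of the fringes
    rows = np.arange(h)[:, None] + disparity   # shift along the vertical axis
    rows = np.clip(np.round(rows).astype(int), 0, h - 1)
    cols = np.broadcast_to(np.arange(w)[None, :], (h, w))
    deformed = pattern[rows, cols]             # locally deformed pattern
    if texture.ndim == 3:
        deformed = deformed[..., None]
    return texture * deformed                  # account for the surface texture
```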

D.2 Network architectures

Table 5 shows the architecture of the network that performs the texture and depth information decomposition (as described in Section 3). Table 6 shows the architecture of the layers trained for facial feature extraction (illustrated as the yellow/green blocks in Figure 3). Each module of the proposed framework is implemented using standard TensorFlow layers (version 1.13).

Input: image with proj. fringes.
(A) RGB - branch
Layer A.1: Conv - kernels , BN, LeakyRelu.
Layer A.2: Conv - kernels , BN, LeakyRelu.
Layer A.3: Conv - kernels , BN, LeakyRelu.
Layer A.4: Conv - kernels , BN, LeakyRelu.
Layer A.5: Resize(
Output A: recovered texture.
(B) Depth - branch
Layer B.1: Average(axis=3) (convert to gray).
Layer B.2: Conv - kernels , BN, LeakyRelu.
Layer B.3: Conv - kernels , BN, LeakyRelu.
Layer B.4: Conv - kernels , BN, LeakyRelu.
Layer B.5: Conv - kernels , BN, LeakyRelu.
Layer B.6: Resize(
Output B: recovered .
Table 5: Decomposition network (illustrated in blue in Figure 3). Conv denotes a convolutional layer and BN batch normalization. Standard TensorFlow (v1.13) layers are used.
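Since some kernel sizes and filter counts in Table 5 are not legible above, the following TensorFlow 1.13-style sketch of the RGB branch uses illustrative values; only the overall structure (stacked Conv + BN + LeakyReLU blocks followed by a resize) follows the table.

```python
import tensorflow as tf  # TensorFlow 1.13, as used in the paper

def rgb_branch(x, out_size=(128, 128), training=True):
    """Sketch of branch (A) of the decomposition network (Table 5).

    x: batch of images with projected fringes, shape (N, H, W, 3).
    Filter counts, kernel sizes, and output size are illustrative assumptions.
    """
    for filters in (32, 32, 16, 3):
        x = tf.layers.conv2d(x, filters, kernel_size=3, padding="same")
        x = tf.layers.batch_normalization(x, training=training)
        x = tf.nn.leaky_relu(x)
    # Final resize to the (lower) resolution of the recovered texture image.
    return tf.image.resize_images(x, out_size)
```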

Input: ( for texture, for depth image).
Layer 1.1: Conv - kernels, stride, IN, Relu.

Layer 1.2: Conv - kernels stride , IN, Relu.
Path A1
Layer A1.1: Conv - kernels stride 2, IN.
Path B1
Layer B1.1: SepConv - kernels , IN, Relu.
Layer B1.2: SepConv - kernels , IN, Relu.
Layer B1.3: MaxPooling .
Layer 2: Path A1 + Path B1 (output B1.3 + output A1.1)
Path A2
Layer A2.1: Conv - kernels stride 2, IN.
Path B2
Layer B2.1: SepConv - kernels , IN, Relu.
Layer B2.2: SepConv - kernels , IN, Relu.
Layer B2.3: MaxPooling .
Layer 3: x = Path A2 + Path B2 (output B2.3 + output A2.1)
Middle flow: x: (Repeat 2 times)
Layer 4.1: SepConv - kernels , IN, Relu.
Layer 4.2: SepConv - kernels , IN, Relu.
Layer 4.3: SepConv - kernels , IN, Relu.
Layer 4.4: Add(x, output Layer )
Exit flow:
Path A5
Layer A5.1: Conv - kernels stride 2, IN.
Path B5
Layer B5.1: SepConv - kernels , IN, Relu.
Layer B5.2: SepConv - kernels , IN, Relu.
Layer B5.3: MaxPooling .
Layer 6: Path A5 + Path B5 (output B5.3 + output A5.1)
Layer 7: SepConv - kernels , IN, Relu.
Layer 8: SepConv - kernels , IN, Relu.
Layer 9: Global Average Pooling 2D.
Output: facial features.
Table 6: Feature embedding (illustrated in yellow/green in Figure 3). Conv denotes convolution layer, SepConv separable convolution, BN batch normalization, IN instance normalization. Standard tensorflow (v1.13) layers are used.

Training protocol.

The training procedure consists of three phases: first we perform 10 epochs using stochastic gradient descent (SGD), then we run 20 additional epochs using the Adam optimizer, and finally we perform 10 epochs using SGD. During these phases the learning rate is set to a fixed value. These three steps are commonly referred to in the literature as "network warm-up", "training", and "fine-tuning". We observed that training each framework module following this protocol leads to stable and satisfactory results (as reported in Section 4). However, in the present work we did not focus on optimizing the network architectures or the training protocols.

d.3 Simulation of spoofing attacks

Spoofing attacks are simulated to test face recognition models, in particular how robust these frameworks are to (unseen) spoofing attacks. As the present work focuses on the combination of texture- and depth-based features, the simulation of spoofing attacks must account for realistic texture and depth models. Algorithm 3 summarizes the main steps for the simulation of spoofing attacks, which are detailed next.

1:procedure TransformText() Sim. spoofing texture
2:Init. optimal filter (this is done only once and off-line).
3:      InitKernel(RealEx., SpoofEx.) (14)
4:Filter the genuine sample.
5:      filter
6:     return Simulated spoofing texture
7:end procedure
8:procedure TransformDepth Sim. spoofing depth
9:Compute spoofing depth
10:      = SpoofDepthModel() E.g., (16),(17).
11:     return Simulated spoofing depth
12:end procedure
13:procedure SpoofingSample()
14:Simulate sample texture.
15:      TransformText()
16:Simulate sample depth.
17:      TrasformDepth()
18:     return
19:end procedure
Algorithm 3 Synthesis of the texture and depth of spoofing attacks.

To simulate realistic texture conditions, the CASIA dataset is analyzed [zhang2012face]. This dataset contains videos collected under diverse spoofing attacks, e.g., video and photo attacks; 50 different subjects participated in the data collection, and 600 video clips were collected. We extracted a subset of random frames from these videos (examples are illustrated in Figure 19) and use them to identify certain texture properties that characterize spoofing attacks, as we explain next.

Figure 19: Spoofing and genuine examples from CASIA [zhang2012face] dataset.

Several works published in recent years support the claim that certain texture patterns are characteristic of spoofing photographs [Atoum2017, Li2018, li2004live, Li2017spoofing, Liu2018spoofing, yeh2017face]. In particular, differences in the Fourier domain have been reported [li2004live], which provide cues for certain classes of spoofing attacks [yeh2017face]. Following these ideas, we propose a simple model for the synthesis of spoofing attacks from genuine samples extracted from the ND-2006 dataset. We assume a generic linear model in which the texture of the simulated spoofing attack is obtained by filtering the genuine sample with a kernel that is estimated as described next.

Let us consider a set of facial images associated with spoofing attacks (here extracted from the CASIA dataset) and a set of genuine facial images (for example, from the CASIA or ND-2006 datasets). As in Section 3, we work with the 2D discrete Fourier transform of each image. Given these ground-truth examples of genuine and spoofing facial images, we define the kernel as

(14)

where the average magnitude spectra of the spoofing and genuine samples are defined as follows

(15)

A small constant is included for numerical stability. Observe that these averages account only for the absolute value of the Fourier transform of the real and spoofing samples, while the phase information is discarded. Although in principle an arbitrary phase factor could be included, we constrain the solution to be a symmetric kernel, which is equivalent to enforcing a null phase. To perform the average defined in Equation (15), the 2D coordinates associated with the frequency domain must be referred to a common coordinate frame, which allows us to aggregate the frequency information of images of heterogeneous resolution.

Figure 20: Texture simulation of spoofing attacks. (a) The linear kernel optimized so that the Fourier spectra of real samples match those of spoofing samples. (b) An example of an image of a genuine subject (left, from the ND-2006 set), the simulated texture of a spoofing attack (center), and the difference between the two (right).

Figure 20-(a) shows the kernel obtained using 600 spoofing samples from the CASIA dataset and 600 genuine samples from ND-2006. As we can see, the resulting kernel is composed of predominantly positive values, which suggests that applying it essentially produces a blurred version of the original image. This empirical result is in accordance with the findings reported in [li2004live, yeh2017face]. Figure 20-(b) shows an example of an image of a genuine face (left), the result of applying the estimated kernel (center), and the difference between them (right).
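The sketch below illustrates one way to estimate and apply such a kernel: it averages the magnitude spectra of genuine and spoofing samples on a common frequency grid, forms a zero-phase frequency response from their ratio, and filters a genuine (grayscale) image with the resulting symmetric kernel. Since Equations (14)-(15) are not reproduced here, the exact form of the ratio, the stabilizing constant eps, the grid size, and the kernel cropping are assumptions consistent with the description above, not the authors' exact implementation.

import numpy as np
from scipy.signal import fftconvolve

def average_magnitude_spectrum(images, size=(256, 256)):
    # Average |FFT| over a set of grayscale images resampled to a common grid
    # (this referring of heterogeneous resolutions to one frame follows the text above).
    acc = np.zeros(size)
    for img in images:
        yi = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
        xi = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
        resized = img[np.ix_(yi, xi)].astype(float)
        acc += np.abs(np.fft.fftshift(np.fft.fft2(resized)))
    return acc / len(images)

def estimate_spoof_kernel(genuine_imgs, spoof_imgs, eps=1e-3, size=(256, 256)):
    # Zero-phase kernel whose frequency response maps genuine spectra toward spoof spectra.
    A_real = average_magnitude_spectrum(genuine_imgs, size)
    A_spoof = average_magnitude_spectrum(spoof_imgs, size)
    H = (A_spoof + eps) / (A_real + eps)             # magnitude-only response (null phase)
    h = np.real(np.fft.ifft2(np.fft.ifftshift(H)))   # symmetric spatial kernel
    return np.fft.fftshift(h)

def simulate_spoof_texture(genuine_img, kernel, crop=15):
    # Filter a genuine grayscale image with the (cropped, normalized) kernel.
    c = kernel.shape[0] // 2
    k = kernel[c - crop:c + crop + 1, c - crop:c + crop + 1]
    k = k / k.sum()
    return fftconvolve(genuine_img.astype(float), k, mode='same')

In line with the observation above, the dominant positive, low-pass behavior of the estimated response makes the simulated spoof texture look like a mildly blurred version of the genuine image.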

Finally, the depth profile of different types of spoofing attacks is simulated. To this end, different kinds of polynomial forms were evaluated. We simulated planar attacks, which take place when the attack is deployed using a phone or a tablet, and non-planar attacks, common when the attacker uses a curved printed portrait. We modeled the latter as random parabolic 3D shapes. The principal axis of each surface is randomly oriented, presenting an angle with respect to the vertical (as illustrated in Figure 21). The depth profile z(x,y) is given by

(16)

where one constant represents the width of the spatial domain (here 480 pixels), the curvature of the surface is set by a random variable sampled from a normal distribution, and the offset of the principal axis is set by another random variable sampled from its corresponding distribution. We also tested arbitrary polynomial forms with random coefficients, e.g.,

(17)

where the coefficients are sampled from a uniform distribution and a normalization factor is applied. Figure 21 illustrates (right side) some examples of the generated spoofing depth profiles.
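For illustration, the following sketch generates random planar and parabolic spoofing depth profiles in the spirit of Equation (16) and of the planar attacks described above: the parabolic surface has a randomly oriented principal axis plus a random curvature and offset. The parameter ranges (curvature_std, max_offset, max_angle_deg, max_tilt) are hypothetical and not the distributions used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def parabolic_spoof_depth(width=480, height=480,
                          curvature_std=1e-4, max_offset=100.0, max_angle_deg=30.0):
    # Cylinder-like parabolic surface whose principal axis is rotated by a random
    # angle with respect to the vertical (curved printed portrait).
    y, x = np.mgrid[0:height, 0:width].astype(float)
    alpha = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))  # axis orientation
    x0 = width / 2 + rng.uniform(-max_offset, max_offset)           # axis offset
    c = abs(rng.normal(0.0, curvature_std))                         # curvature
    # Signed distance of each pixel to the (rotated) principal axis.
    d = (x - x0) * np.cos(alpha) - (y - height / 2) * np.sin(alpha)
    return c * d ** 2

def planar_spoof_depth(width=480, height=480, max_tilt=0.2):
    # Planar attack (phone or tablet screen): a randomly tilted plane.
    y, x = np.mgrid[0:height, 0:width].astype(float)
    a, b = rng.uniform(-max_tilt, max_tilt, size=2)
    return a * (x - width / 2) + b * (y - height / 2)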

Figure 21: Depth simulation of the spoofing attacks. On the left we illustrate the geometry of the simulation process. The right side shows examples of the generated depth profiles associated with spoofing attacks.

D.4 Facial appearance under different illumination

Using the original texture and depth facial information, the appearance of the face under novel illumination conditions can be simulated. One of the simplest and most widely accepted models consists of treating the facial surface as a Lambertian surface [basri2003lambertian, shashua1997, Zhou2007]. Hence, the amount of light reflected can be estimated as proportional to the cosine of the angle between the local surface normal and the direction from which the light reaches the surface.

More precisely,

(18)

where we denote the intensity of the reflected light, the intensity of the incident light, the surface albedo, the unit vector normal to the surface, and the unit vector indicating the direction from which the light rays reach the surface at each point. If the depth profile is known, the normal vector to the surface can be computed as

(19)

Algorithm 4 summarizes the main steps for simulating new samples under novel illumination conditions. Figure 22 shows some illustrative results for illuminants of different intensities located at different relative positions with respect to the face.

1:procedure addAmbientLight()
2:Make sure the depth map is smooth and noise-free.
3:      Denoising (Gradient op. amplifies noise.)
4:Compute depth gradient.
5:      ComputeGradient
6:Compute surface normals (19).
7:     
8:     
9:Additional light factor (due to the new source).
10:     
11:For each color add the additional brightness.
12:     
13:Model camera saturation.
14:     clip
15:     return
16:end procedure
Algorithm 4 Steps for the simulation of additional ambient light. The inputs represent: the facial albedo, the facial depth map, the intensity of the source light, the direction in which the light source is located. Examples are shown in Figure 22.
Figure 22: Examples of new samples under different lighting conditions. The groups on the left, middle and right correspond to a light source located at the left, right, and top of the scene, respectively. The images on the bottom are created assuming a brighter light source (a higher light intensity; see Algorithm 4).
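For reference, a minimal NumPy sketch of the relighting step in Algorithm 4 is given below, using the gradient-based normals of Equation (19) and Lambertian shading as in Equation (18). The Gaussian denoising, the way the extra brightness is combined with the albedo, and the 8-bit clipping range are illustrative assumptions rather than the exact operations used by the authors.

import numpy as np
from scipy.ndimage import gaussian_filter

def add_ambient_light(albedo, depth, light_intensity, light_dir, sigma=2.0):
    # Simulate an extra light source given the facial albedo (HxWx3, uint8) and depth (HxW).
    # Denoise the depth map: the gradient operator amplifies noise.
    z = gaussian_filter(depth.astype(float), sigma=sigma)
    # Depth gradient and surface normals, n = (-dz/dx, -dz/dy, 1) / ||.||  (Eq. 19).
    dz_dy, dz_dx = np.gradient(z)
    n = np.dstack([-dz_dx, -dz_dy, np.ones_like(z)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    # Lambertian shading: reflected light proportional to max(0, <n, l>)  (Eq. 18).
    l = np.asarray(light_dir, dtype=float)
    l /= np.linalg.norm(l)
    shading = np.clip(n @ l, 0.0, None)
    # Add the additional brightness to each color channel and model camera saturation.
    out = albedo.astype(float) * (1.0 + light_intensity * shading[..., None])
    return np.clip(out, 0, 255).astype(np.uint8)

Increasing light_intensity reproduces the brighter renderings shown in the bottom rows of Figure 22, while changing light_dir moves the virtual source to the left, right, or top of the scene.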

Appendix E Hardware implementation: a feasibility study

We tested a potential hardware implementation of the proposed ideas using commercially available components. To project the fringe pattern we used an EPSON 3LCD projector, and to capture images a Logitech C615 webcam, each at its native resolution. The characteristics of this particular hardware setup are independent of the proposed ideas; one could choose, for example, to project infrared light, or to design a dedicated setup that meets specific deployment conditions.

Figure 23 shows some empirical results obtained by photographing a mannequin at different relative positions and under different illumination. We qualitatively tested how the width of the projected pattern affects the estimation performance. In addition, we tested (see Figure 24) how the distance of the face to the camera-projector system affects the prediction of the depth derivative. These two experiments are related, since changing the distance to the camera changes the apparent width of the fringes as seen by the camera.

Figure 23: Testing the model on a hardware implementation. We projected the proposed light pattern using a commercial projector (EPSON 3LCD) and captured facial images using a standard webcam (Logitech C615). Side by side, we show the captured image (left) and the x-partial derivative of the depth estimated by our DNN model (right). We tested projected patterns of different periods, ranging from 10px to 27px (pixels measured in the projector sensor).
Figure 24: Changing the distance to the camera and pose. Complementing the results shown in Figure 23, we tested how our DNN model performs as the distance to the camera-projector system changes. The left column shows, side by side, the image captured by the camera and the partial derivative of the depth estimated by our pre-trained model; the distance of the test face to the camera ranged from 50cm to 70cm. Also shown are, on the top, the results obtained for different head poses and, on the bottom, the output when a planar facial portrait is presented.

It is important to take into account that in these experiments we tested our pre-trained models (completely agnostic to this particular hardware implementation), i.e., no fine-tuning or calibration was performed prior to capturing the images presented in Figures 23 and 24. In production settings, in contrast, one would fix a specific hardware setup, collect new ground truth data, and fine-tune the models to optimize for the setup at hand.