Synthesizing Coupled 3D Face Modalities by Trunk-Branch Generative Adversarial Networks

09/05/2019 · by Baris Gecer, et al.

Generating realistic 3D faces is of high importance for computer graphics and computer vision applications. Generally, research on 3D face generation revolves around linear statistical models of the facial surface. Nevertheless, these models cannot represent faithfully either the facial texture or the normals of the face, which are crucial for photo-realistic face synthesis. Recently, it was demonstrated that Generative Adversarial Networks (GANs) can be used for generating high-quality textures of faces. Nevertheless, the generation process either omits the geometry and normals, or independent processes are used to produce 3D shape information. In this paper, we present the first methodology that generates high-quality texture, shape, and normals jointly, which can be used for photo-realistic synthesis. To do so, we propose a novel GAN that can generate data from different modalities while exploiting their correlations. Furthermore, we demonstrate how we can condition the generation on the expression and create faces with various facial expressions. The qualitative results shown in this pre-print are compressed due to size limitations; full-resolution results and the accompanying video can be found at the project page: https://github.com/barisgecer/TBGAN.

1 Introduction

Generating 3D faces with high-quality texture, shape and normals is of paramount importance in computer graphics, movie post-production, computer games, etc. Other applications of such approaches include generating synthetic training data for face recognition [gecer2018facegan] and modeling the face manifold for 3D face reconstruction [gecer2019ganfit]. Currently, 3D face generation in computer games and movies is performed by expensive capturing systems or by professional technical artists. The current state-of-the-art methods generate faces which can be suitable for applications such as caricature avatar creation on mobile devices [hu2017avatar], but do not generate high-quality shape and normals that can be used for photo-realistic face synthesis. In this paper, we propose the first methodology for high-quality face generation that can be used for photo-realistic face synthesis (i.e., joint generation of texture, shape and normals) by capitalising on the recent developments in Generative Adversarial Networks (GANs).

Early face models such as [blanz1999morphable] represent the 3D face by disentangled PCA models of geometry, expression [cao2013facewarehouse] and colored texture, called 3D Morphable Models (3DMMs). 3DMMs and their variants have long been the most popular methods for modelling shape and texture separately. However, the linear nature of PCA is often unable to capture high-frequency signals properly and, thus, the quality of generation and reconstruction by PCA is sub-optimal.

GANs are a recently introduced family of techniques that train samplers of high-dimensional distributions [goodfellow2014generative]. It has been demonstrated that when a GAN is trained on facial images, it can generate images that exhibit the main characteristics of real faces. In particular, the recently introduced GANs [karras2017progressive, karras2018style, brock2018large] can generate photo-realistic high-resolution faces. Nevertheless, because they are trained on 2D images, they cannot properly model the manifold of faces and thus (a) inevitably create many unrealistic instances and (b) it is not clear how they can be used to generate photo-realistic 3D faces.

Recently, GANs have been applied to generating facial texture for various applications. In particular, [sela2017unrestricted] and [gecer2018facegan] utilize style transfer GANs to generate photorealistic images of 3DMM-sampled novel identities. [slossberg2018high] directly generates high-quality 3D facial textures with GANs, and [gecer2019ganfit] replaces 3D Morphable Models (3DMMs) with GAN models for 3D texture reconstruction while the shape is still maintained by statistical models. On the other hand, [moschoglou20193dfacegan] models 3D shape by GANs in a parametric UV map, and [ranjan2018generating] utilizes mesh convolutions with variational autoencoders to model shape in its original structure. Although one can model 3D faces with such shape and texture GAN approaches, these studies omit the correlation between shape, normals and texture, which is very important for photorealism in the identity space. The significance of this correlation is most visible with inconsistent facial attributes such as age, gender and ethnicity (e.g. an aged texture on an infant's face geometry).

In order to address these gaps, we propose a novel multi-branch GAN architecture that preserves the correlation between different 3D modalities (such as texture, shape, normals and expression). After converting all modalities into UV space and concatenating them over channels, we train a GAN that generates all modalities with meaningful local and global correspondence. In order to prevent incompatibility issues due to the differing intensity distributions of the modalities, we propose a trunk-branch architecture that can synthesize photorealistic 3D faces with coupled texture and geometry. Further, we condition this GAN on expression labels to generate faces with any desired expression.

From a computer graphics point of view, photorealistic face rendering requires a number of elements to be tailored, i.e. shape, normals and albedo maps, some of which should or can be specific to a particular identity. However, the cost of hand-crafting novel identities limits their usage in applications that require a large number of identities. The proposed approach tackles this with reasonable photorealism over a massively generalized identity space. Although the results in this paper are limited to the aforementioned modalities by the dataset at hand, the proposed method allows adding more identity-specific modalities (e.g. cavity, gloss, scatter) once such a dataset becomes available.

The contributions of this paper can be summarized as follows:

  • We propose to model and synthesize coherent 3D faces by jointly training a novel trunk-branch GAN (TBGAN) architecture for the shape, texture and normals modalities. TBGAN is designed to maintain correlation while tolerating domain-specific differences between the three modalities, and can be easily extended to other modalities and domains.

  • In the domain of identity-generic face modelling, we believe this is the first study that utilizes normals as an additional source of information.

  • We propose the first methodology for face generation that correlates expression and identity geometries (i.e. modelling personalized expression) and also the first attempt to model expression in texture and normals space.

2 Related Work

2.1 3D face modelling

There is an underlying assumption that human faces lie on a manifold with respect to the appearance and geometry. As a result, one can model the geometry and appearance of the human face analytically based upon the identity and expression space of all individuals. Two of the first attempts in the history of face modeling were [akimoto1993automatic], which proposes part-based 3D face reconstruction from frontal and profile images, and [platt1981animating], which represents expression action units by a set of muscle fibers.

Twenty years ago, methods that generated 3D faces revolved around parametric generative models driven by a small number of anthropometric statistics (e.g., sparse face measurements in a population) which act as constraints [decarlo1998anthropometric]. The seminal work on 3D morphable models (3DMMs) [blanz1999morphable] demonstrated for the first time that it is possible to learn a linear statistical model from a population of 3D faces [patel20093d, brunton2014review]. 3DMMs are often constructed by applying Principal Component Analysis (PCA) to a dataset of registered 3D scans of hundreds [paysan20093d] or thousands [booth2018large] of subjects. Similarly, facial expressions are also modeled by applying PCA [yang2011expression, li2017learning, breidt2011robust, amberg2008expression], or are manually defined using linear blendshapes [li2010example, thies2015real, bouaziz2013online]. 3DMMs, despite their advantages, are bounded by the capacity of the linear space, which under-represents high-frequency information and often results in overly-smoothed geometry and texture models. Furthermore, the 3DMM line of research assumes that texture and shape are uncorrelated, hence they can only be produced by separate models (i.e., separate PCA models for texture and shape). Early attempts at correlating shape and texture were made in Active Appearance Models (AAMs) by computing joint PCA models of sparse shape and texture [cootes2001active]. Nevertheless, due to the inherent limitations of PCA in modelling high-frequency texture, PCA is rarely used to correlate shape and texture for 3D face generation.

Recent progress in generative models [kingma2013auto, goodfellow2014generative] has been adopted in 3D face modelling to tackle this issue. [moschoglou20193dfacegan] trained a GAN that models face geometry based on UV representations of neutral faces and, likewise, [ranjan2018generating] modelled identity and expression geometry with variational autoencoders based on mesh convolutions. [gecer2019ganfit] proposed GAN-based texture modelling for 3D face reconstruction while modelling geometry by PCA, and [slossberg2018high] trained a GAN to synthesize facial textures. To the best of our knowledge, these methodologies completely omit the correlation between geometry and texture and, moreover, ignore identity-specific expression modelling by decoupling expression and identity into separate models. In order to address this issue, we propose a trunk-branch GAN that is trained jointly on texture, shape, normals and expression in order to leverage non-linear generative networks for capturing the correlation between these modalities.

2.2 Photorealistic face synthesis

Although most of the aforementioned 3D face models can readily be used to synthesize 2D face images, several 2D face generation studies are also worth mentioning. [mohammed2009visio] combine non-parametric local and parametric global models to generate a diverse set of face images. The recent family of GAN approaches [radford2015unsupervised, karras2017progressive, karras2018style, brock2018large] offers state-of-the-art high-quality unconstrained face generation.

Some other GAN-based studies allow conditioning the synthesized faces on rendered 3DMM images [gecer2018facegan], on landmarks [bazrafkan2018face] or on another face image [bao2018towards] (i.e. by disentangling identity and certain facial attributes). Similarly, facial expression has been conditionally synthesized from an audio input [jamaludin2019you], from action unit codes [pumarola2018ganimation], from predefined 3D geometry [zhang2005geometry] or from the expression of another face image [li2012data].

In this work, we jointly synthesize the aforementioned modalities for coherent photorealistic face synthesis by leveraging the high-frequency generation capability of GANs. Unlike many of its 2D and 3D alternatives, the resulting generator provides absolute control over disentangled identity, pose, expression and illumination spaces. Unlike many other GAN works, which struggle with misalignments among the training data, our entire latent space corresponds to realistic 3D faces, as the data representation is naturally aligned in UV space.

2.3 Boosting face recognition by synthetic training data

There have also been works that synthesize face images to be used as synthetic training data for face recognition methods, either by directly using GAN-generated images [trigueros2018generating] or by controlling the pose space with a conditional GAN [tran2018representation, hu2018pose, shen2018faceid]. [masi2016we] propose many augmentation techniques, such as rotating, changing expression and shape, based on 3DMMs. Other GAN-based approaches that capitalize on 3D facial priors include [zhao2017dual], which rotates faces by fitting a 3DMM and preserves photorealism with translation GANs, and [yin2017towards], which frontalizes face images by an end-to-end translation framework consisting of a 3DMM regression network and adversarial supervision. [deng2018uv] complete missing parts of UV texture representations of 2D images after 3DMM fitting with a translation GAN. [gecer2018facegan] first synthesize face images of novel identities by sampling from a 3DMM and then remove the photorealism domain gap with an image-to-image translation GAN.

All of these studies show the significance of photorealistic and identity-generic face synthesis for the next generation of facial recognition algorithms. Although this study focuses more on the graphical aspects of face synthesis, we show that the synthetic images can also improve face recognition performance.

2.4 Person-specific face models

There have been a number of studies that model the appearance and geometry of only a single or a few identities with excellent quality. [guenter2005making] utilize appearance models for identities for which a large number of images were captured in a controlled environment. Similarly, [lombardi2018deep] propose deep appearance models based on variational autoencoders. Even though it produces very high rendering quality with various expressions, this method also requires capturing 20 million images of the subjects in a controlled environment. Although [cao2016real] reduce this number to a few in-the-wild images, the quality of the identity geometry and appearance is limited by the number and quality of the provided images. Nevertheless, all these studies model up to a few individuals at a time, either by interpolating different captures or by training person-specific generative networks. Our method differs from these methods in its capability to generalize over the identity space. It reduces the dimensionality of a 3D mesh with 50,000 nodes to a 512-dimensional latent vector that generalizes very well across different identities and expressions.

3 Approach

3.1 UV Maps for Shape, Texture and Normals

Fig. 2: UV extraction process. In (a) we present a raw mesh, in (b) the registered mesh using the Large Scale Face Model (LSFM) template [booth20163d], in (c) the unwrapped 3D mesh in the 2D UV space, and finally in (d) the interpolated 2D UV map. Interpolation is carried out using the barycentric coordinates of each pixel in the registered 3D mesh.

In order to feed the shape, texture and normals of the facial meshes into a deep network, we need to reparameterize them into an image-like tensor format suitable for 2D convolutions. (An alternative line of research uses convolutions directly on the 3D mesh structure; however, at the time of writing, state-of-the-art mesh convolutional networks, e.g. [cheng2019meshgan, litany2018deformable, ranjan2018generating], were not able to preserve high-frequency details of the texture and normals.) We begin by describing all the raw 3D facial scans with the same topology and number of vertices (dense correspondence). This is achieved by non-rigidly morphing a template mesh to each of the raw scans. We employ a standard non-rigid iterative closest point algorithm as described in [amberg2007optimal, de2010optimal] and deform our chosen template so that it correctly captures the facial surface of the raw scans. As template mesh we choose the mean face of the LSFM model proposed in [booth20163d], which consists of approximately 50,000 vertices, sufficient to depict high-frequency facial details.

After reparameterizing all the meshes into the LSFM [booth20163d] topology, we cylindrically unwrap the mean face of the LSFM [booth20163d] to create a UV representation for that specific mesh topology. In the literature, a UV map is commonly utilized for storing only the RGB texture values. Apart from storing the texture values of the 3D meshes, we utilize the UV space to store the 3D coordinates and the normal orientation of each vertex. Before storing the 3D coordinates in the UV space, all meshes are aligned in 3D space by performing General Procrustes Analysis (GPA) [gower1975generalized] and are normalized to a common scale. We then store each vertex's 3D coordinates and normals at its respective UV pixel coordinate. Finally, we perform a barycentric interpolation based on the barycentric coordinates of each pixel on the registered mesh to fill in the missing areas and produce a dense UV map. In Fig. 2, we illustrate a raw 3D scan, the registered 3D scan on the LSFM [booth20163d] template, the sparse UV map of 3D coordinates and finally the interpolated one.
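As a concrete illustration of this baking step, the following is a minimal NumPy sketch that rasterizes per-vertex attributes (XYZ coordinates, normals or RGB values) into a dense UV map via barycentric interpolation. The function name, the default map size and the assumption of precomputed per-vertex UV coordinates are ours, not the paper's.

```python
import numpy as np

def rasterize_to_uv(uv, faces, attrs, size=256):
    """Bake per-vertex attributes (XYZ, normals or RGB) into a dense UV map.

    uv    : (V, 2) per-vertex UV coordinates in [0, 1]
    faces : (F, 3) triangle vertex indices
    attrs : (V, C) per-vertex attributes to interpolate
    """
    uv_map = np.zeros((size, size, attrs.shape[1]), dtype=np.float32)
    px = uv * (size - 1)                                   # vertices in pixel space
    for f in faces:
        a, b, c = px[f]                                    # 2D triangle corners
        T = np.array([b - a, c - a]).T                     # 2x2 edge matrix
        if abs(np.linalg.det(T)) < 1e-12:                  # skip degenerate triangles
            continue
        lo = np.floor(np.minimum(np.minimum(a, b), c)).astype(int)
        hi = np.ceil(np.maximum(np.maximum(a, b), c)).astype(int) + 1
        ys, xs = np.mgrid[lo[1]:hi[1], lo[0]:hi[0]]
        p = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
        w = np.linalg.solve(T, (p - a).T).T                # weights of corners b and c
        bary = np.concatenate([1.0 - w.sum(1, keepdims=True), w], axis=1)
        inside = (bary >= -1e-6).all(axis=1)               # pixels inside the triangle
        uv_map[p[inside, 1].astype(int), p[inside, 0].astype(int)] = bary[inside] @ attrs[f]
    return uv_map
```

Calling this once with the texture, once with the vertex coordinates and once with the normals yields the three coupled UV maps used for training.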

3.2 Generative Adversarial Networks

Recent advances in generative models have achieved impressive performance in synthesizing diverse sets of photorealistic images [goodfellow2014generative, radford2015unsupervised, kingma2013auto]. In particular, variants of GANs perform very well on various generative applications, including style/domain transfer [zhu2017unpaired], super-resolution [ledig2017photo], pose/label-guided image generation [tran2018representation] and image inpainting/editing [yeh2017semantic], for face [karras2017progressive, karras2018style], body [ma2017pose] and natural [brock2018large] images. In order to achieve variation and photorealism, GANs are trained with a zero-sum game loss function between two competing networks: a generator and a discriminator. That is to say, while the generator is trained to generate samples similar to the training data, the discriminator is trained to separate generated samples from real images of the training set. Both networks improve by exploiting the progress of one another during training. In the end, the generator network can synthesize photorealistic samples that are aligned with the distribution of the training set.
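In the standard formulation of [goodfellow2014generative], this zero-sum game corresponds to the minimax objective

\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right],

where G is the generator, D the discriminator and z the input noise; the WGAN-GP variant used in Section 3.3 replaces this objective with a Wasserstein critic and a gradient penalty.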

3.3 Trunk-Branch GAN to Generate Coupled Texture, Shape and Normals

Fig. 3: An overview of the proposed trunk-branch network generating multiple modalities in order to render more photo-realistic face images. The network is designed to generate correlated texture, shape and normals UV maps of novel identities. The separation into different modalities allows the branch networks to specialize in the characteristics of each modality, while the trunk network maintains local correspondences among them.

In order to train a model that handles multiple modalities, we propose a novel trunk-branch architecture to generate entangled modalities of 3D face such as texture, shape and normals as UV maps. For this task we exploit the MeIn3D dataset [booth20163d] which consists of approximately 10,000 neutral 3D facial scans with wide diversity in age, gender, and ethnicity.

Given a generator network G with a total of n convolutional upsampling layers and Gaussian noise z as input, the activation at the end of layer k (i.e. the output of the trunk) is split into three branch networks, each of which consists of n − k upsampling convolutional layers and generates the texture, normals and shape UV maps respectively. The discriminator D mirrors this structure: it starts with three branch networks, whose activations are concatenated before being fed into the trunk discriminator. The output of D is a regression of the real/fake score (i.e. 1 indicating realism).
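The following PyTorch-style sketch is our own illustration of this layout, not the paper's implementation (which builds on the progressive-growing framework of [karras2017progressive]); the layer counts follow n = 8 and k = 6 from Section 4, and all channel widths are assumptions.

```python
import torch
import torch.nn as nn

def up_block(cin, cout):
    return nn.Sequential(nn.Upsample(scale_factor=2),
                         nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.2))

def down_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.LeakyReLU(0.2), nn.AvgPool2d(2))

class TrunkBranchGenerator(nn.Module):
    """Shared trunk followed by one branch per modality (texture, normals, shape)."""
    def __init__(self, z_dim=512, trunk_layers=6):
        super().__init__()
        self.project = nn.Linear(z_dim, 512 * 4 * 4)          # 4x4 starting resolution
        self.trunk = nn.Sequential(*[up_block(512, 512) for _ in range(trunk_layers)])
        def branch():                                          # two upsampling layers each
            return nn.Sequential(up_block(512, 256), up_block(256, 128),
                                 nn.Conv2d(128, 3, 1))
        self.texture, self.normals, self.shape = branch(), branch(), branch()

    def forward(self, z):
        h = self.trunk(self.project(z).view(-1, 512, 4, 4))   # shared trunk features
        return self.texture(h), self.normals(h), self.shape(h)

class TrunkBranchDiscriminator(nn.Module):
    """Per-modality branches whose features are concatenated into a shared trunk."""
    def __init__(self, trunk_layers=6):
        super().__init__()
        def branch():
            return nn.Sequential(down_block(3, 64), down_block(64, 128))
        self.texture, self.normals, self.shape = branch(), branch(), branch()
        self.trunk = nn.Sequential(down_block(3 * 128, 256),
                                   *[down_block(256, 256) for _ in range(trunk_layers - 1)])
        self.score = nn.Linear(256 * 4 * 4, 1)                 # real/fake regression score

    def forward(self, tex, nor, shp):
        h = torch.cat([self.texture(tex), self.normals(nor), self.shape(shp)], dim=1)
        return self.score(self.trunk(h).flatten(1))
```

For example, TrunkBranchGenerator()(torch.randn(4, 512)) returns a batch of four coupled texture, normals and shape UV maps (1024 × 1024 at these layer counts).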

Although the proposed approach is compatible with most GAN architectures and loss functions, in our experiments we use the progressive growing GAN architecture [karras2017progressive] trained with the WGAN-GP loss [gulrajani2017improved] as follows:

(1)   L_D = E_z[ D(G(z)) ] − E_x[ D(x) ] + λ E_x̂[ ( ‖∇_x̂ D(x̂)‖_2 − 1 )^2 ]
(2)   x̂ = ε x + (1 − ε) G(z)
(3)   L_G = − E_z[ D(G(z)) ]

where x denotes a real training sample (the concatenated texture, normals and shape UV maps), z the Gaussian input noise, ε denotes uniform random numbers between 0 and 1, and λ is a balancing factor which is typically set to 10 [gulrajani2017improved]. An overview of this trunk-branch architecture is illustrated in Fig. 3.
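A minimal sketch of this objective (the standard WGAN-GP of [gulrajani2017improved], applied jointly to the three coupled UV maps; the generator and discriminator interfaces follow the sketch above) could look as follows:

```python
import torch

def wgan_gp_losses(G, D, real_tex, real_nor, real_shp, z_dim=512, lam=10.0):
    """Returns (generator loss, discriminator loss) for one batch of coupled UV maps.
    In practice G and D are updated alternately and the fakes are detached for the D step."""
    n = real_tex.size(0)
    z = torch.randn(n, z_dim, device=real_tex.device)
    fake_tex, fake_nor, fake_shp = G(z)

    d_real = D(real_tex, real_nor, real_shp)
    d_fake = D(fake_tex, fake_nor, fake_shp)

    # Gradient penalty on random interpolations between real and generated samples
    eps = torch.rand(n, 1, 1, 1, device=real_tex.device)
    mix = [eps * r + (1 - eps) * f for r, f in
           [(real_tex, fake_tex), (real_nor, fake_nor), (real_shp, fake_shp)]]
    mix = [m.detach().requires_grad_(True) for m in mix]
    d_mix = D(*mix)
    grads = torch.autograd.grad(d_mix.sum(), mix, create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    gp = ((grad_norm - 1.0) ** 2).mean()

    d_loss = d_fake.mean() - d_real.mean() + lam * gp
    g_loss = -d_fake.mean()
    return g_loss, d_loss
```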

3.4 Expression Augmentation by Conditional GAN

Fig. 4: Overview of the expression-conditioned Trunk-Branch GAN. We annotate the training dataset automatically with an expression recognition network and use the output expression encodings as labels. The generator network learns to couple these expression encodings to the texture, shape and normals UV maps. The generator and discriminator networks have the same architectures as in Fig. 3 and are abbreviated here.

Further, we modify our GAN to generate 3D faces with expression by conditioning it on expression annotations. Similarly to the MeIn3D dataset, we captured a large number of facial scans of distinct identities during a special exhibition in the Science Museum, London. All subjects were recorded in various guided expressions with a 3dMD face capturing apparatus and were asked to provide meta-data regarding their age, gender and ethnicity; the database spans male and female subjects of White, Asian, Mixed Heritage, Black and other ethnicities. In order to avoid the cost and potential inconsistency of manual annotation, we render these scans and automatically annotate them with an expression recognition network. The resulting expression encodings are used as the label vector during the training of our trunk-branch conditional GAN. This training scheme is illustrated in Fig. 4.
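One simple way to realize this conditioning, sketched below under our own assumptions (a 7-dimensional expression encoding and fusion by concatenation with the latent code; the paper's actual label injection inside the progressive-growing framework may differ), is to wrap the unconditional generator:

```python
import torch
import torch.nn as nn

class ExpressionConditionedGenerator(nn.Module):
    """Conditions the trunk-branch generator on an expression encoding e,
    e.g. the output of an expression recognition network (e_dim is an assumption)."""
    def __init__(self, generator, z_dim=512, e_dim=7):
        super().__init__()
        self.generator = generator
        # Map [z, e] back to the latent size the unconditional generator expects
        self.fuse = nn.Linear(z_dim + e_dim, z_dim)

    def forward(self, z, e):
        return self.generator(self.fuse(torch.cat([z, e], dim=1)))
```

The discriminator can be conditioned analogously, e.g. by concatenating e to the features entering its final score layer.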

Unlike previous expression models, which omit the effect of expression on texture, the resulting generator is capable of generating coupled texture, shape and normals maps of a face with controlled expression. Moreover, our generator respects the identity-expression correlation thanks to the correlated supervision provided by the training data. This is in contrast to traditional statistical expression models, which decouple expression and identity into two separate models.

3.5 Photorealistic Rendering with Generated UV maps

For the final rendering to appear photorealistic, we use the generated identity-specific mesh, texture and normals in combination with identity-generic reflectance modalities, and employ a commercial rendering application. We use Marmoset Toolbag [Marmoset19:Toolbag], which performs real-time forward rendering that is highly parameterisable and allows the control of a wide variety of reflectance models and modalities, such as subsurface scattering, specular reflection and high-frequency normals.

In order to extract the 3D representation from the UV domain, we employ the inverse of the procedure explained in Section 3.1, based on the UV pixel coordinates of each vertex of the 3D mesh. Fig. 6 shows the rendering results, under a single light source, when using the generated geometry (Fig. 6(a)) and the generated texture (Fig. 6(b)). Here the specular reflection is calculated on the per-face normals of the mesh and exhibits steep changes at the face's edges. By interpolating the generated normals on each face (Fig. 6(c)), we are able to smooth the specular highlights and correct any high-frequency noise on the geometry of the mesh. However, these results do not correctly model the human skin and resemble a metallic surface. In reality, the human skin is rough and, as a body tissue, it both reflects and absorbs light, thus exhibiting specular reflection, diffuse reflection and subsurface scattering.
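For intuition on the difference between flat per-face shading and interpolated normals, the following NumPy sketch (illustrative only; the renderer actually consumes the GAN-generated normal UV maps) contrasts per-face normals with area-weighted per-vertex normals:

```python
import numpy as np

def face_and_vertex_normals(verts, faces):
    """verts: (V, 3) vertex positions, faces: (F, 3) triangle indices.
    Returns per-face normals (flat shading, abrupt specular changes at facet edges)
    and area-weighted per-vertex normals (smooth shading)."""
    tri = verts[faces]                                            # (F, 3, 3)
    fn = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])   # area-weighted face normals
    vn = np.zeros_like(verts)
    np.add.at(vn, faces.ravel(), np.repeat(fn, 3, axis=0))        # accumulate onto vertices
    fn /= np.linalg.norm(fn, axis=1, keepdims=True) + 1e-12
    vn /= np.linalg.norm(vn, axis=1, keepdims=True) + 1e-12
    return fn, vn
```

Shading with the interpolated per-vertex normals removes the steep specular transitions at facet edges, which is the same effect the generated normal maps provide at much higher frequency.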

Although such modalities could be added to our multi-branch GAN given the availability of such data, we find that rendering can still be improved by adding some identity-generic maps. Using our training data, we create maps that define certain reflectance properties per pixel and match the features of the average generated identity, as shown in Fig. 5. Scattering (c) defines the intensity of subsurface scattering of the skin. Translucency (d) defines the amount of light that travels inside the skin and gets emitted in different directions. Specular albedo (e) gives the intensity of the specular highlights, which differ between hair-covered areas, the eyes and the teeth. Roughness (f) describes the scattering of specular highlights and controls the glossiness of the skin. A detail normal map (g) is also tiled and added on the generated normal maps to mimic the skin pores, and a detail weight map (h) controls the appearance of the detail normals, so that they do not appear on the eyes, lips and hair.

The final result (Fig. 6(d)) properly models the skin surface and reflection, by adding plausible high-frequency specularity and subsurface scattering, both weighted by the area of the face where they appear.

Fig. 5: Overview of the texture maps used for rendering. The generated identity-specific texture (a) and normal maps (b) are mapped to the base geometry during the rendering process. Moreover, we enhance the rendered results by using identity-generic modalities that describe, per pixel in UV space, the scattering (c), translucency (d), specular intensity (e), roughness (f), detail normals (g) and their weights (h), and are derived from the training data.
(a) shp
(b) shp+tex
(c) shp+tex+nor
(d) photorealistic
Fig. 6: Zoom-in on rendering results when using (a) only the geometry, (b) adding the albedo texture, (c) adding the generated mesoscopic normals and (d) using identity-generic detail normal, specular albedo, roughness, scatter and translucency maps.

4 Results

In this section, we give qualitative and quantitative results of our method for generating 3D faces with novel identities and various expressions. In our experiments, we set n = 8 and k = 6, meaning that there are 8 upsampling/downsampling layers in total, 6 of them in the trunk and 2 in each branch. These choices are empirically validated to ensure sufficient correlation among modalities without incompatibility artifacts. The running time to generate UV images from a latent code is a few milliseconds on a high-end GPU. Transforming a UV image into a mesh is just sampling with the UV coordinates and can be considered free of cost. The renderings in this paper take a few seconds due to their high resolution, but this cost depends on the application. The memory needed for the generator network is 1.25GB, compared to a 6GB PCA model of the same resolution that retains most of the total entropy.

In the following sections, we first visualize unwrapped UV representations of the generated modalities and their contributions to the final renderings for a number of generated faces. Next, we show the generalization ability of the identity and expression generators across a number of attributes. We also demonstrate the generalization of the latent space by interpolating between different identities, and additionally perform full-head completion on the interpolated faces. Finally, we perform a number of face recognition experiments using the generated face images as additional training data.

4.1 Qualitative Results

4.1.1 Combining coupled modalities:

Fig. 7 presents the shape, normals and texture maps generated by the proposed GAN and their additive contributions to the final renderings. As can be seen from the local and global correspondences, the generated UV maps are highly correlated and coherent. Attributes like age, gender, ethnicity, etc. can be easily grasped from all of the UV maps and the rendered images. Please also note that some of the minor artefacts of the generated geometry in Fig. 7(d) are compensated by the normals in Fig. 7(e).

(a) Shape UV
(b) Normals UV
(c) Texture UV
(d) Shape Rendered
(e) Shape+Normals
(f) Shp.+Nor.+Tex.
Fig. 7: Generated UV representations and their corresponding additive renderings. Please note the strong correlation between UV maps, high fidelity and photorealistic renderings. The figure is best viewed in zoom.

4.1.2 Diversity:

Our model generalizes well across different age, gender and ethnicity groups and many facial attributes. Although Fig. 8 shows diversity in only some of these categories, the reader is encouraged to observe the identity variation throughout the paper. From the variation demonstrated, we can safely claim that our models are free of the global or local mode collapse with which many GAN studies struggle. Please refer to the supplementary video for more examples of this diversity.

(a) Age
(b) Ethnicity
(c) Gender
(d) Weight
(e) Roundness
Fig. 8: Variation of 3D faces generated by our model. Each column shows diverse generations along a different aspect.

4.1.3 Expression:

We also show that our expression generator is capable of synthesizing quite a diverse set of expressions. Moreover, the expressions can be controlled by the input label as can be seen in Fig. 9. The reader is encouraged to see more expression generations in the supplementary video.

(a) Happy
(b) Sadness
(c) Anger
(d) Fear
(e) Disgust
(f) Surprise
Fig. 9: The first and fourth rows show generations of different expression categories. The other rows show the texture and normals maps used to generate the corresponding 3D faces. Please note how expressions are represented in the texture and normals space.

4.1.4 Interpolation between identities:

As shown in the supplementary video and in Fig. 13, our model can smoothly interpolate between any two generations through a visually continuous set of identities, which is another indication that the model is free from mode collapse. Interpolation is done by randomly generating two identities and then generating faces from evenly spaced samples in the latent space between the two.
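Concretely, the interpolation amounts to decoding evenly spaced latent codes between two random identities (a sketch, assuming the generator interface used in the earlier sketches):

```python
import torch

def interpolate_identities(G, steps=10, z_dim=512):
    """Generate faces at evenly spaced points between two random identities."""
    z1, z2 = torch.randn(z_dim), torch.randn(z_dim)
    ts = torch.linspace(0.0, 1.0, steps).view(-1, 1)
    z = (1 - ts) * z1 + ts * z2          # (steps, z_dim) evenly spaced latent codes
    with torch.no_grad():
        return G(z)                      # coupled texture, normals and shape UV maps
```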

4.1.5 Full head completion

We also extend our facial 3D meshes to full head representations by employing the framework proposed in [ploumpis2019combining]. We achieve this by regressing from a latent space that represents only the 3D face to the PCA latent space of the Combined Face and Head Model (CFHM) [ploumpis2019combining]. We begin by building a PCA model of the inner face based on the neutral scans of the MeIn3D dataset. Similarly, we exploit the extended full head meshes of the same identities utilized by the CFHM model and project them to the CFHM subspace in order to acquire the latent shape parameters of the entire head topology. Finally, we learn a regression matrix by solving a linear least squares optimization problem as proposed in [ploumpis2019combining], which works as a mapping from the latent space of the face shape to the full head representation. In Figure 14 we demonstrate the extended head representations of our approach in conjunction with the faces in Figure 13.
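The regression step can be sketched as an ordinary least-squares fit (our illustration; the exact formulation, e.g. regularization or the use of a bias term, follows [ploumpis2019combining] and may differ):

```python
import numpy as np

def fit_head_regressor(face_params, head_params):
    """Least-squares mapping W from inner-face latents to CFHM full-head latents.

    face_params : (N, d_face) per-subject inner-face latent parameters
    head_params : (N, d_head) corresponding CFHM full-head latent parameters
    """
    X = np.hstack([face_params, np.ones((face_params.shape[0], 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, head_params, rcond=None)
    return W                                             # (d_face + 1, d_head)

def complete_head(W, face_param):
    """Predict the full-head latent parameters for one face latent vector."""
    return np.append(face_param, 1.0) @ W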

4.1.6 Comparison to decoupled modalities and PCA

The results in Fig. 10 reveal a set of advantages of such unified 3D face modeling over separate GAN and statistical models. The figure clearly shows that the correlation among texture, shape and normals is an important component of realistic face synthesis. Generations by PCA models also significantly lack photorealism and high fidelity.

Fig. 10: Comparison with separate GAN models and a PCA model. (a) Generation by our model. (b) The same texture with random shape and normals. (c) The same texture and shape with random normals (note e.g. the beard). (d) Generation by a PCA model constructed from the same training data and rendered with the same identity-generic tools as explained in Sec. 3.5.

4.2 Pose-invariant Face Recognition

In this section we present an experiment demonstrating that the proposed methodology can generate faces of distinct and diverse identities; that is, we use the generated faces to train a face recognition network. To do so, we employ a recent state-of-the-art face recognition method, ArcFace [deng2018arcface], and show that the proposed shape and texture generation model can boost the performance of pose-invariant face recognition in the wild.

Training Data.

We randomly synthesize 10K new identities from the proposed shape and texture generation model and render 50 images per identity with random camera and illumination parameters drawn from the Gaussian distribution of the 300W-LP dataset [zhu2016face]. For clarity, we call this dataset “Gen” in the rest of the text. Figure 11 illustrates some examples of the “Gen” dataset, which show larger pose variations than the real-world collected data. We augment “Gen” with in-the-wild training data, the CASIA dataset [yi2014learning], which consists of 10,575 identities with 494,414 images.


Fig. 11: Examples of generated data (“Gen”) by the proposed method.

Test Data. For evaluation, we employ Celebrities in Frontal Profile (CFP) [sengupta2016frontal] and the Age Database (AgeDB) [Moschoglou2017AgeDB]. CFP [sengupta2016frontal] consists of 500 subjects, each with 10 frontal and 4 profile images. The evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification. In this paper, we focus on the most challenging subset, CFP-FP, to investigate the performance of pose-invariant face recognition. There are 3,500 same-person pairs and 3,500 different-person pairs in CFP-FP for the verification test. AgeDB [Moschoglou2017AgeDB, deng2017marginal] contains 16,488 images of 568 distinct subjects. The minimum and maximum ages are 1 and 101, respectively, and the average age range for each subject is around 50 years. There are four groups of test data with different year gaps (5, 10, 20 and 30 years, respectively) [deng2017marginal]. In this paper, we only use the most challenging subset, AgeDB-30, to report the performance. There are 3,000 positive pairs and 3,000 negative pairs in AgeDB-30 for the verification test.

Data Preprocessing. We follow the baseline [deng2018arcface] to generate the normalized face crops (112 × 112) by utilizing five facial points.

Training and Testing Details. For the embedding network, we employ the widely used ResNet architecture (ResNet-50) [he2016deep]. After the last convolutional layer, we use the BN-Dropout-FC-BN structure [deng2018arcface] to get the final 512-D embedding feature. For the hyper-parameter setting, we follow [deng2018arcface] and set the feature scale to 64 and the angular margin to 0.5. We set the batch size to 512 and train the models with MXNet [chen2015mxnet] on four NVIDIA Tesla P40 (24GB) GPUs. On the CASIA dataset, the learning rate starts from 0.1 and is divided by 10 at 20K and 28K iterations; training finishes at 32K iterations. On the combined dataset (CASIA and generated data), we divide the learning rate at 30K and 42K iterations and finish the training process at 60K iterations. We set momentum to 0.9 and weight decay to 5e-4. During testing, we keep only the feature embedding network without the fully connected layer (160MB) and extract the 512-D feature of each normalized face. Note that overlapping identities between the CASIA dataset and the test sets are removed for strict evaluation, and we only use a single crop for all testing.
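For reference, with feature scale s = 64 and angular margin m = 0.5, the ArcFace objective followed here is, in the notation of [deng2018arcface],

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos \theta_j}},

where \theta_j is the angle between the embedding of the i-th sample and the weight vector of class j, and y_i is the ground-truth class.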

Result Analysis. In Table I, we show the influence of the generated data on pose-invariant face recognition. We take UV-GAN [deng2018uv] as the baseline method, which attaches the completed UV texture map onto the fitted mesh and generates instances of arbitrary poses to increase pose variation during training and minimize pose discrepancy during testing. For our experimental settings, we use ([training data, network structure, loss]) to facilitate understanding. As we can see from Table I, the generated data significantly boost the verification performance on CFP-FP from 95.56% to 97.12%, decreasing the verification error by more than half compared to the result of UV-GAN [deng2018uv]. On AgeDB-30, combining CASIA and the generated data achieves performance similar to using CASIA alone, because the generated data only add intra-class variance in pose, not in age.

Methods CFP-FP AgeDB-30
UVGAN [deng2018uv] 94.05 94.18
CASIA, R50, ArcFace 95.56 95.15
CASIA+Gen, R50, ArcFace 97.12 95.18
TABLE I: Verification performance (%) of different models on CFP-FP and AgeDB-30.

In Figure 12, we show the angle distributions of all positive pairs and negative pairs from CFP-FP. By incorporating the generated data, the indistinguishable overlap area between the positive and negative histograms is clearly decreased, which confirms that ArcFace can learn pose-invariant feature embeddings from the generated data. In Table II, we select some verification pairs from CFP-FP and calculate the angles between these pairs as predicted by the models trained on CASIA and on the combined data. Intuitively, the angles between these challenging pairs are significantly reduced when the generated data are used for model training.

(a) CASIA
(b) CASIA+Gen
Fig. 12: Angle distributions of CFP-FP pairs in the 512-D feature space. The red area indicates positive pairs while the blue area indicates negative pairs. All angles between feature vectors are given in degrees.
TABLE II: The angles between face pairs from CFP-FP predicted by models trained on CASIA and on the combined data (CASIA+Gen). The generated data clearly enhance the pose-invariant feature embedding.

5 Conclusion

We presented the first neural model for joint texture, shape and normal generation based on Generative Adversarial Networks (GANs). The proposed GAN implements a new architecture for exploiting the correlation between different modalities. Furthermore, we propose a novel GAN model that can generate different expressions by conditioning on the embeddings of a facial expression recognition network. We demonstrate that randomly synthesized samples of our unified generator show strong correlations between texture, shape and normals, and that rendering with the generated normals provides excellent shading and overall visual quality. Finally, in order to demonstrate that our methodology can generate a diverse range of identities, we used a set of generated images to train a deep face recognition network.


Fig. 13: Interpolation between pairs of identities. The smooth transitions indicate the generalization of our GAN model.


Fig. 14: Completed full head representations corresponding to the facial topologies in Fig. 13. Even in the full head topology, our generation methodology ensures a smooth transition during interpolation.

Acknowledgments

Baris Gecer is funded by the Turkish Ministry of National Education. Stefanos Zafeiriou acknowledges support by EPSRC Fellowship DEFORM (EP/S010203/1) and a Google Faculty Award.

References