MeshGAN: Non-linear 3D Morphable Models of Faces

03/25/2019, by Shiyang Cheng et al.

Generative Adversarial Networks (GANs) are currently the method of choice for generating visual data. Certain GAN architectures and training methods have demonstrated exceptional performance in generating realistic synthetic images (in particular, of human faces). However, for 3D objects, GANs still fall short of the success they have had with images. One reason is that, so far, GANs have been applied as 3D convolutional architectures to discrete volumetric representations of 3D objects. In this paper, we propose the first intrinsic GAN architecture operating directly on 3D meshes (named MeshGAN). Both quantitative and qualitative results are provided to show that MeshGAN can be used to generate high-fidelity 3D faces with rich identities and expressions.


1 Introduction

Figure 1: Qualitative reconstruction and generation of the proposed MeshGAN (left: CoMA, right: MeshGAN). (a) Exemplar reconstruction results of MeshGAN. (b) MeshGAN generation of identities and expressions. Please zoom in to see more details.

Over the past few years, deep Convolutional Neural Networks (CNNs) have emerged as the method of choice for the majority of computer vision tasks that require learning from data [40, 16, 32]. While the initial use of CNNs was mainly limited to classification/segmentation tasks [16, 32], the introduction of Generative Adversarial Networks (GANs) [27] has expanded the application of deep convolutional architectures to image generation [37, 3, 6, 56] and image-to-image translation and completion [35, 67, 18]. Recently, strikingly realistic results have been demonstrated by Nvidia using progressive GANs [37].

Given the success of generative models on images, there is naturally keen interest in replicating it for geometric data. In order to make convolutions/de-convolutions feasible, current generative approaches still rely on crude shape approximations. For example, recent approaches either use discrete volumetric representations of 3D shapes, which result in very low-quality meshes [65], or they apply 1D convolutions combined with fully connected layers [1], which do not take into account the local structure of 3D shapes.

Recently, the field of geometric deep learning on non-Euclidean (graph- and manifold-structured) data has gained popularity [11], with numerous works generalizing convolutional architectures directly to meshes. Intrinsic generative models are currently a key open question in geometric deep learning. Intrinsic auto-encoder architectures have recently been proposed for human body [45] and face [57] meshes. Nevertheless, due to the lack of appropriate adversarial training, these auto-encoders retain only the low-pass shape information and lose most of the details. Furthermore, contrary to GANs, they do not offer a principled sampling strategy. As of today, we are not aware of any successful intrinsic GAN for 3D mesh generation.

In this paper, we try to bridge this gap with the following contributions:

  • We present the first intrinsic GAN architecture that generates 3D meshes using convolutions directly on meshes. Compared to approaches based on volumetric [65, 1] or point cloud representations, our MeshGAN is able to generate meshes with a high level of detail.

  • We present the first GAN architecture for 3D face generation. Contrary to the auto-encoder recently proposed in [57], which learns a latent space where identity and expression are mixed, we can generate expressions for arbitrary identities.

  • We conduct quantitative and qualitative experiments to verify the effectiveness of MeshGAN on large-scale 3D facial data.

2 Related Work

2.1 Geometric deep learning

Geometric Deep Learning (GDL) is an emerging field in machine learning attempting to generalize modern deep learning architectures (such as convolutional neural networks) and their underpinning mathematical principles to non-Euclidean domains such as graphs and manifolds (for a comprehensive survey, the reader is referred to the recent review papers [11, 31, 5]).

First formulations of neural networks on graphs [28, 58], preceding the recent renaissance of deep learning, constructed learnable information diffusion processes. This approach has more recently been reformulated using modern tools such as gated recurrent units [43] and neural message passing [26]. Bruna et al. [12, 33] proposed formulating convolution-like operations in the spectral domain defined by the eigenvectors of the graph Laplacian. One of the key drawbacks of this approach, leading to high computational complexity, is the necessity to explicitly perform the Laplacian eigendecomposition. However, if the spectral filter function can be expressed in terms of simple operations (scalar and matrix multiplications, additions, and inversions), it can be applied directly to the Laplacian, avoiding its explicit eigendecomposition altogether. Notable instances of this approach include ChebNets [22, 39] (using polynomial functions) and CayleyNets [41] (using rational functions); these methods can be generalized to multiple graphs [50] and to directed motif-based graph Laplacians [51] using multivariate polynomials.

Another class of graph CNNs are spatial methods, operating on local neighborhoods of the domain [23, 49, 4, 30, 63]. For meshes, the first such architecture (GCNN) used local geodesic polar charts [47]; alternative constructions were proposed using anisotropic diffusion (ACNN) [10] and learnable Gaussian kernels (MoNet) [49]. SplineCNN [24] uses B-spline kernels instead of Gaussians, offering a significant speed advantage. FeaStNet [64] uses an attention-like soft-assignment mechanism to establish the correspondence between the patch and the filter. Finally, [44] proposed constructing patch operators using a spiral ordering of neighboring vertices.

The majority of the aforementioned works focus on extracting features from non-Euclidean data (e.g., graphs and meshes) for classification purposes, and limited work has been done on training generative models. One of the fundamental differences from classical Euclidean generative models (such as auto-encoders [38] or Generative Adversarial Networks (GANs) [27]) is the lack of a canonical order of vertices between the input and the output graph, which introduces a form of graph correspondence problem to be solved. In this paper, we deal with the problem of 3D mesh generation and representation on a fixed topology. The fixed-topology setting is currently being studied in computer vision and graphics applications and is significantly easier, since the mesh connectivity is assumed given and the vertices are canonically ordered; the generation problem thus amounts only to determining the embedding of the mesh.

The first intrinsic convolutional autoencoder architecture on meshes (MeshVAE) was presented in [45]. The authors used convolutional operators from [64] and showed examples of human body shape completion from partial scans. A follow-up work, CoMA [57], used a similar architecture with spectral Chebyshev filters [22] and additional spatial pooling to generate 3D facial meshes. The authors claim that CoMA represents faces with expressions better than PCA in a very small latent space of only eight dimensions. In this paper, we present the first GAN architecture for generating meshes of 3D faces with a fixed topology.

2.2 Generative adversarial networks

GANs are a promising unsupervised machine learning methodology implemented as a system of two deep neural networks competing against each other in a zero-sum game framework [27]. GANs have become hugely popular owing to their capability of modeling the distribution of visual data and generating new instances that have many realistic characteristics (i.e., preserving high-frequency details) and look authentic to human observers. Currently, GANs are among the top choices for generating visual data, and they are often preferable to auto-encoders and VAEs [46].

Nevertheless, the original GANs were criticized for being difficult to train and prone to mode collapse, and different GAN variants were proposed to tackle these problems. Wasserstein GAN (WGAN) [3] proposed a new loss function based on the Wasserstein distance to stabilize training. In continuation of WGAN, Gulrajani et al. [29] proposed a gradient penalty as an alternative to weight clipping, which improved training convergence and generation quality. Boundary Equilibrium GANs (BEGANs) [6] implement the discriminator as an auto-encoder whose loss is derived from the Wasserstein distance; in addition, an equilibrium-enforcing method was proposed to balance the training of the generator and the discriminator. Chang et al. [15] further proposed a variant of BEGAN with a Constrained Space (BEGAN-CS), improving training stability by adding a latent-space constraint to the loss function. As BEGAN has demonstrated good performance in generating photo-realistic faces, we follow BEGAN and meticulously design a generative network for the realistic generation of 3D faces.

2.3 3D Facial shape representation and generation

For the past two decades, the method of choice for representing and generating 3D faces has been Principal Component Analysis (PCA). PCA was used for building statistical 3D shape models (i.e., 3D Morphable Models (3DMMs)) in many works [54, 53, 7]. Recently, PCA has been adopted for building large-scale statistical models of the 3D face [9] and head [21]. For representing and generating faces, it is very convenient to decouple facial identity variations from expression variations. Hence, statistical blendshape models have been introduced which represent only the expression variations, using PCA [42, 52] or multilinear methods [13, 8]. Some recent efforts represent facial expressions with deep learning using fully connected layers [60, 62]. Fully connected layers have a huge number of parameters and do not take into account the local geometry of the 3D facial surface. The only method that represented faces using convolutions on the mesh domain is the recently proposed mesh auto-encoder CoMA [57]. Nevertheless, the identity and expression latent spaces of CoMA are mixed. Furthermore, the representative power and expressiveness of the model are somewhat limited, as it was trained on only 12 subjects displaying 12 classes of extreme expressions. In this paper, we train deep generative graph convolutional neural networks (DGCNs) using spectral mesh convolutions that individually model identity and expression on large-scale data.

Figure 2: Network architecture of the proposed MeshGAN.
Figure 3: Our pipeline for generating random 3D faces with expressions.

3 Proposed Approach

In this section, we define the mesh convolution operators, describe our encoder and decoder/generator, and lay out the MeshGAN architecture for the non-linear generation of 3D faces.

3.1 Data representation

We represent the facial surface as a manifold triangular mesh $\mathcal{M} = (\mathcal{V}, \mathcal{E}, \mathcal{F})$, where each edge $ij \in \mathcal{E}$ belongs to at most two triangular faces $F_{ijk}$ and $F_{ijh}$ (here, we denote by $\mathcal{E}_{\mathrm{int}}$ and $\mathcal{E}_{\mathrm{bnd}}$ the interior and boundary edges, respectively). An embedding of $\mathcal{M}$ is realised by assigning 3D coordinates to the vertices $\mathcal{V}$, which are encoded as an $n \times 3$ matrix $\mathbf{X}$ containing the vertex coordinates as rows. The discrete Riemannian metric is defined by assigning a length $\ell_{ij} = \|\mathbf{x}_i - \mathbf{x}_j\|_2$ to each edge $ij \in \mathcal{E}$.

The Laplacian operator is discretised (using the distance-based equivalent of the cotangent formula [36, 48]) as an $n \times n$ matrix $\boldsymbol{\Delta} = \mathbf{A}^{-1}\mathbf{W}$, where $\mathbf{A} = \mathrm{diag}(a_1, \ldots, a_n)$ is a diagonal matrix of local area elements $a_i$, and $\mathbf{W}$ is a symmetric matrix of edge-wise weights, defined in terms of the discrete metric:

$$w_{ij} = \tfrac{1}{2}\left(\cot \alpha_{ij} + \cot \beta_{ij}\right), \quad ij \in \mathcal{E}_{\mathrm{int}},$$

where $\alpha_{ij}$ and $\beta_{ij}$ denote the angles opposite edge $ij$ in the two adjacent triangles (for boundary edges, only one term is kept).

The Laplacian admits an eigendecomposition $\mathbf{W}\boldsymbol{\phi}_k = \lambda_k \mathbf{A}\boldsymbol{\phi}_k$ with $\mathbf{A}$-orthonormal eigenvectors $\boldsymbol{\Phi} = (\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n)$, $\boldsymbol{\Phi}^\top \mathbf{A} \boldsymbol{\Phi} = \mathbf{I}$, and non-negative eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \ldots \leq \lambda_n$, arranged into a diagonal matrix $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.
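For concreteness, the construction above can be sketched in a few lines of NumPy/SciPy: build the symmetric cotangent weight matrix $\mathbf{W}$ and the lumped area matrix $\mathbf{A}$, then solve the generalised eigenproblem $\mathbf{W}\boldsymbol{\phi} = \lambda\mathbf{A}\boldsymbol{\phi}$. This is a minimal sketch under the definitions above; the array names (verts, faces) and the number of requested eigenpairs are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def cotangent_laplacian(verts, faces):
    """Return W (stiffness/cotangent weights) and A (diagonal vertex areas)."""
    n = verts.shape[0]
    W = sp.lil_matrix((n, n))
    areas = np.zeros(n)
    for i, j, k in faces:
        # accumulate cotangent weights: the angle at c is opposite edge (a, b)
        for a, b, c in [(i, j, k), (j, k, i), (k, i, j)]:
            u, v = verts[a] - verts[c], verts[b] - verts[c]
            cot = np.dot(u, v) / np.linalg.norm(np.cross(u, v))
            W[a, b] += 0.5 * cot
            W[b, a] += 0.5 * cot
        # lumped mass: one third of the triangle area goes to each corner vertex
        area = 0.5 * np.linalg.norm(np.cross(verts[j] - verts[i], verts[k] - verts[i]))
        areas[[i, j, k]] += area / 3.0
    W = sp.csr_matrix(W)
    W = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W   # stiffness matrix
    return W, sp.diags(areas)

# A-orthonormal eigenvectors (Fourier atoms) and non-negative frequencies:
# lams, Phi = eigsh(W, k=50, M=A, sigma=-0.01)   # 50 smallest eigenpairs
```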

3.2 Spectral mesh convolutions

Let $f : \mathcal{V} \to \mathbb{R}$ be a scalar real function defined on the vertices of the mesh, represented as an $n$-dimensional vector $\mathbf{f} = (f_1, \ldots, f_n)^\top$. The space of such functions is a Hilbert space with the standard inner product $\langle \mathbf{f}, \mathbf{g} \rangle = \mathbf{f}^\top \mathbf{A} \mathbf{g}$. The eigenvectors of the Laplacian form an orthonormal basis in the aforementioned Hilbert space, allowing a Fourier decomposition of the form $\mathbf{f} = \boldsymbol{\Phi}\hat{\mathbf{f}}$, where $\hat{\mathbf{f}} = \boldsymbol{\Phi}^\top \mathbf{A}\mathbf{f}$ is the Fourier transform of $\mathbf{f}$. The Laplacian eigenvectors thus play the role of the standard Fourier atoms, and the corresponding eigenvalues that of the respective frequencies. Finally, a convolution operation can be defined in the spectral domain by analogy to the Euclidean case as $\mathbf{f} \star \mathbf{g} = \boldsymbol{\Phi}(\hat{\mathbf{f}} \odot \hat{\mathbf{g}})$, where $\odot$ denotes the element-wise product.
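As a sanity check, the Fourier transform and the spectral convolution above can be realised directly (if inefficiently); a minimal sketch assuming Phi and A come from the eigendecomposition of Sec. 3.1, with g_hat a vector of spectral multipliers:

```python
import numpy as np

def mesh_fourier(Phi, A, f):
    """Forward mesh Fourier transform: f_hat = Phi^T A f."""
    return Phi.T @ (A @ f)

def spectral_conv(Phi, A, f, g_hat):
    """f * g = Phi (f_hat ⊙ g_hat), with the filter given by its multipliers."""
    return Phi @ (g_hat * mesh_fourier(Phi, A, f))
```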

Spectral graph CNNs. Bruna et al. [12] exploited the above formulation for designing graph convolutional neural networks, in which a basic spectral convolution operation has the form $\mathbf{g} = \boldsymbol{\Phi}\hat{\mathbf{G}}\boldsymbol{\Phi}^\top \mathbf{f}$, where $\hat{\mathbf{G}} = \mathrm{diag}(\hat{g}_1, \ldots, \hat{g}_n)$ is a diagonal matrix of spectral multipliers representing the filter and $\mathbf{g}$ is the filter output. Notable drawbacks of this architecture, putting it at a clear disadvantage compared to classical Euclidean CNNs, are: high computational complexity ($\mathcal{O}(n^2)$ due to the cost of computing the forward and inverse graph Fourier transforms, incurring dense matrix multiplication), $\mathcal{O}(n)$ parameters per layer, and no guarantee of spatial localisation of the filters.

ChebNet. Defferrard et al. [22] considered the spectral CNN framework with polynomial filters represented in the Chebyshev basis, $g_\theta(\tilde{\lambda}) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\lambda})$, where $T_k(\lambda) = 2\lambda T_{k-1}(\lambda) - T_{k-2}(\lambda)$ denotes the Chebyshev polynomial of degree $k$, with $T_1(\lambda) = \lambda$ and $T_0(\lambda) = 1$. A single filter of this form can be efficiently computed by applying powers of the Laplacian to the feature vector,

$$\mathbf{g} = g_\theta(\tilde{\boldsymbol{\Delta}})\,\mathbf{f} = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{\boldsymbol{\Delta}})\,\mathbf{f}, \qquad (1)$$

thus avoiding its eigendecomposition altogether. Here $\tilde{\lambda} = 2\lambda_n^{-1}\lambda - 1$ is a frequency rescaled in $[-1, 1]$, and $\tilde{\boldsymbol{\Delta}} = 2\lambda_n^{-1}\boldsymbol{\Delta} - \mathbf{I}$ is the rescaled Laplacian with eigenvalues $\tilde{\boldsymbol{\Lambda}} = 2\lambda_n^{-1}\boldsymbol{\Lambda} - \mathbf{I}$. The computational complexity thus drops from $\mathcal{O}(n^2)$, as in the case of spectral CNNs, to $\mathcal{O}(n)$, since the mesh is sparsely connected.
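In code, Eq. (1) reduces to the three-term Chebyshev recurrence applied to the (sparse) rescaled Laplacian, so the eigenbasis is never formed. A minimal sketch, assuming L_tilde has been precomputed and $K \geq 2$:

```python
def cheb_filter(L_tilde, f, theta):
    """Apply the K-term Chebyshev filter of Eq. (1); theta has shape (K,).
    L_tilde is the sparse rescaled Laplacian, f a signal on the vertices."""
    t_prev, t_curr = f, L_tilde @ f                  # T_0 f and T_1 f
    out = theta[0] * t_prev + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (L_tilde @ t_curr) - t_prev   # T_k = 2x T_{k-1} - T_{k-2}
        out = out + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return out
```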

3.3 MeshGAN

We introduce MeshGAN, a variant of BEGAN [6] that can learn a non-linear 3DMM directly from 3D meshes. Specifically, we employ the aforementioned ChebNet to build our discriminator $D$ and generator $G$.

3.3.1 Boundary equilibrium generative adversarial networks

The main difference between BEGAN and typical GANs is that BEGAN uses an auto-encoder as the discriminator, as it tries to match the auto-encoder loss distributions rather than the data distributions. This is achieved by adding an extra equilibrium term $\gamma \in [0, 1]$. More precisely, this hyper-parameter is used to maintain the balance between the loss expectations of the discriminator and the generator (i.e., $\gamma = \mathbb{E}[\mathcal{L}(G(\mathbf{z}))]/\mathbb{E}[\mathcal{L}(\mathbf{x})]$). The training objective of BEGAN is as follows:

$$\begin{cases} \mathcal{L}_D = \mathcal{L}(\mathbf{x}) - k_t\,\mathcal{L}(G(\mathbf{z})) & \text{minimised for } \theta_D, \\ \mathcal{L}_G = \mathcal{L}(G(\mathbf{z})) & \text{minimised for } \theta_G, \\ k_{t+1} = k_t + \lambda_k\left(\gamma\,\mathcal{L}(\mathbf{x}) - \mathcal{L}(G(\mathbf{z}))\right) & \text{at each training step } t, \end{cases}$$

where $\mathbf{z} \in [-1, 1]^{N_z}$ is a uniform random vector of dimension $N_z$ (i.e., the latent vector of the generator), and $\theta_D$ and $\theta_G$ are the trainable parameters of the discriminator and generator, respectively; $\mathcal{L}$ is the discriminator's auto-encoder reconstruction loss, for which we select the $\ell_1$ loss in this paper. In each training step $t$, the variable $k_t \in [0, 1]$ is utilised to control the influence of the fake loss $\mathcal{L}(G(\mathbf{z}))$ on the discriminator; $\lambda_k$ can be regarded as the learning rate of $k$, which is set to 0.001. Berthelot et al. [6] found that $\gamma$ has a decisive impact on the diversity of the generated images: lower values tend to produce images resembling the mean face. To encourage more variation, we empirically set $\gamma$ to 0.7.
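The objective above translates into a simple alternating update; below is a minimal PyTorch sketch of one BEGAN training step. The function and optimiser names and the latent dimension are illustrative assumptions, not the authors' released code.

```python
import torch

def began_step(D, G, x_real, opt_d, opt_g, k_t, gamma=0.7, lambda_k=1e-3, nz=64):
    z = torch.rand(x_real.size(0), nz) * 2.0 - 1.0      # z ~ U[-1, 1]^nz
    x_fake = G(z)

    # L(.) is the discriminator's l1 auto-encoder reconstruction loss
    loss_real = (D(x_real) - x_real).abs().mean()
    loss_fake = (D(x_fake.detach()) - x_fake.detach()).abs().mean()

    opt_d.zero_grad()
    (loss_real - k_t * loss_fake).backward()            # L_D = L(x) - k_t L(G(z))
    opt_d.step()

    opt_g.zero_grad()
    loss_g = (D(x_fake) - x_fake).abs().mean()          # L_G = L(G(z))
    loss_g.backward()
    opt_g.step()

    # k_{t+1} = k_t + lambda_k (gamma L(x) - L(G(z))), clamped to [0, 1]
    k_t += lambda_k * (gamma * loss_real.item() - loss_g.item())
    return min(max(k_t, 0.0), 1.0)
```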

3.3.2 MeshGAN architecture

Based on the architecture of BEGAN, we developed MeshGAN using ChebNet [22, 39]. The architecture of MeshGAN is illustrated in Fig. 2. We follow a design similar to CoMA for building our encoder and generator/decoder: 4 Chebyshev convolutional filters with $K = 6$ polynomials are used in the encoder. Nevertheless, after each convolutional layer, we select ELU [19] as the activation function to allow the passing of negative values. The mesh down-sampling step is performed by the surface simplification method of [25], which minimises the quadric error when decimating the template. Up-sampling of the template is based on the barycentric coordinates of the contracted vertices in the decimated mesh [57]. In total, we perform 4 levels of down-sampling, with each level lowering the number of vertices by approximately 4 times. To allow for more representation power, we set the bottleneck of the discriminator to 64, equal to the dimension of the feature embedding in the generator. The Momentum optimizer [55] is employed with a decaying learning rate. We train all the models for 300 epochs. Note that skip connections between the output of the fully connected layer and each up-sampled graph can be applied to encourage more facial details.
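To make the decoder/generator stack concrete, here is a hedged PyTorch sketch of a CoMA-style generator: a fully connected layer lifts the 64-d latent code to the coarsest mesh, followed by four levels of up-sampling, Chebyshev convolution, and ELU. The channel widths, the dense per-level matrices, and the class names are illustrative assumptions; the exact configuration may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChebConv(nn.Module):
    """Chebyshev graph convolution (Eq. 1) with K polynomial terms."""
    def __init__(self, in_ch, out_ch, K=6):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(K, in_ch, out_ch))

    def forward(self, x, L_tilde):            # x: (batch, n, in_ch)
        terms = [x, L_tilde @ x]              # T_0 f and T_1 f
        for _ in range(self.weight.shape[0] - 2):
            terms.append(2.0 * (L_tilde @ terms[-1]) - terms[-2])
        return sum(t @ w for t, w in zip(terms, self.weight))

class MeshDecoder(nn.Module):
    def __init__(self, n_coarse, ups, laps, widths=(32, 32, 16, 16, 3)):
        super().__init__()
        self.ups, self.laps = ups, laps       # per-level dense matrices
        self.n_coarse, self.w0 = n_coarse, widths[0]
        self.fc = nn.Linear(64, n_coarse * widths[0])   # lift the latent code
        self.convs = nn.ModuleList(
            [ChebConv(widths[i], widths[i + 1]) for i in range(4)])

    def forward(self, z):
        x = self.fc(z).view(-1, self.n_coarse, self.w0)
        for i, (up, lap, conv) in enumerate(zip(self.ups, self.laps, self.convs)):
            x = conv(up @ x, lap)             # barycentric up-sampling, then filter
            if i < len(self.convs) - 1:
                x = F.elu(x)                  # ELU lets negative values pass
        return x                              # (batch, n_vertices, 3)
```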

4 Experiments

4.1 3D face databases

3dMD: For identity model training, we used recently collected 3dMD datasets scanned with the high-resolution 3dMD device (http://www.3dmd.com/). We selected around 12,000 unique identities from this database, covering different ethnic groups (i.e., Chinese, Caucasian, and Black) and age groups.

4DFAB: To train the expression models, we use the 4DFAB database [17], the largest dynamic 3D face database containing both posed and spontaneous expressions. In 4DFAB, participants were invited to attend four experimental sessions at different times. In each session, participants were asked to articulate 6 basic facial expressions and then watched several emotional videos. Annotations of the apex posed-expression frames, as well as the expression categories of the spontaneous sequences, are provided. To ensure the richness of expressions in our training set, we randomly sampled 6,651 apex posed-expression meshes and 7,567 spontaneous-expression meshes from 4DFAB.

For each database, we train the CoMA and MeshGAN models on the corresponding data. We label the models trained on the 3dMD database with -ID, whereas the models trained on the 4DFAB database are appended with -EXP.

4.1.1 Data pre-processing

To balance the fineness and complexity of the model, we cropped and decimated the LSFM model [9] and generated a new 3D template with 5,036 vertices. To bring all the data into dense correspondence with the template, we employed Non-rigid ICP [2] to register each mesh. We automatically detected 79 3D facial landmarks with the UV-based alignment method developed in [17] and utilised these landmarks to assist the dense registration. Unless otherwise stated, we divided each database into training and testing sets with a split ratio of 9:1.

On a separate note, in order to train the expression models, we need to decouple facial identity from every expression mesh in 4DFAB. This was achieved by manually selecting one neutral face per subject per session in 4DFAB and subtracting the corresponding neutral face from each expression mesh to obtain the facial deformation. We then exerted this deformation on the 3D template to generate a training set with pure expressions. Note that a local surface-preserving smoothing step [61] was undertaken to further remove identity information as well as noise.
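The decoupling amounts to a per-vertex displacement transfer between registered meshes; a minimal sketch (the smoothing step of [61] is omitted here):

```python
import numpy as np

def transfer_expression(expr_verts, neutral_verts, template_verts):
    """All inputs are (n_vertices, 3) arrays in dense correspondence."""
    deformation = expr_verts - neutral_verts   # per-vertex displacement field
    return template_verts + deformation        # "pure expression" training mesh
```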

Figure 4: Extrapolation of the identity model. The first 2 rows are examples of exaggerating ethnicity (Black and Chinese subjects). The last 2 rows display exaggerated ages (children and elderly people). Please see our supplementary material for extrapolation results on gender.
Figure 5: Extrapolation of the expression model. The first 2 rows are examples of exaggerating anger and disgust. The last 2 rows are extrapolations of sadness and happiness. Please see our supplementary material for other expressions.
Methods       Generalisation   Specificity     FID
CoMA-ID       0.442 ± 0.116    1.60 ± 0.228    14.24
MeshGAN-ID    0.465 ± 0.189    1.433 ± 0.144   10.82

Table 1: Intrinsic evaluation of identity models. Average generalisation and specificity errors are measured in mm.

Methods       Generalisation   Specificity     FID
CoMA-EXP      0.606 ± 0.203    1.899 ± 0.272   22.43
MeshGAN-EXP   0.605 ± 0.264    1.536 ± 0.153   13.59

Table 2: Intrinsic evaluation of expression models. Average generalisation and specificity errors are measured in mm.

Figure 6: 3D facial expression recognition on exaggerated expressions generated by extrapolating the latent spaces of CoMA and MeshGAN.

4.2 Intrinsic evaluation of MeshGAN

We give a quantitative evaluation of MeshGAN's generator, whose counterpart is the decoder of CoMA. The intrinsic characteristics of the models include generalisation capability, specificity [14, 8], and the FID score [34].

Generalisation. The generalisation measures the ability of a model to represent/reconstruct unseen face shapes that are not present during training. To compute the generalisation error, we computed the per-vertex Euclidean distance between every sample $\mathbf{X}$ of the test set and its corresponding reconstruction $G(\mathbf{z}^*)$ by the generator $G$, where the feature embedding $\mathbf{z}^*$ of $\mathbf{X}$ is estimated as

$$\mathbf{z}^* = \operatorname*{arg\,min}_{\mathbf{z}}\; \left\| G(\mathbf{z}) - \mathbf{X} \right\|_2^2 . \qquad (2)$$

After that, we took the average value over all vertices and all test samples. This procedure was conducted separately on the identity and expression models. We report the mean and standard deviation of the reconstruction errors in Table 1 and Table 2. It can be seen that both methods achieved similar performance in reconstructing facial expressions (MeshGAN-EXP achieved 0.605mm, while CoMA-EXP produced 0.606mm), whereas CoMA is slightly better at describing unseen identities (0.023mm lower in error). This is probably attributable to the fact that an auto-encoder is specifically trained to reconstruct data examples, while BEGAN is not. We leave this to future investigation and refer the reader to [20, 66].
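The generalisation metric can be sketched as below, where reconstruct stands in for either CoMA's encode-decode pass or the latent optimisation of Eq. (2) for MeshGAN:

```python
import numpy as np

def generalisation_error(test_meshes, reconstruct):
    """Mean/std of per-vertex Euclidean distances over the whole test set."""
    errs = []
    for X in test_meshes:                      # X: (n_vertices, 3)
        X_rec = reconstruct(X)
        errs.append(np.linalg.norm(X - X_rec, axis=1))
    errs = np.concatenate(errs)
    return errs.mean(), errs.std()
```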

Specificity. The specificity of a model evaluates the validity of the generated faces. For each model, we randomly synthesised 10,000 faces and measured their proximity to the real faces in the test set. More precisely, for every randomly generated face, we found its nearest neighbour in the test set in terms of the minimum (over all samples of the test set) average per-vertex distance. We record the mean and standard deviation of this distance over all random samples as the specificity error. Note that we randomly sampled MeshGAN with the uniform distribution $\mathcal{U}(-1, 1)$, whereas we facilitated CoMA with a multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ estimated from the feature embeddings of the training data in CoMA (using Eq. 2). Table 1 and Table 2 also display the specificity errors for the different models. We observe that in all cases MeshGAN attains particularly low errors compared with CoMA, i.e., 0.17mm lower for identity and 0.36mm lower for expression. This is quantitative evidence that the synthetic faces generated by the MeshGAN models are more realistic than those of CoMA.
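A brute-force sketch of the specificity computation follows; it is illustrative only (batched distances or a spatial index would be used for 10,000 samples in practice):

```python
import numpy as np

def specificity(generated, test_meshes):
    """Mean/std of nearest-neighbour distances from samples to the test set."""
    dists = []
    for Y in generated:                                    # Y: (n_vertices, 3)
        per_mesh = [np.linalg.norm(Y - X, axis=1).mean()   # avg per-vertex dist
                    for X in test_meshes]
        dists.append(min(per_mesh))                        # nearest test sample
    return np.mean(dists), np.std(dists)
```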

Fréchet Inception Distance (FID). FID [34] is a reliable measure of the quality and diversity of the images generated by GANs. To compute the FID score, we used the pre-trained Inception network [59] to extract features from an intermediate layer and then modelled the distribution of these features with a multivariate Gaussian. As the Inception network is trained on 2D images, we rasterised each 3D mesh (with Lambertian shading) into a 64×64 image and fed it to the network. The FID score between the real and generated images is computed as:

$$\mathrm{FID} = \|\boldsymbol{\mu}_r - \boldsymbol{\mu}_g\|_2^2 + \mathrm{Tr}\!\left(\boldsymbol{\Sigma}_r + \boldsymbol{\Sigma}_g - 2\left(\boldsymbol{\Sigma}_r \boldsymbol{\Sigma}_g\right)^{1/2}\right),$$

where $\mathcal{N}(\boldsymbol{\mu}_r, \boldsymbol{\Sigma}_r)$ and $\mathcal{N}(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$ are the multivariate Gaussians estimated from the Inception features of the real and generated images, respectively. The smaller the FID value, the better the image quality and diversity. It has to be mentioned that, when sampling the latent space of CoMA, we did not estimate the multivariate Gaussian beforehand, as the training data distribution is not supposed to be revealed here. Hence, we used a standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$ to sample the latent space of CoMA, while for MeshGAN we always use the uniform distribution $\mathcal{U}(-1, 1)$. We show the FID scores of CoMA and MeshGAN in Table 1 and Table 2. We can observe that the FID scores of MeshGAN are significantly lower than those of CoMA in both cases. This is further strong evidence that MeshGAN can generate meshes with richer variation and better quality than auto-encoders.
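The formula translates directly to code; a sketch assuming the two feature Gaussians have already been estimated from the Inception activations:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    covmean = sqrtm(sigma_r @ sigma_g).real    # discard tiny imaginary noise
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```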

As a matter of fact, we also experimented with different GANs (the original GAN [27], WGAN [3], and BEGAN-CS [15]) using the same architecture as MeshGAN. Unfortunately, they did not achieve performance comparable to CoMA or BEGAN. Due to limited space, we place this ablation study in the supplementary material.

Figure 7: Interpolation and extrapolation between two generated meshes (i.e., the anchor meshes marked with red boxes). The blue box contains the interpolation results between the two anchor meshes; the faces outside it are extrapolation results.

4.3 Extrapolating identity and expression model

We first extrapolated the latent vector of the identity model and visualised the exaggerated synthetic examples. Given a pair of meshes $\mathbf{X}_1$ and $\mathbf{X}_2$, we estimated their feature embeddings (denoted as $\mathbf{z}_1$ and $\mathbf{z}_2$) using Eq. 2. After that, we computed the extrapolated latent vector $\mathbf{z}_\alpha$ using a non-convex combination of the two vectors:

$$\mathbf{z}_\alpha = (1 - \alpha)\,\mathbf{z}_1 + \alpha\,\mathbf{z}_2, \qquad \alpha > 1. \qquad (3)$$

Here, we fixed mesh $\mathbf{X}_1$ to be the neutral template, while $\mathbf{X}_2$ was the target face reconstructed by MeshGAN and CoMA, separately. Fig. 4 shows the extrapolation results of the identity model in terms of ethnicity and age (note that we increased $\alpha$ from 1 to 3). We can clearly observe that: (a) MeshGAN better describes the subtle facial details (e.g., eyes and lips); (b) CoMA produces highly distorted and grotesque faces (e.g., disproportionate noses, incorrect exaggeration of ethnicity and age) as the extrapolation proceeds, whereas MeshGAN does not suffer from such issues.
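Eq. (3) is a one-line operation on latent codes; a sketch, where G denotes the trained generator and z1, z2 are the embeddings estimated via Eq. (2):

```python
import numpy as np

def blend_latent(z1, z2, alpha):
    """alpha in (0, 1) interpolates; alpha > 1 extrapolates beyond z2 (Eq. 3)."""
    return (1.0 - alpha) * z1 + alpha * z2

# Exaggeration as in Fig. 4: sweep alpha from 1 to 3 away from the neutral code
# faces = [G(blend_latent(z_neutral, z_target, a)) for a in np.linspace(1, 3, 5)]
```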

For the extrapolation of the expression models, we followed the same approach; the results are shown in Fig. 5. Clearly, MeshGAN is more capable of representing different facial expressions, especially facial muscle movements (e.g., disgust in the first row). Compared with CoMA, the exaggerated expressions from MeshGAN remain meaningful and realistic. To quantitatively evaluate the semantic correctness of the exaggerated expressions, we trained a 3D expression classifier using SplineCNN [24]. We built this FER network with 4 convolutional layers: SConv($k$,1,16) → Pool(4) → SConv($k$,16,16) → Pool(4) → SConv($k$,16,16) → Pool(4) → SConv($k$,16,32) → Pool(4) → FC(6), where $k$ denotes the B-spline kernel size. ELU [19] is used after each convolutional and fully connected layer. We trained the network for 80 epochs, with the learning rate and batch size equal to 0.0001 and 16, respectively. The Pool() operation is exactly the same as in MeshGAN. For FER training, we prepared around 6k posed expression meshes (6 expressions, each with nearly 1k samples) from 4DFAB, none of which are present in the training set of the expression model. We tested the exaggerated expressions produced by different extrapolation factors $\alpha$ (ranging from 1 to 3), and plotted the recognition rate for each $\alpha$ as a curve in Fig. 6. Interestingly, as the degree of extrapolation increases, the recognition rate for CoMA declines drastically, while that of MeshGAN decreases comparatively slowly. This further shows that MeshGAN can still provide meaningful expressions even when sampling beyond the normal range.

4.4 Qualitative results

We used the pipeline in Fig. 3 to generate 3D identities with expressions. Qualitative results are shown in Fig. 1 (b). To visualise the interpolation and extrapolation between/beyond two faces, we synthesised two identities with different expressions and used them as the anchor faces. Following Eq. 3, we varied the parameters of the identity and expression models by separate factors $\alpha_{\mathrm{id}}$ and $\alpha_{\mathrm{exp}}$. Using the grid of interpolated/extrapolated parameters, we synthesised the corresponding faces and display them in Fig. 7.

5 Conclusion

We presented the first GAN capable of generating 3D facial meshes of different identities and expressions. We experimentally demonstrated that the proposed MeshGAN can generate 3D facial meshes with more subtle details than state-of-the-art auto-encoders. Finally, we showed that the proposed MeshGAN models the distribution of faces better than auto-encoders, and hence leads to better sampling strategies.

References