Convolutional Mesh Autoencoders for Generating 3D Faces
Learned 3D representations of human faces are useful for computer vision problems such as 3D face tracking and reconstruction from images, as well as graphics applications such as character generation and animation. Traditional models learn a latent representation of a face using linear subspaces or higher-order tensor generalizations. Due to this linearity, they cannot capture extreme deformations and non-linear expressions. To address this, we introduce a versatile model that learns a non-linear representation of a face using spectral convolutions on a mesh surface. We introduce mesh sampling operations that enable a hierarchical mesh representation that captures non-linear variations in shape and expression at multiple scales within the model. In a variational setting, our model samples diverse realistic 3D faces from a multivariate Gaussian distribution. Our training data consists of 20,466 meshes of extreme expressions captured over 12 different subjects. Despite limited training data, our trained model outperforms state-of-the-art face models with 50% lower reconstruction error, while using 75% fewer parameters. We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower reconstruction error. Our data, model and code are available at http://github.com/anuragranj/coma.
The human face is highly variable in shape: it is affected by many factors such as age, sex, and ethnicity, and it deforms significantly with expressions. Existing state-of-the-art 3D face representations mostly use linear transformations [41, 28, 42] or higher-order tensor generalizations [46, 12, 14]. These 3D face models have several applications, including face recognition, generating and animating faces, and monocular 3D face reconstruction. Since these models are linear, they do not capture the non-linear deformations due to extreme facial expressions. These expressions are crucial to capture the realism of a 3D face.
Meanwhile, convolutional neural networks (CNNs) have emerged as rich models for generating images [22, 35] and audio. One reason for their success is the multi-scale hierarchical structure of CNNs, which allows them to learn translation-invariant localized features. Recent works have explored volumetric convolutions for 3D representations. However, volumetric operations require a lot of memory and have been limited to low-resolution 3D volumes. Modeling convolutions on 3D meshes can be memory efficient and allows for processing high-resolution 3D structures. However, CNNs have mostly been successful in Euclidean domains with grid-based structured data, and the generalization of CNNs to meshes is not trivial. Extending CNNs to graph structures and meshes has only recently drawn significant attention [11, 17, 10]. Hierarchical operations in CNNs such as max-pooling and upsampling have not been adapted to meshes. Moreover, training CNNs on 3D facial data is challenging due to the limited size of current 3D datasets. Existing large-scale datasets [14, 16, 50, 49, 38] do not contain high-resolution extreme facial expressions.
To address these problems, we introduce a Convolutional Mesh Autoencoder (CoMA) with novel mesh sampling operations, which preserve the topological structure of the mesh features at different scales in a neural network. We follow the work of Defferrard et al. on generalizing the convolution on graphs using fast Chebyshev filters, and use their formulation for convolving over our facial mesh. We perform spectral decomposition of meshes and apply convolutions directly in frequency space. This makes convolutions memory efficient and makes it feasible to process high-resolution meshes. We combine the convolutions and sampling operations to construct our model in the form of a Convolutional Mesh Autoencoder. We show that CoMA performs much better than state-of-the-art face models at capturing highly non-linear extreme facial expressions with fewer model parameters. Having fewer parameters makes our model more compact and easier to train. This reduction in parameters is attributed to the locally invariant convolutional filters that can be shared over the mesh surface.
We address the problem of data limitation by capturing 20,466 high resolution meshes with extreme facial expressions in a multi-camera active stereo system. Our dataset spans 12 subjects performing 12 different expressions. The expressions are chosen to be complex and asymmetric, with significant deformation in the facial tissue.
In summary, our work introduces a representation that models variations on the mesh surface using a hierarchical multi-scale approach and can generalize to other 3D mesh processing applications. Our main contributions are: 1) we introduce a Convolutional Mesh Autoencoder consisting of mesh down-sampling and mesh up-sampling layers with fast localized convolutional filters defined on the mesh surface; 2) we show that our model accurately represents 3D faces in a low-dimensional latent space, performing 50% better than a PCA model that is used in state-of-the-art face models such as [41, 28, 1, 7, 47]; 3) our autoencoder uses up to 75% fewer parameters than linear PCA models, while being more accurate in terms of reconstruction error; 4) we show that replacing the expression space of a state-of-the-art face model, FLAME, with CoMA improves its reconstruction accuracy; 5) we show that our model can be used in a variational setting to sample a diversity of facial meshes from a known Gaussian distribution; 6) we provide 20,466 frames of complex 3D head meshes from 12 different subjects for a range of extreme facial expressions, along with our code and trained models, for research purposes.
Face Representations. Blanz and Vetter introduced the morphable model, the first generic representation for 3D faces based on principal component analysis (PCA) to describe facial shape and texture variations. We also refer the reader to Brunton et al. for a comprehensive overview of 3D face representations. To date, the Basel Face Model (BFM), i.e. the publicly available variant of the morphable model, is the most widely used representation for 3D face shape in a neutral expression. Booth et al. recently proposed another linear neutral-expression 3D face model learned from face scans of more diverse subjects.
Representing facial expressions with linear spaces, or higher-order generalizations thereof, remains the state of the art. The linear expression basis vectors are either computed using PCA [1, 7, 28, 41, 47] or manually defined using linear blendshapes (e.g. [42, 27, 6]). Yang et al. use multiple PCA models, one per expression; Amberg et al. combine a neutral shape PCA model with a PCA model on the expression residuals from the neutral shape. A similar model with an additional albedo model was used within the Face2Face framework. The recently published FLAME model additionally models head rotation and jaw motion with linear blendskinning and achieves state-of-the-art results. Vlasic et al. introduce multilinear models, i.e., a higher-order generalization of PCA, to model expressive 3D faces. Recently, Fernández et al. proposed an autoencoder with a CNN-based encoder and a multilinear model as a decoder. In contrast to our mesh autoencoder, their encoder operates on depth images rather than directly on meshes. For all these methods, the model parameters globally influence the shape; i.e. each parameter affects all the vertices of the face mesh. Our convolutional mesh autoencoder, however, models localized variations due to the hierarchical multi-scale nature of the convolutions combined with the down- and up-sampling.
To capture localized facial details, Neumann et al. and Ferrari et al. use sparse linear models. Brunton et al. use a hierarchical multi-scale approach by computing localized multilinear models on wavelet coefficients. While Brunton et al. also use a hierarchical multi-scale representation, their method does not use shared parameters across the entire domain. Note that sampling in localized low-dimensional spaces is difficult due to the locality of the facial features; combinations of localized facial features are unlikely to form plausible global face shapes. One goal of our work is to generate new face meshes by sampling the latent space; we therefore design our autoencoder to use a single low-dimensional latent space.
Jackson et al. use a volumetric face representation in their CNN-based framework. In contrast to existing face representation methods, our mesh autoencoder uses convolutional layers to represent faces with significantly fewer parameters. Since it is defined completely on the mesh space, it avoids the memory constraints that affect volumetric convolutional methods for representing 3D models.
Convolutional Networks. Bronstein et al. give a comprehensive overview of generalizations of CNNs on non-Euclidean domains, including meshes and graphs. Masci et al. define the first mesh convolutions by locally parameterizing the surface around each point using geodesic polar coordinates, and defining convolutions on the resulting angular bins. In a follow-up work, Boscaini et al. parametrize local intrinsic patches around each point using anisotropic heat kernels. Monti et al. introduce $d$-dimensional pseudo-coordinates that define a local system around each point with weight functions. This method resembles the intrinsic mesh convolutions of Masci et al. and Boscaini et al. for specific choices of the weight functions. In contrast, Monti et al. use Gaussian kernels with a trainable mean vector and covariance matrix as weight functions.
Verma et al. present dynamic filtering on graphs, where the filter weights depend on the inputs. This work does not focus on reducing the dimensionality of graphs or meshes. Yi et al. also present a spectral CNN for labeling nodes, which does not involve any mesh dimensionality reduction. Sinha et al. and Maron et al. embed mesh surfaces into planar images to apply conventional CNNs. Sinha et al. use a robust spherical parametrization to project the surface onto an octahedron, which is then cut and unfolded to form a square image. Maron et al. introduce a conformal mapping from the mesh surface into a flat torus. Litany et al. use graph convolutions for shape completion.
Although the above methods present generalizations of convolutions on meshes, they do not use a structure to reduce the meshes to a low-dimensional space. Our proposed autoencoder addresses this by combining the mesh convolutions with efficient mesh down-sampling and mesh up-sampling operators.
Bruna et al. propose the first generalization of CNNs on graphs by exploiting the connection of the graph Laplacian and the Fourier basis (see Section 3 for more details). This leads to spectral filters that generalize graph convolutions. Boscaini et al. extend this using a windowed Fourier transform to localize in frequency space. Henaff et al. build upon the work of Bruna et al. by adding a procedure to estimate the structure of the graph. To reduce the computational complexity of the spectral graph convolutions, Defferrard et al. approximate the spectral filters by truncated Chebyshev polynomials, which avoids explicitly computing the Laplacian eigenvectors, and introduce an efficient pooling operator for graphs. Kipf and Welling simplify this using only first-order Chebyshev polynomials.
However, these graph CNNs are not directly applied to 3D meshes. CoMA uses truncated Chebyshev polynomials as mesh convolutions. In addition, we define mesh down-sampling and up-sampling layers to obtain a complete mesh autoencoder structure to represent highly complex 3D faces, obtaining state-of-the-art results in 3D face modeling.
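We define a 3D facial mesh as a set of vertices and edges, $\mathcal{F} = (\mathcal{V}, A)$, with $|\mathcal{V}| = n$ vertices that lie in 3D Euclidean space, $\mathcal{V} \in \mathbb{R}^{n \times 3}$. The sparse adjacency matrix $A \in \{0, 1\}^{n \times n}$ represents the edge connections, where $A_{ij} = 1$ denotes an edge connecting vertices $i$ and $j$, and $A_{ij} = 0$ otherwise. The non-normalized graph Laplacian is defined as $L = D - A$, with the diagonal matrix $D$ that represents the degree of each vertex in $\mathcal{V}$ as $D_{ii} = \sum_j A_{ij}$.
The Laplacian is diagonalized by the Fourier basis $U \in \mathbb{R}^{n \times n}$ (as $L$ is a real symmetric matrix) as $L = U \Lambda U^T$, where the columns of $U = [u_0, u_1, \ldots, u_{n-1}]$ are the orthogonal eigenvectors of $L$, and $\Lambda = \mathrm{diag}([\lambda_0, \lambda_1, \ldots, \lambda_{n-1}])$ is a diagonal matrix with the associated real, non-negative eigenvalues. The graph Fourier transform of the mesh vertices $x \in \mathbb{R}^{n \times 3}$ is then defined as $x_\omega = U^T x$, and the inverse Fourier transform as $x = U x_\omega$.
The convolution operator $*$ can be defined in Fourier space as a Hadamard product, $x * y = U((U^T x) \odot (U^T y))$. This is computationally expensive with large numbers of vertices, since $U$ is not sparse. The problem is addressed by formulating mesh filtering with a kernel $g_\theta$ using a recursive Chebyshev polynomial [17, 23]. The filter $g_\theta$ is parametrized as a Chebyshev polynomial of order $K$ given by
$$g_\theta(L) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}),$$
where $\tilde{L} = 2L/\lambda_{max} - I_n$ is the scaled Laplacian, the parameter $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients, and $T_k \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ that can be computed recursively as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0 = 1$ and $T_1 = x$. Following Defferrard et al., the spectral convolution can then be defined as
$$y_j = \sum_{i=1}^{F_{in}} g_{\theta_{i,j}}(L)\, x_i \in \mathbb{R}^n,$$
where $y_j$ computes the $j$-th feature of $y \in \mathbb{R}^{n \times F_{out}}$. The input $x \in \mathbb{R}^{n \times F_{in}}$ has $F_{in}$ features. The input face mesh has $F_{in} = 3$ features corresponding to its 3D vertex positions. Each convolutional layer has $F_{in} \times F_{out}$ vectors of Chebyshev coefficients, $\theta_{i,j} \in \mathbb{R}^K$, as trainable parameters.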
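The Chebyshev recurrence described above can be illustrated on a toy graph. The following is a minimal NumPy sketch, not the authors' implementation: the function name `chebyshev_filter`, the four-vertex mesh, and the coefficient values are all made up for illustration, and a single coefficient vector is shared by all three coordinate channels instead of one vector per input/output feature pair.

```python
import numpy as np

def chebyshev_filter(L, x, theta):
    """Return sum_k theta[k] * T_k(L_scaled) @ x via the Chebyshev recurrence."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()         # dense eigenvalues: fine for a toy mesh only
    L_scaled = 2.0 * L / lam_max - np.eye(n)      # rescale the spectrum to [-1, 1]
    t_prev, t_curr = x, L_scaled @ x              # T_0 x = x and T_1 x = L_scaled x
    y = theta[0] * t_prev
    if len(theta) > 1:
        y = y + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * L_scaled @ t_curr - t_prev  # T_k = 2 L_scaled T_{k-1} - T_{k-2}
        y = y + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return y

# Toy mesh (hypothetical): a square split into two triangles, 4 vertices and 5 edges.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                    # non-normalized Laplacian D - A
x = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]])                   # 3D vertex positions as input features
theta = np.array([0.5, 0.3, 0.2])                 # illustrative coefficients, K = 3
y = chebyshev_filter(L, x, theta)
```

The recurrence only needs products with the scaled Laplacian, which is sparse for a mesh, so the eigenvectors never have to be formed; the dense eigenvalue call here is used only to obtain the largest eigenvalue on a tiny example.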
In order to capture both global and local context, we seek a hierarchical multi-scale representation of the mesh. This allows convolutional kernels to capture local context in the shallow layers and global context in the deeper layers of the network. To obtain this representation, we introduce mesh sampling operators that define the down-sampling and up-sampling of a mesh feature in a neural network. A mesh feature with $n$ vertices can be represented using an $n \times F$ tensor, where $F$ is the dimensionality of each vertex. A 3D mesh is represented with $F = 3$. However, applying convolutions to the mesh can result in features with different dimensionality. The mesh sampling operations define a new topological structure at each layer and maintain the context on neighborhood vertices. We now describe our sampling method with an overview as shown in Figure 1.
We perform the in-network down-sampling of a mesh with $m$ vertices using transform matrices $Q_d \in \{0, 1\}^{n \times m}$, and up-sampling using $Q_u \in \mathbb{R}^{m \times n}$, where $m > n$. The down-sampling is obtained by iteratively contracting vertex pairs that maintain surface error approximations using quadric matrices. In Figure 1(a), the red vertices are contracted during the down-sampling operation. The (blue) vertices after down-sampling are a subset of the original mesh vertices, $\mathcal{V}_d \subset \mathcal{V}$. Each weight $Q_d(p, q) \in \{0, 1\}$ denotes whether the $q$-th vertex is kept during down-sampling, $Q_d(p, q) = 1$, or discarded, where $Q_d(p, q) = 0\ \forall p$.
Since a loss-less down-sampling and up-sampling is not feasible for general surfaces, the up-sampling matrix is built during down-sampling. Vertices retained during down-sampling (blue) undergo convolutional transformations, see Figure 1(c). These (blue) vertices are retained during up-sampling: $Q_u(q, p) = 1$ iff $Q_d(p, q) = 1$. Vertices $v_q \in \mathcal{V}$ discarded during down-sampling (red vertices), where $Q_d(p, q) = 0\ \forall p$, are mapped into the down-sampled mesh surface using barycentric coordinates. As shown in Figures 1(b)-1(d), this is done by projecting $v_q$ into the closest triangle $(i, j, k)$ in the down-sampled mesh, denoted by $\tilde{v}_q$, and computing the barycentric coordinates, $\tilde{v}_q = w_i v_i + w_j v_j + w_k v_k$, such that $v_i, v_j, v_k \in \mathcal{V}_d$ and $w_i + w_j + w_k = 1$. The weights are then updated in $Q_u$ as $Q_u(q, i) = w_i$, $Q_u(q, j) = w_j$, and $Q_u(q, k) = w_k$, and $Q_u(q, l) = 0$ otherwise. The up-sampled mesh with vertices $\mathcal{V}_u$ is obtained using sparse matrix multiplication, $\mathcal{V}_u = Q_u \mathcal{V}_d$.
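To make the two sampling matrices concrete, here is a small sketch under strong simplifying assumptions: the kept vertices are chosen by hand rather than by quadric-based edge contraction, the down-sampled mesh is a single triangle so no closest-triangle search is needed, and dense NumPy arrays stand in for sparse matrices. The names `barycentric_weights`, `Q_d`, `Q_u`, and `keep` are illustrative.

```python
import numpy as np

def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of the projection of p onto the plane of triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    w_b = (d11 * d20 - d01 * d21) / denom
    w_c = (d00 * d21 - d01 * d20) / denom
    return np.array([1.0 - w_b - w_c, w_b, w_c])

# Original mesh with m = 4 vertices; vertex 3 sits slightly above the triangle (0, 1, 2).
V = np.array([[0.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.5, 0.5, 0.1]])
keep = [0, 1, 2]                      # hand-picked survivors (assumption, not quadric-based)
m, n = len(V), len(keep)

# Down-sampling matrix: row p selects the original vertex keep[p].
Q_d = np.zeros((n, m))
Q_d[np.arange(n), keep] = 1.0

# Up-sampling matrix: kept vertices map back to themselves,
# the discarded vertex gets the barycentric weights of its projection.
Q_u = np.zeros((m, n))
Q_u[keep, np.arange(n)] = 1.0
Q_u[3] = barycentric_weights(V[3], *V[keep])

V_d = Q_d @ V                         # down-sampled vertices, shape (3, 3)
V_u = Q_u @ V_d                       # up-sampled vertices, shape (4, 3)
```

In this toy case the kept vertices are recovered exactly, while the discarded vertex comes back as its projection onto the triangle and loses its offset from the surface, which illustrates why the round trip is lossy.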
Table 1: Encoder architecture.
| Layer | Input Size | Output Size |
Table 2: Decoder architecture.
| Layer | Input Size | Output Size |
Network Architecture. Our autoencoder consists of an encoder and a decoder. The structure of the encoder is shown in Table 1. The encoder consists of 4 Chebyshev convolutional filters built on truncated Chebyshev polynomials. Each of the convolutions is followed by a biased ReLU. The down-sampling layers are interleaved between the convolutional layers. Each down-sampling layer reduces the number of mesh vertices by approximately 4 times. The encoder transforms the face mesh from $\mathbb{R}^{n \times 3}$ to an 8-dimensional latent vector using a fully connected layer at the end.
The structure of the decoder is shown in Table 2. The decoder similarly consists of a fully connected layer that transforms the 8-dimensional latent vector into a low-resolution mesh feature that can be further up-sampled to reconstruct the mesh. Following the decoder's fully connected layer, 4 convolutional layers with interleaved up-sampling layers generate a 3D mesh in $\mathbb{R}^{n \times 3}$. Each of the convolutions is followed by a biased ReLU, as in the encoder network. Each up-sampling layer increases the number of vertices by approximately 4 times. Figure 2 shows the complete structure of our mesh autoencoder.
We train our autoencoder for 300 epochs with a learning rate of 8e-3 and a learning rate decay of 0.99 every epoch. We use stochastic gradient descent with a momentum of 0.9 to optimize the L1 loss between predicted mesh vertices and the ground truth samples. We use L1 regularization on the weights of the network using weight decay of 5e-4. The convolutions use Chebyshev filtering.
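As a small worked example of the schedule described above, the sketch below computes the per-epoch learning rate, assuming the decay of 0.99 is applied multiplicatively once per epoch starting from epoch 0. The function name `learning_rate` is hypothetical, and the exact point at which the authors apply the decay is not stated in the text.

```python
def learning_rate(epoch, base_lr=8e-3, decay=0.99):
    """Learning rate at a given epoch under per-epoch exponential decay (assumed form)."""
    return base_lr * decay ** epoch

# One value per training epoch, 300 epochs in total.
schedule = [learning_rate(epoch) for epoch in range(300)]
first_lr, last_lr = schedule[0], schedule[-1]
```

Under this assumption the rate shrinks by roughly a factor of twenty over the 300 epochs.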
In this section, we evaluate the effectiveness of CoMA on an extreme facial expression dataset. We demonstrate that CoMA allows the synthesis of new expressive faces by sampling from the latent space in Section 5.2, including the effect of adding variational loss. Next, we compare CoMA to the widely used PCA representation for reconstructing expressive 3D faces. For this, we evaluate in Section 5.3 the ability to reconstruct data similar to the training data (interpolation experiment), and the ability to reconstruct expressions not seen during training (extrapolation experiment). Finally, in Section 5.4, we show improved performance by replacing the expression space of the state-of-the-art face model FLAME with our autoencoder.
Our dataset consists of 12 classes of extreme expressions from 12 different subjects. These expressions are complex and asymmetric. The expression sequences in our dataset are: bareteeth, cheeks in, eyebrow, high smile, lips back, lips up, mouth down, mouth extreme, mouth middle, mouth side and mouth up. We show samples from our dataset and the number of frames of each captured sequence in the Supplementary Material.
The data is captured at 60 fps with a multi-camera active stereo system (3dMD LLC, Atlanta) with six stereo camera pairs, five speckle projectors, and six color cameras. Our dataset contains 20,466 3D meshes, each with about 120,000 vertices. The data is pre-processed using a sequential mesh registration method to reduce the data dimensionality to 5023 vertices.
Let $E$ be the encoder and $D$ be the decoder. We first encode a face mesh $\mathcal{F}$ from our test set in the latent space to obtain features $z = E(\mathcal{F})$. We then vary each of the components of the latent vector as $\tilde{z}_i = z_i + \epsilon$. We then use the decoder to transform the latent vector into a reconstructed mesh $\tilde{\mathcal{F}} = D(\tilde{z})$. In Figure 3, we show a diversity of face meshes sampled from the latent space. Here, we extend or contract the latent vector along different dimensions by a factor of 0.3 such that $\tilde{\mathcal{F}} = D(z_1, \ldots, (1 + 0.3j) z_i, \ldots, z_8)$, where $j$ is the step. In Figure 3, the mean face is shown in the middle of the row. More examples are shown in the Supplementary Material.
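The latent traversal can be sketched as follows. This is one reading of "extend or contract by a factor of 0.3", in which a single latent component is scaled by (1 + 0.3 j) at step j; the function `traverse_latent`, the stand-in latent code, and the chosen step range are assumptions for illustration, and the trained decoder is not included.

```python
import numpy as np

def traverse_latent(z, dim, steps, factor=0.3):
    """Return copies of z with component `dim` scaled by (1 + factor * j) for each step j."""
    variants = []
    for j in steps:
        z_j = z.copy()
        z_j[dim] = (1.0 + factor * j) * z[dim]
        variants.append(z_j)
    return np.stack(variants)

z = np.arange(1.0, 9.0)                         # stand-in 8-dimensional latent code
variants = traverse_latent(z, dim=2, steps=range(-2, 3))
# Each row of `variants` would be passed through the decoder to obtain one mesh.
```

The middle row (step 0) leaves the code unchanged, so decoding it would reproduce the original reconstruction.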
Variational Convolutional Mesh Autoencoder. Although 3D faces can be sampled from our convolutional mesh autoencoder, the distribution of the latent space is not known. Therefore, sampling requires a mesh to be encoded in that space. In order to constrain the distribution of the latent space, we add a variational loss on our model. Let $E$ be the encoder, $D$ be the decoder, and $z$ be the latent representation of face $\mathcal{F}$. We minimize the loss,
$$l = \|\mathcal{F} - D(z)\|_1 + w_{kld}\, KL(\mathcal{N}(0, 1)\,\|\,Q(z|\mathcal{F})),$$
where $w_{kld}$ weights the KL divergence loss. The first term minimizes the L1 reconstruction error, and the second term enforces a unit Gaussian prior with zero mean on the distribution of latent vectors $Q(z)$. This enforces the latent space to be a multivariate Gaussian. In Figure 4, we show visualizations obtained by sampling faces from a Gaussian distribution on this space, within a range determined by $\sigma$, the variance of the Gaussian prior. We compare the visualizations by setting $w_{kld} = 0$. We observe that $w_{kld} = 0$ does not enforce any Gaussian prior on $Q(z)$, and therefore sampling with Gaussian noise from this distribution results in limited diversity in face meshes. We show more examples in the Supplementary Material.
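The two-term loss can be sketched numerically. The example below assumes a diagonal Gaussian posterior described by a mean and a variance per latent dimension, sums the L1 error over all coordinates, and takes the divergence from the unit Gaussian to the posterior; the argument order of the divergence, the reduction, and the names `gaussian_kl` and `variational_loss` are assumptions rather than details confirmed by the text.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between diagonal Gaussians, summed over latent dimensions."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def variational_loss(mesh, recon, mu, var, w_kld):
    """L1 reconstruction error plus a weighted divergence to a zero-mean unit Gaussian."""
    l1 = np.abs(mesh - recon).sum()
    prior_mu, prior_var = np.zeros_like(mu), np.ones_like(var)
    return l1 + w_kld * gaussian_kl(prior_mu, prior_var, mu, var)

# Toy check: a perfect reconstruction with a posterior equal to the prior gives zero loss.
mesh = np.ones((5, 3))
zero_loss = variational_loss(mesh, mesh, np.zeros(8), np.ones(8), w_kld=0.5)
shifted_kl = gaussian_kl(np.array([0.0]), np.array([1.0]), np.array([1.0]), np.array([1.0]))
```

Setting the weight to zero removes the second term entirely, which matches the observation that no prior is then enforced on the latent distribution.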
Several face models use a PCA space to represent identity and expression variations [41, 28, 1, 7, 47]. We perform interpolation and extrapolation experiments to evaluate our performance. We use Scikit-learn to compute PCA coefficients. We consistently use an 8-dimensional latent space to encode the face mesh using both the PCA model and the Mesh Autoencoder.
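For readers unfamiliar with the PCA baseline, the sketch below shows how a mesh, flattened to a vector of vertex coordinates, is encoded into and decoded from a low-dimensional linear space. It uses a plain NumPy SVD rather than Scikit-learn so that it has no extra dependencies, and the data is synthetic; `fit_pca`, `encode`, and `decode` are illustrative names.

```python
import numpy as np

def fit_pca(X, k):
    """Return the data mean and the top-k principal directions (rows) of X."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def encode(X, mean, components):
    """Project flattened meshes onto the principal directions."""
    return (X - mean) @ components.T

def decode(Z, mean, components):
    """Map latent codes back to flattened vertex coordinates."""
    return Z @ components + mean

# Synthetic stand-in data: 20 "meshes" of 4 vertices (12 coordinates) lying in a 2D subspace.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 12)) + 1.0
mean, components = fit_pca(X, k=2)
codes = encode(X, mean, components)
recon = decode(codes, mean, components)
```

Because every component spans all vertex coordinates, each latent value moves the whole mesh at once, which is the global influence that the convolutional model is designed to avoid.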
In order to evaluate the interpolation capability of the autoencoder, we split the dataset into training and test samples with a ratio of 9:1. The test samples are obtained by picking consecutive frames of length 10 uniformly at random across the sequences. We train CoMA for 300 epochs and evaluate it on the test set. We use Euclidean distance for comparison with the PCA method. The mean error with standard deviation and the median error are shown in Table 3 for comparison.
| | Mean Error | Median Error | # Parameters |
| Mesh Autoencoder | 0.845 ± 0.994 | 0.496 | 33,856 |
We observe that our reconstruction error is 50% lower than PCA. At the same time, the number of parameters in CoMA is about 75% fewer than the PCA model as shown in Table 3. Visual inspection of our qualitative results in Figure 6 shows that our reconstructions are more realistic and are effective in capturing extreme facial expressions. We also show the histogram of cumulative errors in Figure 5a. We observe that our Mesh Autoencoder (CoMA) has about 72.6% of the vertices within a Euclidean error of 1 mm, as compared to 47.3% for the PCA model.
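The reported statistics (mean with standard deviation, median, and the share of vertices within a 1 mm error) can be computed from per-vertex Euclidean distances as in the sketch below. The array shapes, the function names, and the toy values are assumptions; errors are taken to be in millimeters as in the text.

```python
import numpy as np

def vertex_errors(pred, target):
    """Per-vertex Euclidean distance; inputs have shape (num_meshes, num_vertices, 3)."""
    return np.linalg.norm(pred - target, axis=-1)

def summarize(errors, threshold=1.0):
    """Mean, standard deviation, median, and fraction of vertices within the threshold."""
    flat = errors.ravel()
    return {
        "mean": flat.mean(),
        "std": flat.std(),
        "median": np.median(flat),
        "within_threshold": (flat <= threshold).mean(),
    }

# Toy example: one mesh with four vertices at distances 0, 0.5, 1.5 and 5 from the target.
target = np.zeros((1, 4, 3))
pred = np.array([[[0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0],
                  [0.0, 1.5, 0.0],
                  [3.0, 4.0, 0.0]]])
stats = summarize(vertex_errors(pred, target))
```

Evaluating `within_threshold` over a range of thresholds yields the kind of cumulative error curve shown in Figure 5.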
| Sequence | Mesh Autoencoder: Mean Error | Mesh Autoencoder: Median | PCA: Mean Error | PCA: Median | FLAME: Mean Error | FLAME: Median |
Extrapolation Experiment. To measure the generalization of our model, we compare the performance of CoMA with the PCA model and FLAME. For comparison, we train the expression model of FLAME on our dataset. The FLAME reconstructions are obtained with a latent vector of size 16, with 8 components each for encoding identity and expression. The latent vectors encoded using the PCA model and the Mesh Autoencoder have a size of 8.
To evaluate the generalization capability of our model, we reconstruct expressions that are completely unseen by our model. We perform 12 different experiments for evaluation. For each experiment, we split our dataset by completely excluding one expression set from all the subjects of the dataset. We test our Mesh Autoencoder on the excluded expression. We compare the performance of our model with PCA and FLAME using the Euclidean distance (mean, standard deviation, median). We perform 12-fold cross-validation, one fold for each expression, as shown in Table 4. Table 4 also shows that our model performs better than PCA and FLAME on all expression sequences. We show the qualitative results in Figure 7 and the cumulative Euclidean error histogram in Figure 5b. For a 1 mm accuracy, the Mesh Autoencoder captures 63.8% of the vertices while the PCA model captures 45%.
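The leave-one-expression-out protocol can be sketched as a simple split generator. The sample tuples, subject names, and expression subset below are made up for illustration; only the splitting logic reflects the protocol described above.

```python
def leave_one_expression_out(samples, expressions):
    """Yield (held_out, train, test) with every frame of one expression held out per fold.

    Each sample is a (subject, expression, frame_id) tuple.
    """
    for held_out in expressions:
        train = [s for s in samples if s[1] != held_out]
        test = [s for s in samples if s[1] == held_out]
        yield held_out, train, test

# Toy data: 2 subjects, 3 expressions, 2 frames each (the real dataset has 12 expressions).
expressions = ["bareteeth", "cheeks in", "eyebrow"]
samples = [(subject, expression, frame)
           for subject in ("subject_a", "subject_b")
           for expression in expressions
           for frame in range(2)]
folds = list(leave_one_expression_out(samples, expressions))
```

Each fold removes the held-out expression for all subjects at once, so the test expression is never seen during training, regardless of who performs it.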
FLAME is a state-of-the-art model for face representation that combines linear blendskinning for head and jaw motion with linear PCA spaces to represent identity and expression shape variations. To improve the reconstruction error of FLAME, we replace the PCA expression space of FLAME with our autoencoder, and refer to the new model as DeepFLAME. We compare the performance of DeepFLAME with FLAME by varying the size of the latent vector used for encoding. Head rotations are factored out for comparison since they are well modeled by linear blendskinning in FLAME, and we consider only the expression space. The reconstruction accuracy is measured using the Euclidean distance metric. We show the comparisons in Table 5. The median reconstruction error of DeepFLAME is lower for all chosen latent space dimensions, while the mean reconstruction error is lower for up to 12 latent variables. This shows that DeepFLAME provides a more compact face representation; i.e., it captures more shape variation with fewer latent variables.
|#dim of||Mean Error||Median||Mean Error||Median|
The focus of CoMA is to model facial shape for reconstruction applications. The Laplace-Beltrami operator (LBo) describes the intrinsic surface geometry and is invariant under isometric surface deformations. This isometry invariance of the LBo is beneficial for shape matching and registration. Since changes in facial expression are near isometric deformations [9, Section 13.3], applying LBo to expressive faces would result in a loss of most expression-related shape variations, making it infeasible to model such variations. The graph Laplacian used by CoMA in contrast to the LBo is not isometry invariant.
While we evaluate CoMA on face shapes, it is applicable to any class of objects. Similar to existing statistical models however, it requires all meshes in dense vertex correspondence; i.e. all meshes need to share the same topology. A future research direction is to directly learn a 3D face representation from raw 3D face scans or 2D images without requiring vertex correspondence.
As is also true for other deep learning based models, the performance of CoMA could further improve with more training data. The amount of existing 3D face data, however, is very limited. This data scarcity especially limits the ability of our expression model to outperform existing models for higher latent space dimensions (see Table 5). We predict superior quality on larger datasets and plan to evaluate CoMA on significantly more data in the future.
As CoMA is an end-to-end trained model, it could also be combined with some existing image convolutional network to regress the 3D face shape from 2D images. We will explore this in future work.
We have introduced CoMA, a new representation for 3D faces of varying shape and expression. We designed CoMA as a hierarchical, multi-scale representation to capture global and local shape and expression variations at multiple scales. To do so, we introduce novel sampling operations and combine these with fast graph convolutions in an autoencoder network. The locally invariant filters, shared across the mesh surface, significantly reduce the number of filter parameters in the network, and the non-linear activation functions capture extreme facial expressions. We evaluated CoMA on a dataset of extreme 3D facial expressions that we will make publicly available for research purposes along with the trained model. We showed that CoMA significantly outperforms state-of-the-art models in 3D face reconstruction applications while using 75% fewer model parameters. CoMA outperforms the linear PCA model by 50% on interpolation experiments and generalizes better on completely unseen facial expressions. We further demonstrated that CoMA in a variational setting allows us to synthesize new expressive faces by sampling the latent space.
We thank Tsvetelina Alexiadis and Jorge Márquez for data acquisition. We thank Haven Feng for rendering the figures. We acknowledge the advice of Stefanie Wuhrer on mesh convolutions. We are grateful to Georgios Pavlakos, Despoina Paschalidou and Sergi Pujades for helping us with several revisions of the paper.
Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Fourteenth International Conference on Artificial Intelligence and Statistics (2011)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: Deepface: Closing the gap to human-level performance in face verification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1701–1708 (2014)
We capture 3D sequences of 12 subjects of different age groups, each of whom performs 12 different expressions. These expressions are chosen to be extreme, causing substantial facial tissue deformation. We also make sure that no two expressions are correlated with each other. The number of frames for each expression is listed in Table 6.