1 Introduction
Over the past few years, deep Convolutional Neural Networks (CNNs) have emerged as the method of choice for the majority of computer vision tasks that require learning from data [40, 16, 32]. While the initial use of CNNs was mainly limited to classification/segmentation tasks [16, 32], the introduction of Generative Adversarial Networks (GANs) [27] has expanded the application of deep convolutional architectures to image generation [37, 3, 6, 56] and image-to-image translation and completion [35, 67, 18]. Recently, strikingly realistic results have been shown by Nvidia using progressive GANs [37].
Given the success of generative models on images, there is keen interest in replicating them for geometric data. In order to make convolutions/deconvolutions feasible, the current generative approaches still rely on crude shape approximations. For example, recent approaches either use discrete volumetric representations for the 3D shapes, which result in very low-quality shapes [65], or they apply 1D convolutions combined with fully connected layers [1], which do not take into account the local structure of 3D shapes.
Recently, the field of geometric deep learning on non-Euclidean (graph and manifold) structured data has gained popularity [11], with numerous works on generalizing convolutional architectures directly on meshes. Intrinsic generative models are currently a key open question in geometric deep learning. Intrinsic autoencoder architectures have recently been proposed for human body [45] and face [57] meshes. Nevertheless, due to the lack of appropriate adversarial training, these autoencoders retain only the low-pass shape information and lose most of the details. Furthermore, contrary to GANs, they do not offer a principled sampling strategy. As of today, we are not aware of any successful intrinsic GAN for 3D mesh generation.
In this paper, we try to bridge this gap with the following contributions:

We present the first GAN architecture for 3D face generation. Contrary to the autoencoder recently proposed in [57] that learns latent spaces where identity and expression are mixed, we can generate expression for arbitrary identities.

We conduct quantitative and qualitative experiments to verify the efficacy and effectiveness of MeshGAN on large scale 3D facial data.
2 Related Work
2.1 Geometric deep learning
Geometric Deep Learning (GDL) is an emerging field in machine learning attempting to generalize modern deep learning architectures (such as convolutional neural networks) and the underpinning mathematical principles to non-Euclidean domains such as graphs and manifolds (for a comprehensive survey, the reader is referred to the recent review papers [11, 31, 5]).
The first formulations of neural networks on graphs [28, 58], preceding the recent renaissance of deep learning, constructed learnable information diffusion processes. This approach has more recently been reformulated using modern tools such as gated recurrent units [43] and neural message passing [26]. Bruna et al. [12, 33] proposed formulating convolution-like operations in the spectral domain defined by the eigenvectors of the graph Laplacian. A key drawback of this approach is the need to explicitly perform the Laplacian eigendecomposition, which leads to high computational complexity. However, if the spectral filter function can be expressed in terms of simple operations (scalar and matrix multiplications, additions, and inversions), it can be applied directly to the Laplacian, avoiding its explicit eigendecomposition altogether. Notable instances of this approach include ChebNets [22, 39] (using polynomial functions) and CayleyNets [41] (using rational functions); it is possible to generalize these methods to multiple graphs [50] and directed motif-based graph Laplacians [51] using multivariate polynomials.
Another class of graph CNNs are spatial methods, operating on local neighborhoods on the domain [23, 49, 4, 30, 63]. For meshes, the first such architecture (GCNN) used local geodesic polar charts [47]; alternative constructions were proposed using anisotropic diffusion (ACNN) [10] and learnable Gaussian kernels (MoNet) [49]. SplineCNN [24] uses B-spline kernels instead of Gaussians, offering a significant speed advantage. FeastNet [64] uses an attention-like soft-assignment mechanism to establish the correspondence between the patch and the filter. Finally, [44] proposed constructing patch operators using spiral ordering of neighbor pixels.
The majority of the aforementioned works focus on extracting features from non-Euclidean data (e.g., graphs and meshes) for classification purposes, and limited work has been done towards training generative models. One of the fundamental differences from classical Euclidean generative models (such as autoencoders [38] or Generative Adversarial Networks (GANs) [27]) is the lack of a canonical order between the input and the output graph, which introduces a graph correspondence problem to be solved. In this paper, we deal with the problem of 3D mesh generation and representation on a fixed topology. The setting of fixed topology is currently being studied in computer vision and graphics applications and is significantly easier, since it is assumed that the mesh is given and the vertices are canonically ordered; the generation problem thus amounts only to determining the embedding of the mesh.
The first intrinsic convolutional autoencoder architecture on meshes (MeshVAE) was shown in [45]. The authors used convolutional operators from [64] and showed examples of human body shape completion from partial scans. A follow-up work, CoMA [57], used a similar architecture with spectral Chebyshev filters [22] and additional spatial pooling to generate 3D facial meshes. The authors claim that CoMA can represent faces with expressions better than PCA in a very low-dimensional latent space of only eight dimensions. In this paper, we present the first GAN structure for generating meshes of 3D faces with fixed topology.
2.2 Generative adversarial networks
GANs are a promising unsupervised machine learning methodology implemented by a system of two deep neural networks competing against each other in a zero-sum game framework [27]. GANs have become hugely popular owing to their capability of modeling the distribution of visual data and generating new instances that have many realistic characteristics (i.e., preserving the high-frequency details) and look authentic to human observers. Currently, GANs are among the top choices to generate visual data and they are preferable to autoencoders and VAEs [46].
Nevertheless, the original GANs were criticized for being difficult to train and prone to mode collapse. Different GANs were proposed to tackle these problems. Wasserstein GAN (WGAN) [3] proposed a new loss function using the Wasserstein distance to stabilize the training. In continuation of WGAN, Gulrajani et al. [29] proposed a gradient penalty as an alternative to weight clipping, which helped to improve the training convergence and generation quality. Boundary Equilibrium GANs (BEGANs) [6] implemented the discriminator as an autoencoder whose loss is derived from the Wasserstein distance. There, an equilibrium-enforcing method was proposed to balance the training of the generator and the discriminator. Chang et al. [15] further proposed a variant of BEGAN with a Constrained Space (BEGAN-CS). They tried to improve the training stability by adding a latent-space constraint in the loss function. As BEGAN has demonstrated good performance in generating photo-realistic faces, we follow BEGAN and design a generative network for realistic generation of 3D faces.
2.3 3D Facial shape representation and generation
For the past two decades, the method of choice for representing and generating 3D faces has been Principal Component Analysis (PCA). PCA was used for building statistical 3D shape models (i.e., 3D Morphable Models (3DMMs)) in many works [54, 53, 7]. Recently, PCA has been adopted for building large scale statistical models of the 3D face [9] and head [21]. For representing and generating faces, it is very convenient to decouple facial identity variations from expression variations. Hence, statistical blendshape models have been introduced which represent only the expression variations using PCA [42, 52] or multilinear methods [13, 8]. Some recent efforts were made to represent facial expressions with deep learning using fully connected layers [60, 62]. Fully connected layers have a huge number of parameters and do not take into account the local geometry of the 3D facial surfaces. The only method that represented faces using convolutions on the mesh domain was the recently proposed mesh autoencoder CoMA [57]. Nevertheless, the identity and expression latent spaces of CoMA are mixed. Furthermore, the representative power and expressiveness of the model is somewhat limited because it was trained on only 12 subjects displaying 12 classes of extreme expressions. In this paper, we train deep generative graph convolutional neural networks (DGCNs) using spectral mesh convolutions that individually model identity and expression on large scale data.
3 Proposed Approach
In this section, we define the mesh convolution operators, describe our encoder and decoder/generator, and lay out our MeshGAN architecture for non-linear generation of 3D faces.
3.1 Data representation
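We represent the facial surface as a manifold triangular mesh $\mathcal{M} = (\mathcal{V}, \mathcal{E}, \mathcal{F})$ where each edge $e_{ij} \in \mathcal{E}$ belongs to at most two triangle faces $F_{ijk}$ and $F_{jih}$ (here, we denote by $\mathcal{E}_i$ and $\mathcal{E}_b$ the interior and boundary edges, respectively). An embedding of $\mathcal{M}$ is realised by assigning 3D coordinates to the vertices $\mathcal{V}$, which are encoded as an $n \times 3$ matrix $\mathbf{X}$ containing the vertex coordinates as rows. The discrete Riemannian metric is defined by assigning a length $\ell_{ij} > 0$ to each edge $e_{ij} \in \mathcal{E}$.
The Laplacian operator is discretised (using the distance-based equivalent of the cotangent formula [36, 48]) as an $n \times n$ matrix $\Delta = -\mathbf{A}^{-1}\mathbf{W}$, where $\mathbf{A}$ is a diagonal matrix of local area elements $a_i = \frac{1}{3}\sum_{jk:\, ijk \in \mathcal{F}} A_{ijk}$, with $A_{ijk}$ the area of triangle $F_{ijk}$ (obtainable from the edge lengths via Heron's formula), and $\mathbf{W}$ is a symmetric matrix of edge-wise weights, defined in terms of the discrete metric:
$$w_{ij} = \begin{cases} \dfrac{-\ell_{ij}^2 + \ell_{jk}^2 + \ell_{ik}^2}{8A_{ijk}} + \dfrac{-\ell_{ij}^2 + \ell_{jh}^2 + \ell_{ih}^2}{8A_{ijh}} & e_{ij} \in \mathcal{E}_i, \\ \dfrac{-\ell_{ij}^2 + \ell_{jk}^2 + \ell_{ik}^2}{8A_{ijk}} & e_{ij} \in \mathcal{E}_b, \\ -\sum_{k \neq i} w_{ik} & i = j, \\ 0 & \text{otherwise.} \end{cases}$$
The Laplacian admits an eigendecomposition $\Delta = \Phi \Lambda \Phi^\top$ with orthonormal eigenvectors $\Phi = (\phi_1, \dots, \phi_n)$ and non-negative eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ arranged into a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.
3.2 Spectral mesh convolutions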
Let $f = (f_1, \dots, f_n)^\top$ be a scalar real function defined on the vertices of the mesh, represented as an $n$-dimensional vector. The space of such functions is a Hilbert space with the standard inner product $\langle f, g \rangle = f^\top g$. The eigenvectors of the Laplacian form an orthonormal basis in the aforementioned Hilbert space, allowing a Fourier decomposition of the form $f = \Phi \Phi^\top f$, where $\hat{f} = \Phi^\top f$ is the Fourier transform of $f$. The Laplacian eigenvectors thus play the role of standard Fourier atoms and the corresponding eigenvalues that of the respective frequencies. Finally, a convolution operation can be defined in the spectral domain by analogy to the Euclidean case as $f \star g = \Phi(\hat{f} \circ \hat{g}) = \Phi\,\mathrm{diag}(\hat{g}_1, \dots, \hat{g}_n)\,\hat{f}$.
Spectral graph CNNs. Bruna et al. [12] exploited the above formulation for designing graph convolutional neural networks, in which a basic spectral convolution operation has the form $f' = \Phi \hat{G} \Phi^\top f$, where $\hat{G} = \mathrm{diag}(\hat{g}_1, \dots, \hat{g}_n)$ is a diagonal matrix of spectral multipliers representing the filter and $f'$ is the filter output. Notable drawbacks of this architecture, which put it at a clear disadvantage compared to classical Euclidean CNNs, are: high computational complexity ($O(n^2)$ due to the cost of computing the forward and inverse graph Fourier transform, incurring dense matrix multiplication), $O(n)$ parameters per layer, and no guarantee of spatial localization of the filters.
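To make the cost structure of a spectral convolution concrete, the following is a minimal NumPy sketch of filtering a vertex signal in the Laplacian eigenbasis. It is only an illustration under stated assumptions: it uses a symmetric combinatorial Laplacian of a toy 4-cycle graph (ignoring the area weights of the mesh Laplacian), and the function name `spectral_conv` is hypothetical rather than taken from any library.

```python
import numpy as np

def spectral_conv(L, f, g_hat):
    """Filter f in the eigenbasis of a symmetric Laplacian L.

    g_hat holds one spectral multiplier per eigenvalue (ascending order).
    """
    lam, Phi = np.linalg.eigh(L)      # explicit eigendecomposition (the costly step)
    f_hat = Phi.T @ f                 # forward Fourier transform (dense product)
    return Phi @ (g_hat * f_hat)      # multiply in the spectrum, transform back

# Toy example (assumption): combinatorial Laplacian D - W of a 4-cycle.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
f = np.array([1.0, 2.0, 3.0, 4.0])

lam = np.linalg.eigvalsh(L)
identity_out = spectral_conv(L, f, np.ones(4))           # all-ones multipliers: no filtering
lowpass_out = spectral_conv(L, f, np.exp(-0.5 * lam))    # heat-kernel-like low-pass filter
```

Both dense matrix-vector products scale quadratically with the number of vertices, and the filter needs one free multiplier per eigenvalue, which is what motivates polynomial filters.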
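ChebNet. Defferrard et al. [22] considered the spectral CNN framework with polynomial filters represented in the Chebyshev basis, $\tau_\theta(\tilde{\lambda}) = \sum_{j=0}^{K-1} \theta_j T_j(\tilde{\lambda})$, where $T_j(\tilde{\lambda}) = 2\tilde{\lambda} T_{j-1}(\tilde{\lambda}) - T_{j-2}(\tilde{\lambda})$ denotes the Chebyshev polynomial of degree $j$, with $T_1(\tilde{\lambda}) = \tilde{\lambda}$ and $T_0(\tilde{\lambda}) = 1$. A single filter of this form can be efficiently computed by applying powers of the Laplacian to the feature vector,
$$\tau_\theta(\tilde{\Delta}) f = \sum_{j=0}^{K-1} \theta_j T_j(\tilde{\Delta}) f, \qquad (1)$$
thus avoiding its eigendecomposition altogether. Here $\tilde{\lambda}$ is a frequency rescaled in $[-1, 1]$, and $\tilde{\Delta} = 2\lambda_n^{-1}\Delta - \mathbf{I}$ is the rescaled Laplacian with eigenvalues $\tilde{\Lambda} = 2\lambda_n^{-1}\Lambda - \mathbf{I}$. The computational complexity thus drops from $O(n^2)$ as in the case of spectral CNNs to $O(n)$, since the mesh is sparsely connected.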
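The recurrence behind Chebyshev filtering can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it assumes a dense symmetric Laplacian on a toy 4-cycle graph (a real mesh would use sparse matrices), the name `cheb_filter` is hypothetical, and the largest eigenvalue is computed exactly here although in practice it is usually only estimated.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebval

def cheb_filter(L, f, theta):
    """Evaluate sum_j theta[j] * T_j(L_tilde) @ f via the three-term recurrence."""
    lam_max = np.linalg.eigvalsh(L)[-1]               # used only to rescale the spectrum
    L_tilde = 2.0 * L / lam_max - np.eye(L.shape[0])  # eigenvalues now lie in [-1, 1]
    t_prev = f                                        # T_0(L_tilde) f
    out = theta[0] * t_prev
    if len(theta) == 1:
        return out
    t_curr = L_tilde @ f                              # T_1(L_tilde) f
    out = out + theta[1] * t_curr
    for j in range(2, len(theta)):
        # T_j = 2 * L_tilde * T_{j-1} - T_{j-2}, applied directly to the signal
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[j] * t_curr
    return out

# Toy example (assumption): 4-cycle graph, six coefficients as in a K = 6 filter.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
f = np.array([1.0, -2.0, 0.5, 3.0])
theta = np.array([0.5, -1.0, 0.25, 0.1, 0.3, -0.2])
out = cheb_filter(L, f, theta)

# Reference: the same filter evaluated in the spectral domain.
lam, Phi = np.linalg.eigh(L)
ref = Phi @ (chebval(2.0 * lam / lam[-1] - 1.0, theta) * (Phi.T @ f))
```

The recurrence only needs matrix-vector products with the Laplacian, so with a sparse Laplacian each filter costs time linear in the number of edges, and the result agrees with the explicit spectral evaluation.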
3.3 MeshGAN
We introduce MeshGAN, a variant of BEGAN [6], that can learn a non-linear 3DMM directly from the 3D meshes. Specifically, we employ the aforementioned ChebNet to build our discriminator $D$ and generator $G$.
3.3.1 Boundary equilibrium generative adversarial networks
The main difference between BEGAN and typical GANs is that BEGAN uses an autoencoder as the discriminator, as it tries to match the autoencoder loss distributions rather than the data distributions. This is achieved by adding an extra equilibrium term $\gamma$. More precisely, this hyper-parameter is used to maintain the balance between the expected losses of the discriminator and the generator (i.e., $\gamma = \mathbb{E}[\mathcal{L}(G(z))] / \mathbb{E}[\mathcal{L}(x)]$). The training objective of BEGAN is as follows:
$$\begin{aligned} \mathcal{L}_D &= \mathcal{L}(x) - k_t\,\mathcal{L}(G(z)) && \text{for } \theta_D, \\ \mathcal{L}_G &= \mathcal{L}(G(z)) && \text{for } \theta_G, \\ k_{t+1} &= k_t + \lambda_k\big(\gamma\,\mathcal{L}(x) - \mathcal{L}(G(z))\big) && \text{for each training step } t, \end{aligned}$$
where $z$ is a uniform random vector of dimension $N_z$ (i.e., the latent vector of the generator), $\theta_D$ and $\theta_G$ are the trainable parameters of the discriminator and generator, respectively, and $\mathcal{L}(\cdot)$ is the discriminator's autoencoder reconstruction loss. In each training step $t$, the variable $k_t$ is utilised to control the influence of the fake loss on the discriminator; $\lambda_k$ can be regarded as the learning rate of $k$, which is set to 0.001. Berthelot et al. [6] found that $\gamma$ has a decisive impact on the diversity of generated images; that is, lower values tend to produce mean-face-like images. To encourage more variation, we empirically set $\gamma$ to 0.7.
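The balancing mechanism can be illustrated with a small, framework-free sketch of one BEGAN bookkeeping step. The function name `began_step` and its scalar inputs are hypothetical; the symbols `gamma` and `lambda_k` follow the notation of BEGAN [6], with the values 0.7 and 0.001 quoted above as defaults. Clipping the control variable to [0, 1] is an assumption borrowed from common BEGAN implementations.

```python
def began_step(loss_real, loss_fake, k, gamma=0.7, lambda_k=0.001):
    """One BEGAN bookkeeping step on scalar autoencoder losses.

    loss_real: reconstruction loss of the discriminator on real samples.
    loss_fake: reconstruction loss of the discriminator on generated samples.
    k:         current value of the control variable.
    """
    loss_d = loss_real - k * loss_fake      # discriminator objective
    loss_g = loss_fake                      # generator objective
    # Proportional control: push loss_fake towards gamma * loss_real.
    k_next = k + lambda_k * (gamma * loss_real - loss_fake)
    k_next = min(max(k_next, 0.0), 1.0)     # assumption: keep k within [0, 1]
    return loss_d, loss_g, k_next

# Example: the fake loss is below gamma * loss_real, so k grows slightly and the
# discriminator will put more weight on the fake term in the next step.
loss_d, loss_g, k_next = began_step(loss_real=1.0, loss_fake=0.5, k=0.0)
```

In an actual training loop the two losses would be minimised by separate optimizers for the discriminator and generator parameters, while the control variable is updated without gradients.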
3.3.2 MeshGAN architecture
Based on the architecture of BEGAN, we developed MeshGAN using ChebNet [22, 39]. The architecture of MeshGAN is illustrated in Fig. 3. We follow a design similar to CoMA for building our encoder and generator/decoder: 4 Chebyshev convolutional filters with $K = 6$ polynomials are used in the encoder. However, after each convolution layer, we select ELU [19] as the activation function to allow the passing of negative values. The mesh downsampling step is performed by the surface simplification method in [25], which minimises the quadric error when decimating the template. Upsampling of the template is based on the barycentric coordinates of the contracted vertices in the decimated mesh [57]. In total, we perform 4 levels of downsampling, with each level lowering the number of vertices by approximately 4 times. To allow for more representational power, we set the bottleneck of the discriminator to 64, equal to the dimension of the feature embedding in the generator. A momentum optimizer [55] is employed, together with a learning-rate decay. We train all the models for 300 epochs. Note that skip connections between the output of the fully connected layer and each upsampled graph can be applied to encourage more facial details.
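The barycentric upsampling step can be sketched as follows. This is a simplified illustration under assumptions: each fine-level vertex is taken to be stored as the index triple of a coarse triangle plus three barycentric weights, and the names `upsample`, `tri_idx` and `bary` are hypothetical, not taken from the CoMA code.

```python
import numpy as np

def upsample(coarse_vertices, tri_idx, bary):
    """Recover fine-level vertices from a decimated mesh.

    coarse_vertices: (m, 3) vertex positions of the decimated mesh.
    tri_idx:         (n, 3) integer indices of the coarse triangle per fine vertex.
    bary:            (n, 3) barycentric weights, each row summing to one.
    """
    corners = coarse_vertices[tri_idx]              # (n, 3 corners, 3 coordinates)
    return np.einsum('nk,nkd->nd', bary, corners)   # weighted sum of the corners

# Toy example: one coarse triangle, two fine vertices.
coarse = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
tri_idx = np.array([[0, 1, 2],
                    [0, 1, 2]])
bary = np.array([[1.0, 0.0, 0.0],          # coincides with the first corner
                 [1/3, 1/3, 1/3]])         # triangle centroid
fine = upsample(coarse, tri_idx, bary)
```

Because the weights are fixed once the template has been decimated, the same operation can be stored as a sparse matrix and applied to feature maps at every decoder level.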
4 Experiments
4.1 3D face databases
3dMD: For identity model training, we used recently collected 3dMD datasets scanned by the high resolution 3dMD device (http://www.3dmd.com/). We selected around 12,000 unique identities from this database, covering different ethnic groups (e.g., Chinese, Caucasian and Black) and age groups.
4DFAB: To train expression models, we use the 4DFAB database [17], which is the largest dynamic 3D face database that contains both posed and spontaneous expressions. In 4DFAB, participants were invited to attend four experiment sessions at different times. In each session, participants were asked to articulate 6 basic facial expressions, and then watched several emotional videos. Annotation of apex posed expression frames as well as the expression category of spontaneous sequences were provided. To ensure the richness of expressions in our training set, we randomly sampled 6,651 apex posed expression meshes and 7,567 spontaneous expression meshes from 4DFAB.
For each database, we train the CoMA and MeshGAN models with the corresponding data. We label the models that are trained on the 3dMD database with the suffix -ID, whereas the models trained on the 4DFAB database are given the suffix -EXP.
4.1.1 Data preprocessing
To balance the fineness and complexity of the model, we cropped and decimated the LSFM model [9], and generated a new 3D template with 5,036 vertices. In order to bring all the data into dense correspondence with the template, we employed non-rigid ICP [2] to register each mesh. We automatically detected 79 3D facial landmarks with the UV-based alignment method developed in [17], and utilised these landmarks to assist dense registration. Unless otherwise stated, we divided each database into training and testing sets with a split ratio of 9:1.
On a separate note, in order to train the expression models, we need to decouple facial identity from every expression mesh in 4DFAB. This was achieved by manually selecting one neutral face per subject per session in 4DFAB, and subtracting the corresponding neutral face from each expression mesh to obtain the facial deformation. We then applied this deformation to the 3D template to generate a training set with pure expressions. Note that a local surface-preserving smoothing step [61] was undertaken to further remove identity information as well as noise.
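The identity-removal step described above amounts to simple vertex arithmetic once all meshes are in dense correspondence. The sketch below assumes every mesh is an (n, 3) array with the same vertex ordering; the function name `transfer_expression` is hypothetical, and the smoothing step [61] is deliberately omitted.

```python
import numpy as np

def transfer_expression(expression_mesh, neutral_mesh, template):
    """Move a subject's expression onto the template.

    All inputs are (n, 3) vertex arrays in dense correspondence.
    """
    deformation = expression_mesh - neutral_mesh   # per-vertex displacement field
    return template + deformation                  # pure expression on the template

# Toy example with three vertices (values are made up for illustration).
neutral = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
expression = neutral + np.array([[0.0, 0.0, 0.1], [0.0, 0.0, 0.0], [0.0, 0.2, 0.0]])
template = np.array([[0.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 2.0, 1.0]])
pure_expression = transfer_expression(expression, neutral, template)
```

A neutral input leaves the template unchanged, so the resulting training set varies only through the transferred deformations.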
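Table 1: Intrinsic evaluation of the identity models (generalisation and specificity errors in mm).
Methods      Generalisation   Specificity     FID
CoMA-ID      0.442 ± 0.116    1.60 ± 0.228    14.24
MeshGAN-ID   0.465 ± 0.189    1.433 ± 0.144   10.82
Table 2: Intrinsic evaluation of the expression models (generalisation and specificity errors in mm).
Methods       Generalisation   Specificity     FID
CoMA-EXP      0.606 ± 0.203    1.899 ± 0.272   22.43
MeshGAN-EXP   0.605 ± 0.264    1.536 ± 0.153   13.59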
4.2 Intrinsic evaluation of MeshGAN
We gave a quantitative evaluation of MeshGAN’s generator, whose counterpart is the decoder of CoMA. The intrinsic characteristics of the models include generalisation capability, specificity [14, 8], as well as FID score [34].
Generalisation. The generalisation measures the ability of a model to represent/reconstruct unseen face shapes that are not present during training. To compute the generalisation error, we computed the per-vertex Euclidean distance between every sample of the test set and its corresponding reconstruction by the generator $G$:
(2) 
After that, we took the average value over all vertices and all test samples. This procedure was conducted separately on the identity and expression models. We report the mean and standard deviation of the reconstruction errors in Table 1 and Table 2. It can be seen that both methods achieved similar performance in reconstructing facial expressions (MeshGAN-EXP achieved 0.605mm, while CoMA-EXP produced 0.606mm), whereas CoMA is slightly better in describing unseen identities (0.023mm lower in error). This is probably because an autoencoder is specifically trained to reconstruct data examples, while BEGAN is not. We leave this for future investigation, and refer the readers to [20, 66].
Specificity. The specificity of a model evaluates the validity of generated faces. For each model, we randomly synthesised 10,000 faces and measured the proximity between them and the real faces in the test set. More precisely, for every randomly generated face, we found its nearest neighbor in the test set, in terms of the minimum (over all samples of the test set) of the average per-vertex distance. We recorded the mean and standard deviation of this distance over all random samples as the specificity error. Note that we randomly sampled MeshGAN from the uniform distribution, whereas we sampled CoMA from a multivariate Gaussian distribution estimated from the feature embeddings of the training data in CoMA (using Eq. 2). Table 1 and Table 2 also display the specificity errors for the different models. We observed that in all cases MeshGAN attained lower errors than CoMA, i.e., 0.17mm lower in identity and 0.36mm lower in expression. This is quantitative evidence that the synthetic faces generated by MeshGAN models are more realistic than those of CoMA.
Fréchet Inception Distance (FID). FID [34] is a reliable measure of the quality and diversity of the images generated by GANs. To compute the FID score, we borrowed the pre-trained Inception network [59] to extract features from an intermediate layer and then modelled the distribution of these features using a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$. As the Inception network is trained on 2D images, we rasterised each 3D mesh (with Lambertian shading) into a 64×64 image and fed it to the network. The FID score between the real images and generated images is computed as:
$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$
where $\mathcal{N}(\mu_r, \Sigma_r)$ and $\mathcal{N}(\mu_g, \Sigma_g)$ are the multivariate Gaussians estimated from the Inception features of the real and generated images, respectively. The smaller the FID value, the better the image quality and diversity. It has to be mentioned that, when sampling the latent space of CoMA, we did not estimate the multivariate Gaussian beforehand, as the training data distribution is not supposed to be revealed here. Hence, we used a standard Gaussian to sample the latent space of CoMA, while for MeshGAN we always use the uniform distribution. We show the FID scores of CoMA and MeshGAN in Table 1 and Table 2. We can observe that the FID scores of MeshGAN are significantly lower than those of CoMA in both cases. This provides further strong evidence that MeshGAN can generate meshes with richer variations and better quality than autoencoders.
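The three intrinsic measures can be sketched with NumPy as below. This is a hedged illustration rather than the evaluation code used for the tables: the function names are hypothetical, meshes are assumed to be arrays of shape (samples, vertices, 3) in dense correspondence, the standard deviation is taken over per-sample errors (an assumption), and the Fréchet distance operates on precomputed feature vectors instead of running an Inception network.

```python
import numpy as np

def generalisation(test_meshes, reconstructions):
    """Mean and std of the average per-vertex distance to the reconstructions."""
    err = np.linalg.norm(test_meshes - reconstructions, axis=-1).mean(axis=-1)
    return err.mean(), err.std()

def specificity(generated, test_meshes):
    """Mean and std of the distance from each generated face to its nearest test face."""
    diff = generated[:, None] - test_meshes[None]        # (g, t, n, 3)
    dist = np.linalg.norm(diff, axis=-1).mean(axis=-1)   # (g, t) average per-vertex distance
    nearest = dist.min(axis=1)                           # nearest neighbour per generated face
    return nearest.mean(), nearest.std()

def frechet_distance(feat_real, feat_gen):
    """Frechet distance between Gaussians fitted to two (samples, dims) feature sets."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_r @ cov_g, which are real and non-negative for covariance matrices.
    eig = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)
```

For 10,000 generated faces the pairwise distance array in `specificity` becomes large, so a practical version would process the generated set in chunks.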
We also experimented with different GANs (such as the original GAN [27], WGAN [3] and BEGAN-CS [15]) using the same architecture as MeshGAN. Unfortunately, they did not achieve performance comparable with CoMA or BEGAN. Due to limited space, we put this ablation study in the supplementary material.
4.3 Extrapolating identity and expression model
We first extrapolated the latent vector of the identity model and visualised the exaggerated synthetic examples. Given a pair of meshes, we estimated their feature embeddings using Eq. 2. After that, we computed the extrapolated latent vector as a non-convex combination of the two embeddings:
(3) 
Here, we fixed the first mesh to be the neutral template, while the second was the target face reconstructed by MeshGAN and CoMA, separately. Fig. 4 shows the extrapolation results of the identity model in terms of ethnicity and age (note that we gradually increased the extrapolation factor). We can clearly observe that: (a) MeshGAN can better describe the subtle facial details (e.g., eyes and lips); (b) CoMA produces highly distorted and grotesque faces (e.g., disproportionate nose, incorrect exaggeration of ethnicity and age) as the extrapolation proceeds, whereas MeshGAN does not have such issues.
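A non-convex combination of two latent vectors can be sketched as below. The affine form used here, with a single factor `alpha`, is an assumption for illustration; the exact parameterisation of Eq. 3 may differ, and the names `extrapolate`, `z_a` and `z_b` are hypothetical.

```python
import numpy as np

def extrapolate(z_a, z_b, alpha):
    """Affine combination of two latent vectors.

    alpha in [0, 1] interpolates between z_a and z_b;
    alpha > 1 extrapolates beyond z_b (a non-convex combination).
    """
    return (1.0 - alpha) * z_a + alpha * z_b

# Toy example: z_a plays the role of the neutral template embedding and
# z_b that of the target face (values are made up for illustration).
z_a = np.zeros(4)
z_b = np.array([0.2, -0.1, 0.4, 0.0])
exaggerated = [extrapolate(z_a, z_b, a) for a in (1.0, 2.0, 3.0)]
```

Decoding each vector in `exaggerated` with the generator would then yield increasingly exaggerated versions of the target face.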
For the extrapolation of expression models, we followed the same approach and show the results in Fig. 5. MeshGAN is more capable of representing different facial expressions, especially the facial muscle movement (e.g., disgust in the first row). Compared with CoMA, the exaggerated expressions from MeshGAN are still quite meaningful and realistic. To quantitatively evaluate the semantic correctness of exaggerated expressions, we trained a 3D facial expression recognition (FER) classifier using SplineCNN [24]. We built this FER network with 4 convolution layers: SConv($k$,1,16)→Pool(4)→SConv($k$,16,16)→Pool(4)→SConv($k$,16,16)→Pool(4)→SConv($k$,16,32)→Pool(4)→FC(6), where $k$ denotes the B-spline kernel size. ELU [19] is used after each convolution and fully connected layer. We trained the network for 80 epochs, with learning rate and epoch size equal to 0.0001 and 16, respectively. The Pool() operation is exactly the same as in MeshGAN. For FER training, we prepared around 6k posed expression meshes (6 expressions, each with nearly 1k samples) from 4DFAB, which are not present in the training set of the expression model. We tested the exaggerated expressions produced with different extrapolation factors (ranging from 1 to 3). We plotted the recognition rate for each factor as a curve in Fig. 6. Interestingly, as the degree of extrapolation increases, the recognition rate for CoMA drastically declines, while that of MeshGAN decreases comparatively slowly. This further suggests that MeshGAN can still provide meaningful expressions even when sampling beyond the normal range.
4.4 Qualitative results
We used the pipeline in Fig. 3 to generate 3D identities with expressions. Qualitative results are shown in Fig. 1 (b). To visualise the interpolation and extrapolation between/beyond two faces, we synthesised two identities with different expressions and used them as the anchor faces. Following Eq. 3, we varied the parameters of the identity and expression models by separate factors. Using the grid of interpolated/extrapolated parameters, we synthesised the corresponding faces and displayed them in Fig. 7.
5 Conclusion
We presented the first GAN capable of generating 3D facial meshes of different identities and different expressions. We have empirically demonstrated that the proposed MeshGAN can generate 3D facial meshes with more subtle details than the state-of-the-art autoencoders. Finally, we showed that the proposed MeshGAN can model the distribution of faces better than autoencoders, and hence leads to better sampling strategies.
References
 [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. In ICLR Workshop, 2018.
 [2] B. Amberg, S. Romdhani, and T. Vetter. Optimal step nonrigid icp algorithms for surface registration. In CVPR, pages 1–8. IEEE, 2007.
 [3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
 [4] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In NIPS, 2016.
 [5] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261, 2018.
 [6] D. Berthelot, T. Schumm, and L. Metz. Began: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 [7] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, pages 187–194, 1999.
 [8] T. Bolkart and S. Wuhrer. A groupwise multilinear correspondence optimization for 3d faces. In ICCV, pages 3604–3612, 2015.
 [9] J. Booth, A. Roussos, A. Ponniah, D. Dunaway, and S. Zafeiriou. Large scale 3d morphable models. IJCV, 126(2-4):233–254, 2018.
 [10] D. Boscaini, J. Masci, E. Rodolà, and M. M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In NIPS, 2016.
 [11] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 [12] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv:1312.6203, 2013.
 [13] A. Brunton, T. Bolkart, and S. Wuhrer. Multilinear wavelets: A statistical shape space for human faces. In ECCV, pages 297–312. Springer, 2014.
 [14] A. Brunton, A. Salazar, T. Bolkart, and S. Wuhrer. Review of statistical shape spaces for 3d data with comparative analysis for human faces. CVIU, 128:1–17, 2014.
 [15] C.-C. Chang, C. H. Lin, C.-R. Lee, D.-C. Juan, W. Wei, and H.-T. Chen. Escaping from collapsing modes in a constrained space. In ECCV, 2018.
 [16] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834–848, 2018.
 [17] S. Cheng, I. Kotsia, M. Pantic, and S. Zafeiriou. 4dfab: A large scale 4d database for facial expression analysis and biometric applications. In CVPR, pages 5117–5126, 2018.
 [18] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. arXiv preprint, 1711, 2017.
 [19] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). In ICLR, 2016.
 [20] A. Creswell and A. A. Bharath. Inverting the generator of a generative adversarial network (ii). arXiv preprint arXiv:1802.05701, 2018.
 [21] H. Dai, N. Pears, W. A. P. Smith, and C. Duncan. A 3d morphable model of craniofacial shape and texture variation. In ICCV, Oct 2017.
 [22] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, 2016.
 [23] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In NIPS, 2015.
 [24] M. Fey, J. E. Lenssen, F. Weichert, and H. Müller. Splinecnn: Fast geometric deep learning with continuous B-spline kernels. In CVPR, 2018.
 [25] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 209–216. ACM Press/Addison-Wesley Publishing Co., 1997.
 [26] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. arXiv:1704.01212, 2017.
 [27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [28] M. Gori, G. Monfardini, and F. Scarselli. A new model for learning in graph domains. In IJCNN, 2005.
 [29] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In NIPS, pages 5767–5777, 2017.
 [30] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
 [31] W. L. Hamilton, R. Ying, and J. Leskovec. Representation learning on graphs: Methods and applications. arXiv:1709.05584, 2017.
 [32] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [33] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv:1506.05163, 2015.
 [34] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In NIPS, pages 6626–6637, 2017.
 [35] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint, 2017.
 [36] A. Jacobson and O. Sorkine-Hornung. A cotangent laplacian for images as surfaces. Technical report/Department of Computer Science, ETH, Zurich, 757, 2012.
 [37] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
 [38] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
 [39] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
 [40] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [41] R. Levie, F. Monti, X. Bresson, and M. M. Bronstein. Cayleynets: Graph convolutional neural networks with complex rational spectral filters. arXiv:1705.07664, 2017.
 [42] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4d scans. ACM TOG, 36(6):194, 2017.
 [43] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel. Gated graph sequence neural networks. In ICLR, 2016.
 [44] I. Lim, A. Dielen, M. Campen, and L. Kobbelt. A simple approach to intrinsic correspondence learning on unstructured 3d meshes. arXiv:1809.06664, 2018.
 [45] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In CVPR, 2018.
 [46] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. In NIPS, 2018.
 [47] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In 3dRR, 2015.
 [48] M. Meyer, M. Desbrun, P. Schröder, and A. H. Barr. Discrete differential-geometry operators for triangulated 2-manifolds. In Visualization and mathematics III, pages 35–57. Springer, 2003.
 [49] F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017.
 [50] F. Monti, M. M. Bronstein, and X. Bresson. Geometric matrix completion with recurrent multigraph neural networks. In NIPS, 2017.
 [51] F. Monti, K. Otness, and M. M. Bronstein. Motifnet: a motif-based graph convolutional network for directed graphs. arXiv:1802.01572, 2018.
 [52] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt. Sparse localized deformation components. ACM TOG, 32(6):179, 2013.
 [53] A. Patel and W. A. Smith. 3d morphable face models revisited. In CVPR, pages 1327–1334. IEEE, 2009.
 [54] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In AVSS, pages 296–301. IEEE, 2009.
 [55] N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
 [56] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [57] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. arXiv:1807.10267, 2018.
 [58] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Trans. Neural Networks, 20(1):61–80, 2009.
 [59] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
 [60] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia. Variational autoencoders for deforming 3d mesh models. In CVPR, pages 8377–8386, 2018.
 [61] G. Taubin, T. Zhang, and G. Golub. Optimal surface smoothing as filter design. In ECCV, pages 283–292. Springer, 1996.
 [62] L. Tran and X. Liu. Nonlinear 3d face morphable model. arXiv preprint arXiv:1804.03786, 2018.
 [63] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. arXiv:1710.10903, 2017.
 [64] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Featuresteered graph convolutions for 3d shape analysis. In CVPR, 2018.
 [65] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In NIPS, pages 82–90, 2016.
 [66] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, pages 597–613. Springer, 2016.
 [67] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.