The generation of realistic examples of everyday objects is a challenging and interesting problem which relates to several research fields such as geometry, computer graphics, and computer vision. The ability to capture the essence of a class of objects is key to the task of generating diverse datasets which may be used in turn during the training of many machine learning based algorithms. The main challenge posed by the task of data generation is to construct a model that is able to generalize to many variations while still maintaining high detail and quality. Furthermore, the challenge of generating geometric data is even greater, since both the geometry and the texture of an object must be synthesized while taking into account the underlying relations between them.
In this work, we propose to learn the latent space of 3D textured objects. We focus our efforts on human faces, and show that by using a canonical transformation that maps geometric data to images, we are able to learn the distribution of such images via the GAN framework. By representing both texture and geometry of the face as transformed geometric images, we can learn the underlying distribution of faces, and later generate new faces at will. The generation of realistic human faces is a useful tool with applications in face recognition, puppetry, reconstruction and rendering. Our main contributions are the proposition of a new model for 3D human faces which is composed in the 2D image domain, as well as the modeling of the relation between texture and geometry, further improving realism. By generating geometries and textures using state of the art GANs, it is possible to create highly detailed data samples while maintaining the ability to generalize to unseen data, two desirable properties that are often at odds.
While deep learning and convolutional networks have revolutionized many fields in recent years, they have been mostly employed on structured data which is intrinsically ordered. Arranged data such as audio, video, images, and text can be processed according to the order of samples, frames, pixels or words. This inherent ordering permits the application of convolution operations which are the main building block of convolutional networks, a powerful and popular variant of deep networks. Contrary to typical parameterized data, geometric data represented by two dimensional manifolds lacks an intrinsic parameterization and is therefore more difficult to process via convolutional networks. This important class of data is crucial to the task of modeling our world as most solid objects can be represented by a closed manifold accompanied by a texture overlay.
Recently, geometric data has grown dramatically in availability as more accurate and affordable acquisition devices have come into use. This abundance of data has attracted the attention of the computer vision and machine learning communities, leading to many new approaches for modeling and processing of geometries. One family of techniques for geometric data processing aims to define new operators which can be applied directly to the manifold and are able to replace, to some extent, the convolution operation within the processing pipeline. Other methods attempt to process geometries in the spectral domain or represent them in voxel space. These families of methods each have their merits but suffer from other issues such as loss of generality and memory inefficiency. In contrast, we propose to transform our geometric data via a canonical mapping into two dimensional gridded data. This allows us to process the geometric data as images. While this approach on its own is not new, we show that by careful construction of the transformed dataset we are able to harness the power of convolutional networks with little loss of data fidelity. Furthermore, we are able to design our transformation process in order to control the distortion, reducing it in important areas while spreading it to the non-essential areas of the data. Finally, we propose to encode both the geometry and texture as mapped images, which means the processing pipeline remains identical for both cases.
2. Related work
Data augmentation is a common practice within the machine learning community. By applying various transformations to existing data samples it is possible to simulate a much larger dataset than is available and introduce robustness to transformations. A more advanced method for data augmentation takes into account the geometry of the scene. The technique, which we term geometric data augmentation, consists of a geometry recovery stage, followed by a transformation of the recovered geometry and, finally, the creation of a new image by projecting the transformed geometry. In (Masi et al., 2016), the authors show that by performing geometric data augmentation on a dataset of facial images they are able to reach state of the art results on difficult facial recognition benchmarks. Despite its proven usefulness, geometric augmentation still lacks the ability to create completely new data samples outside the scope of the dataset.
A complementary method to data augmentation is data generation. By constructing a high quality model for data generation it is possible to produce an infinitely large dataset. In addition, some models may permit control over the characteristics of each data sample. Within the domain of faces this would mean control over parameters such as age, gender, expression, pose and lighting conditions. When dealing with image data, a recent popular approach is to use a GAN (Goodfellow et al., 2014), which is in essence a neural network with a trainable loss function. While this class of methods is well suited for images, reformulation in the context of geometry is more challenging and several competing approaches exist in this field. (Gecer et al., 2018) and (Shrivastava et al., 2017) propose to construct samples from a low quality linear model, and then use a GAN in order to enforce the realism of the data. (Litany et al., 2018) and (Ranjan et al., 2018) both propose the use of convolutional autoencoders which are trained on pre-aligned geometric data. These methods, however, do not take into account the model texture. In addition, (Wu et al., 2016) have used the popular voxel grid representation for geometries, and are able to generate 3D objects using this notion. This method, however, is memory inefficient and in practice can produce only coarse geometries.
Beyond data augmentation and generation, pose normalization aims to decouple the subject's identity from other factors such as expression and pose which may confuse a classifier. This can be done either by geometric reconstruction and manipulation of the facial geometry or by performing normalization directly in the image domain. While (Chu et al., 2014) and (Bas et al., 2017) leverage a geometric representation in order to transform the data, (Tran et al., 2017b) and (Huang et al., 2017) are able to frontalize faces directly in the image domain as part of their pipeline. Although these methods are useful and help the training process by limiting data variation, they still do not explicitly model new data samples, which is our ultimate goal.
An additional method for geometrically manipulating facial data which has gained success is geometric reconstruction from a single image. One popular family of methods aims to fit a parametric model to an image. This idea was first introduced by (Blanz and Vetter, 1999) and has since been extended by works such as (Booth et al., 2017). An approach which involves regressing the coefficients of a given model via a deep network was popularized by (Richardson et al., 2016) and extended by (Richardson et al., 2017) and (Tran et al., 2017a). More recently, methods which are not restricted to a specific model or attempt to learn the model during training time, such as (Sela et al., 2017), (Tewari et al., 2018) and (Tran and Liu, 2018), have been able to leave the restricting assumptions of linear models such as 3DMM. Complementary efforts such as (Deng et al., 2018) propose to reconstruct occluded texture regions in order to gain a full textured reconstruction from challenging poses as well. Another recent work by (Saito et al., 2017) focuses on improving the quality of facial texture used in reconstructed faces in order to improve realism. An additional complementary approach proposed by (Riza Alp Güler, 2016) is to learn a direct mapping from an image to a template model. All of the above approaches, while useful, are based on fitting some geometry to a given image by relying on some underlying geometric model. This model, however, is not explicitly used in order to generate novel faces but rather to reconstruct existing ones.
Our most direct competition comes from several works in the field of facial generative modeling. The seminal work by (Blanz and Vetter, 1999), which pioneered the field almost two decades ago, is still widely used within many methods, some of which were mentioned above. The linear 3D Morphable Model proposed there is extremely flexible; however, it has the drawback of using a small number of PCA vectors, which limits its ability to present highly detailed models. A recent large scale effort taken by (Booth et al., 2016) and (Booth et al., 2018) has produced the largest publicly known 3DMM by scanning subjects and using their scans to construct the model. In contrast to linear models, much more complex relations can be captured by training deep networks to take the part of data generators. To this end, (Tewari et al., 2018) and (Tran and Liu, 2018) were able to jointly learn a reconstruction encoder while also learning the facial model itself. Given the trained model one could plausibly generate faces; however, the authors have not shown any experiments to this effect. (Ranjan et al., 2018), on the other hand, employed mesh autoencoders to construct new facial geometries; however, this method does not produce texture and was trained on a limited dataset of very few subjects. In this work we propose a new GAN based facial geometric generative model, and analyze the ability of our model to extend to new identities. We also relate the geometric and texture models, which are intrinsically correlated, and discuss different ways of exploiting this correlation for our cause.
3. 3D Morphable Model
One of the early attempts to capture facial geometry and photometry (texture) by a linear low dimensional space is the 3D Morphable Model (3DMM) of Blanz and Vetter (Blanz and Vetter, 1999). Using the 3DMM, textures and geometries of faces can be synthesized as a linear combination of the elements of an orthogonal basis. The basis is constructed from a collection of facial scans by applying principal component analysis after alignment of the faces. That is, the basis construction process relies on a vertex to vertex alignment of the facial scans, which is achieved by computationally finding a dense correspondence between each scan and a template model. The aligned vertices provide a set of spatial and texture coordinates which are then decomposed into the principal components of the set. Once the basis is constructed, it is possible to represent each face by projecting it onto the first $k$ components of both the geometry and the texture bases.
This linear model was used to reconstruct 3D faces from 2D images; Blanz and Vetter (Blanz and Vetter, 1999) took an analysis-by-synthesis approach, which attempts to fit a projected surface model embedded in $\mathbb{R}^3$ into a given 2D image. This was established by constructing a fully differentiable parametric image formation pipeline, and performing a gradient descent procedure optimizing for an image to image loss on the model parameters. The parameters consist of the geometry and texture model coefficients of the face, as well as the lighting and pose parameters. This process results in a set of coefficients which encode the geometry and texture of any given face up to their projections on the principal components basis, effectively reconstructing the curved surface structure and the photometry of the given image of a face.
3.1. Model Construction
According to the 3DMM model, each face is represented as an ordered set of $m$ geometric coordinates $g \in \mathbb{R}^{3m}$ and $m$ texture coordinates in RGB space $t \in \mathbb{R}^{3m}$. Given a set of $n$ faces, each represented by geometry and texture vectors $g_i$ and $t_i$, construct the $3m \times n$ matrices $G$ and $T$ by column wise concatenation of all geometric coordinates and all corresponding texture coordinates. Since the alignment process ensures an ordered universal representation of all faces, Principal Component Analysis (PCA) (Jolliffe, 1986) can be applied to extract the optimal first $k$ orthogonal basis components in terms of reconstruction error. To that end, denote by $V_g$ and $V_t$ the $3m \times k$ matrices that contain the left singular vectors of $\Delta G = G - \mu_g \mathbb{1}^T$ and $\Delta T = T - \mu_t \mathbb{1}^T$, respectively, where $\mu_g$ and $\mu_t$ are the average geometry and texture of the faces and $\mathbb{1}$ is a vector of ones.
By ordering $V_g$ and $V_t$ according to the magnitude of the singular values in a descending order, the texture and the geometric coordinates of each given face can be approximated by the linear combination

$$ g_i \approx \mu_g + V_g \alpha_{g_i}, \qquad t_i \approx \mu_t + V_t \alpha_{t_i}, \tag{1} $$

where $\alpha_{g_i}$ and $\alpha_{t_i}$ are the coefficient vectors, obtained by $\alpha_{g_i} = V_g^T (g_i - \mu_g)$ and $\alpha_{t_i} = V_t^T (t_i - \mu_t)$. Following this formulation, it is possible to use such a model to generate new faces by randomly selecting the geometry and texture coefficients and plugging them into Equation 1. According to (Blanz and Vetter, 1999), the coefficients are assumed to follow a multivariate normal distribution, so that the probability of a coefficient vector $\alpha$ is given by

$$ P(\alpha) \sim \exp\left( -\tfrac{1}{2} \alpha^T \Sigma^{-1} \alpha \right), \tag{2} $$

where $\Sigma$ is a covariance matrix that can be empirically estimated from the data, and is generally assumed to be diagonal.
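The construction above can be illustrated with a minimal NumPy sketch. It assumes the aligned faces are already stacked as columns of a matrix; the function names (`build_basis`, `project`, `reconstruct`) and the random toy data are hypothetical and only stand in for real scans.

```python
import numpy as np

def build_basis(X, k):
    """PCA basis of aligned faces. X has shape (3m, n), one face per column."""
    mu = X.mean(axis=1)                                  # average face, shape (3m,)
    # Left singular vectors of the centered data, ordered by descending singular value.
    U, s, _ = np.linalg.svd(X - mu[:, None], full_matrices=False)
    return mu, U[:, :k], s[:k]

def project(x, mu, V):
    """Coefficients of face x in the basis V."""
    return V.T @ (x - mu)

def reconstruct(alpha, mu, V):
    """Approximate a face from its coefficients."""
    return mu + V @ alpha

# Toy stand-in for aligned scans: n faces of m vertices whose centered data has rank k.
rng = np.random.default_rng(0)
m, n, k = 50, 20, 5
G = rng.normal(size=(3 * m, 1)) + rng.normal(size=(3 * m, k)) @ rng.normal(size=(k, n))

mu_g, V_g, s_g = build_basis(G, k)
alpha_0 = project(G[:, 0], mu_g, V_g)
g_0 = reconstruct(alpha_0, mu_g, V_g)
print(np.abs(g_0 - G[:, 0]).max())   # close to zero for this rank-k toy data
```

The same three functions would apply unchanged to the texture matrix, since geometry and texture are treated identically at this stage.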
3.2. Synthesis model
The 3D morphable model is useful not only in the context of representation and reconstruction, but, as noted in the previous section, it also allows for the generation of new faces which cannot be found in the training set. The synthesis is achieved by randomizing linear combinations of the basis vectors. The random coefficients are drawn according to the model prior from the distribution described in Equation 2. As is common practice when dealing with principal components, only the first $k$ vectors are taken into account as part of the model. The number $k$ can be obtained by analyzing the decay of the singular values, which is proportional to the error produced by ignoring the associated basis vector. By excluding the vectors for which the singular values are sufficiently small we can guarantee minimal loss of data.
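One possible way to read the number of retained vectors off the singular value decay is an energy threshold. The sketch below is an assumption for illustration: the text does not state which criterion was used, and the `energy` parameter and `choose_k` name are hypothetical.

```python
import numpy as np

def choose_k(singular_values, energy=0.99):
    """Smallest k whose leading components retain `energy` of the total squared singular values."""
    s2 = np.asarray(singular_values, dtype=float) ** 2
    retained = np.cumsum(s2) / s2.sum()          # fraction retained by the first 1, 2, ... vectors
    k = int(np.searchsorted(retained, energy)) + 1
    return min(k, len(s2))                       # guard against rounding at energy close to 1

s = np.array([10.0, 1.0, 0.1])                   # toy, rapidly decaying singular values
k_loose = choose_k(s, energy=0.95)
k_tight = choose_k(s, energy=0.9995)
print(k_loose, k_tight)
```

A stricter threshold keeps more vectors, trading model compactness for reconstruction accuracy.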
Even though two decades have passed since the inception of the 3DMM, it is still widely used in cutting edge applications. By harnessing the generative powers of this model, it has been used as a tool for data augmentation and data creation for training of convolutional networks (Sela et al., 2017; Richardson et al., 2016; Richardson et al., 2017; Gecer et al., 2018). Furthermore, the model has been integrated into deep learning pipelines in order to provide structure and regularization to the learning process (Tewari et al., 2018). In spite of the wide use and apparent success of the model it is clear that the faces obtained from it tend to be over-smoothed and in some cases non-realistic. Furthermore, the multivariate normal distribution model from which the coefficients are drawn is over simplified and does not represent the true distribution of faces. In particular, the texture and geometry are treated as two uncorrelated variables, in contradiction to empirical evidence. Figure 1 shows a few samples of synthesized 3DMM faces and depicts the difference between the distributions of 3DMM generated faces and real ones.
4. Progressive growing GAN
Generation of novel plausible data samples requires learning the underlying distribution of the data. Given a perfect discriminator which can differentiate between real and fake data samples it is possible to construct a training loss for a generator model which tries to maximally confuse the discriminator. For complex realistic data, finding such a discriminator is a difficult problem on its own and requires learning from realistic and fake examples.
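The fundamental idea of the GAN framework is to train both of these networks simultaneously. Essentially, this means that we use a trainable loss function for the generator which constantly evolves as the generator improves. This process can be formulated as

$$ \min_{\mathcal{G}} \max_{\mathcal{D}} \; \mathbb{E}_{x \sim p_{\text{data}}} \left[ \log \mathcal{D}(x) \right] + \mathbb{E}_{z \sim p_z} \left[ \log \left( 1 - \mathcal{D}(\mathcal{G}(z)) \right) \right], \tag{3} $$

where $\mathcal{D}, \mathcal{G}$ are the discriminator and generator parametric functions, and $x, z$ are the real data samples and latent representation vector, respectively.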
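As a hedged illustration of the adversarial objective, the sketch below evaluates the classical GAN value function of (Goodfellow et al., 2014) from discriminator outputs. It is not necessarily the loss used to train the progressive growing GAN, which may rely on a different formulation; the function names are hypothetical.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """E[log D(x)] + E[log(1 - D(G(z)))] for discriminator outputs in (0, 1)."""
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def discriminator_loss(d_real, d_fake):
    """The discriminator maximizes the value, so it minimizes its negative."""
    return -gan_value(d_real, d_fake)

def generator_loss(d_fake, eps=1e-12):
    """The generator only influences the second term, which it minimizes."""
    return np.mean(np.log(1.0 - d_fake + eps))

# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided = np.full(8, 0.5)
v = gan_value(undecided, undecided)
print(v, -np.log(4.0))
```

At the undecided point the value equals $-\log 4$, the known equilibrium value of this objective.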
Since we wish to produce high resolution textures for facial geometries, we propose to use a recent successful GAN, namely the progressive growing GAN (Karras et al., 2017). The progressive growing GAN is built in levels which gradually increase the resolution of the output image. During the training process each level is added consecutively, while smoothly blending the new levels into the output as they are added. This and several other techniques were shown to increase the training stability as well as the variation of the produced data.
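The smooth blending of a newly added level can be sketched as a linear interpolation between the upsampled output of the previous level and the output of the new level. This is a simplified sketch under that assumption; the names `upsample2x` and `fade_in` and the blending schedule are illustrative and not taken from the text.

```python
import numpy as np

def upsample2x(img):
    """Nearest-neighbor 2x upsampling of an (H, W, C) image."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def fade_in(low_res, high_res, alpha):
    """Blend a new high-resolution level into the output; alpha grows from 0 to 1 during training."""
    return (1.0 - alpha) * upsample2x(low_res) + alpha * high_res

low = np.arange(4.0).reshape(2, 2, 1)      # output of the previous (2x2) level
high = np.ones((4, 4, 1))                  # output of the newly added (4x4) level
start = fade_in(low, high, alpha=0.0)      # identical to the upsampled old output
end = fade_in(low, high, alpha=1.0)        # identical to the new level's output
```

With `alpha = 0` the network behaves exactly as before the level was added, which is what keeps the transition from disrupting the already trained levels.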
The difficulty concerning geometric data is that it lacks the regular intrinsic ordering which exists in 2D images, which are essentially large matrices. For this reason, it is unclear how to apply spatial filtering, which is the core building block of neural network layers, to arbitrary geometric structures. Significant progress has been made in this direction by several recent papers. A comprehensive survey is presented in (Bronstein et al., 2017). These methods, however, are not yet widely used and supported within standard deep learning coding libraries. In order to harness the full power of recent state of the art developments in the field, it is sometimes preferable to work in the domain of images. For this reason, we built a data processing pipeline which maps the geometric scanned data into a flat canonical image which allows the utilization of the progressively growing GAN without major modifications.
5. Training data construction
In this section we describe the process by which we produce our training data. We start with digital geometric scans of human faces. By making use of a surface to surface alignment process (Weise et al., 2009), we are able to bring all the scans into correspondence with each other. Next, applying a universal mapping from the mesh to the 2D plane, we can transfer the facial texture into a canonically parametrized image. These aligned texture images are used to train our texture generation model.
We provide several alternatives for constructing the facial geometry which accompanies each texture. One solution is to learn the relation between 3DMM texture and geometry coefficients which is prevalent in the training data. In addition, we can similarly process the geometric data of the faces as well. By applying the same canonical transformation and encoding the coordinates of the model vertices as RGB channels of an image, we can learn to generate geometries as well as textures using the same methodology.
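Encoding vertex coordinates as color channels can be sketched as a per-axis affine normalization. The sketch assumes the geometry has already been resampled on the canonical image grid, and that dataset-wide bounds `lo` and `hi` are available; both are assumptions made for illustration, as are the function names.

```python
import numpy as np

def encode_geometry_image(xyz_grid, lo, hi):
    """Map (H, W, 3) coordinates to [0, 1] so that x, y, z play the role of R, G, B."""
    return (xyz_grid - lo) / (hi - lo)

def decode_geometry_image(img, lo, hi):
    """Invert the encoding to recover coordinates from a (generated) geometry image."""
    return img * (hi - lo) + lo

rng = np.random.default_rng(0)
xyz = rng.normal(scale=50.0, size=(8, 8, 3))       # toy coordinates on an 8x8 grid
lo, hi = xyz.min(axis=(0, 1)), xyz.max(axis=(0, 1))
geo_img = encode_geometry_image(xyz, lo, hi)
xyz_back = decode_geometry_image(geo_img, lo, hi)
```

Using the same bounds for every sample keeps the encoding consistent across the dataset, so that a generated image can be decoded without per-sample metadata.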
5.1. Face scanning and marking
Our training data formulation process starts by acquiring digital high resolution facial scans. Using a 3DMD scanner, roughly different subjects were scanned, each making five distinct facial expressions including a neutral expression. The subjects were selected to form a balanced mixture of genders and ethnic backgrounds. Each scan is comprised of the facial geometry, represented by a triangulated mesh, as well as two high resolution photographs, which capture a 180 degree view of the subject’s face. Each mesh triangle is automatically mapped to one of the photos, allowing the facial texture to be transferred onto the mesh.
Due to the variety of facial geometries, as well as limitations of the scanning process, the meshes may contain imperfections such as holes and areas of missing texture. These data corruptions may affect the training data samples that are given to the network and care must be taken not to hinder the training process. The straightforward path is to filter out the erroneous data samples completely. This leads to a significant reduction of roughly 20% in the size of the training set. Instead, we propose a new approach which incorporates corrupted scans without compromising the integrity of the training data. We describe our approach to learning from corrupted data in section 6.
In order to facilitate the alignment process described in subsection 5.2, we annotate each face with 43 landmark locations. These locations are determined automatically by projecting the facial surface into an image and applying one of many 2D facial landmark detectors such as Dlib (King, 2009). The landmarks are then back-projected onto the surface to determine their location. Finally, the locations of the automatically generated landmarks are manually refined in order to prevent displacements that could lead to large errors during the alignment process.
5.2. Non-Rigid Alignment
The goal of the alignment process is to find a dense correspondence between all the facial geometries. It is performed by aligning all scans to a single facial template. This correspondence is achieved by deforming the template into the scanned surface, a process guided by the pre-computed landmarks.
Initially, a rigid alignment between the scanned surface and template is performed as a preprocessing step. This is done by solving for the rotation, translation, and uniform scaling between the scan and template landmarks. The deformation process is performed by defining a fitting energy which takes into account both surfaces and known landmarks and measures how closely they fit each other. The energy also includes a regularization term which penalizes non-smooth deformations. The template mesh is deformed by moving each vertex according to the energy gradient in an iterative manner.
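The text does not say how the rotation, translation and uniform scaling are solved for. One standard closed-form option is the least-squares similarity estimate of Umeyama, sketched below as an assumption rather than as the method actually used; `similarity_align` is a hypothetical name.

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares scale s, rotation R, translation t such that dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding landmarks.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                      # cross-covariance of the two point sets
    U, S, Vt = np.linalg.svd(cov)
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:    # avoid returning a reflection
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    scale = (S * d).sum() / ((xs ** 2).sum() / len(src))
    t = mu_d - scale * R @ mu_s
    return scale, R, t

# Synthetic check: transform 43 random landmarks with a known similarity and recover it.
rng = np.random.default_rng(1)
src = rng.normal(size=(43, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0                                  # make Q a proper rotation
t_true = np.array([0.5, -2.0, 3.0])
dst = 1.7 * src @ Q.T + t_true
scale, R, t = similarity_align(src, dst)
```

Applying the recovered transform to all scan vertices gives the coarse alignment from which the non-rigid deformation starts.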
The loss function which is minimized during the alignment process was first described by (Blanz and Vetter, 1999) and is comprised of three terms which contribute to the final alignment. The first term accumulates the distances between the facial landmark points on the scanned facial surface and their corresponding points on the template mesh. The second term accumulates the distances from all the template mesh points to the scanned surface. The third term serves as a regularization, and penalizes non-smooth deformations. The loss is minimized by taking its derivative with respect to the template vertex coordinates, then deforming the template in the gradient direction. This process naturally provides a dense point to point correspondence between each and every scanned surface.
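The three terms can be sketched as follows. The exact form of each term is not given in the text, so this is only one plausible reading: the surface term is approximated by the distance to the nearest scan vertex (brute force), the regularizer penalizes differing displacements of neighboring vertices, and the weights `w` are hypothetical.

```python
import numpy as np

def alignment_energy(V, V0, edges, scan_pts, lm_idx, lm_targets, w=(1.0, 1.0, 1.0)):
    """Fitting energy of a deformed template.

    V: (N, 3) current template vertices, V0: (N, 3) initial template vertices,
    edges: (E, 2) vertex index pairs, scan_pts: (M, 3) scan vertices,
    lm_idx: template landmark indices, lm_targets: matching scan landmark positions.
    """
    # Term 1: squared distances between corresponding landmarks.
    e_landmarks = np.sum((V[lm_idx] - lm_targets) ** 2)
    # Term 2: squared distance from each template vertex to its nearest scan vertex.
    d2 = ((V[:, None, :] - scan_pts[None, :, :]) ** 2).sum(axis=-1)
    e_surface = d2.min(axis=1).sum()
    # Term 3: neighboring vertices should be displaced similarly (smooth deformation).
    disp = V - V0
    e_smooth = np.sum((disp[edges[:, 0]] - disp[edges[:, 1]]) ** 2)
    return w[0] * e_landmarks + w[1] * e_surface + w[2] * e_smooth

rng = np.random.default_rng(0)
V0 = rng.normal(size=(10, 3))
edges = np.array([[i, i + 1] for i in range(9)])
lm_idx = np.array([0, 3])
e_fit = alignment_energy(V0, V0, edges, V0, lm_idx, V0[lm_idx])           # perfect fit
e_shift = alignment_energy(V0 + 0.1, V0, edges, V0, lm_idx, V0[lm_idx],
                           w=(1.0, 0.0, 0.0))                              # landmark term only
```

A rigid shift of the whole template leaves the smoothness term at zero, which is why the data terms are needed to pin the template to the scan.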
5.3. Universal mapping
Given a scanned facial surface with unknown parametrization, our goal in this section is to discover a 2D parametrization of the surface which maps it to a unit rectangle, such that this mapping is consistent for all scans. In subsection 5.2, we described the process of aligning the facial surface template to a scanned facial surface, thereby bringing them into correspondence. The obtained correspondence allows us to transfer the parametrization from the template to all scans, thus establishing a universal parametrization. In what follows, we define the unique parametrization between the template face and the unit rectangle.
The authors of (Slossberg et al., 2018) defined the mapping between the scan and the plane by using a ray casting technique built into the animation rendering toolbox of Blender (Blender Online Community, 2017). Figure 3 depicts several examples of the resulting mapped facial photometry. Although it would be possible to make use of the same parametrization, an alternative definition may suit us better. The Blender mapping, for example, does not exploit the entire square image for the mapping. Moreover, it does not take the facial structure into account. The eyes, nose, and mouth, for instance, clearly contain more details than smoother parts of the face such as the cheeks and forehead. It is reasonable to assume that it would be easier to learn and reconstruct the main features, perhaps at the expense of other parts, if they take up a larger portion of the input images. To that end, we propose to construct a weighted parametrization that will allow us to control the relative area in the plane taken up by each facial feature.
In (Floater, 1997), the authors presented a parametrization technique that allows one to choose for each vertex its barycentric coordinates with respect to its neighbors. The authors demonstrate that any set of barycentric coordinates has a unique planar graph with a valid triangulation that fulfills it. As an extension, they also provided a method for a weighted least square parametrization that allows some control over the edge lengths in the resulting parametrization. The method is briefly described below.
Given any triangulated mesh, the objective is to map it into a valid planar graph with the same connectivity. Assuming a mesh with $n$ vertices $v_1, \dots, v_n$, choose a set $B$ of boundary vertices from the mesh and fix their 2D mapping values to some desired convex boundary, $u_b$ for $b \in B$. For any other vertex $v_i$ in the mesh, choose a set of non-negative barycentric coordinates $\lambda_{ij}$, such that $\sum_j \lambda_{ij} = 1$, and $\lambda_{ij} = 0$ if and only if $v_i$ and $v_j$ are not connected. Then, for $i \notin B$, solve the linear system of equations

$$ u_i = \sum_{j} \lambda_{ij} u_j. \tag{4} $$

The authors in (Floater, 1997) prove that Equation 4 has a unique closed form solution that coincides with the chosen barycentric coordinates. According to (Floater, 1997), this technique could be extended to a weighted least square parametrization. For any desired set of weights $w_{ij}$, it was shown that the choice of

$$ \lambda_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}} $$

minimizes the functional $\sum_{(i,j) \in E} w_{ij} \| u_i - u_j \|^2$, where $E$ represents the set of edges.
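The weighted parametrization amounts to solving one linear system for the interior vertices. The following sketch uses a dense matrix and a five-vertex toy mesh for readability; a real facial mesh would call for a sparse solver, and the function name and argument layout are assumptions made for illustration.

```python
import numpy as np

def barycentric_parametrization(n, edges, weights, boundary_idx, boundary_uv):
    """Solve u_i = sum_j lambda_ij u_j for interior vertices, with fixed boundary positions."""
    W = np.zeros((n, n))
    for (i, j), w in zip(edges, weights):
        W[i, j] = W[j, i] = w                        # symmetric edge weights
    lam = W / W.sum(axis=1, keepdims=True)           # lambda_ij = w_ij / sum_k w_ik
    A = np.eye(n) - lam                              # rows encode u_i - sum_j lambda_ij u_j = 0
    b = np.zeros((n, 2))
    A[boundary_idx] = 0.0                            # boundary rows become u_b = prescribed value
    A[boundary_idx, boundary_idx] = 1.0
    b[boundary_idx] = boundary_uv
    return np.linalg.solve(A, b)

# Toy mesh: four corners of the unit square (0..3) and one interior vertex (4).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 0), (4, 1), (4, 2), (4, 3)]
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
boundary = [0, 1, 2, 3]

uv_uniform = barycentric_parametrization(5, edges, [1.0] * 8, boundary, corners)
uv_weighted = barycentric_parametrization(5, edges, [1, 1, 1, 1, 3, 1, 1, 1], boundary, corners)
print(uv_uniform[4], uv_weighted[4])
```

In this toy example, raising the weight of the edge between vertex 4 and corner 0 pulls the interior vertex toward that corner, so a larger weight shortens the edge in the plane.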
Following this technique, we designed the weights such that the eyes, nose and mouth would receive a larger area in the parametrization plane. We defined a weight for each vertex in the template face, and then gave each edge the average weight of its two adjacent vertices. Note that the resulting edge lengths also depend on the density of vertices in the mesh. In other words, when choosing a constant weight for all edges, the edge lengths of the resulting parametrization, termed the uniform barycentric parametrization, are not constant. To design the edge weights more intuitively, we normalize the edge weights by the ones resulting from the uniform barycentric parametrization. A visualization of the edge weights is shown in Figure 3.
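To choose the boundary vertices $b_1, \dots, b_p$, we follow the outer boundary of the facial mesh, starting from the center bottom (a point on the chin), while measuring the length of edges we pass through, $l_1, \dots, l_p$, where $l_i$ is the length of the edge leaving $b_i$. Assume the image boundary is parametrized by $C(t)$ for $t \in [0, 1]$, such that $C(0) = C(1)$ is the bottom center of the image. Then, we set the mapped boundary positions $u_{b_i} = C(t_i)$, where

$$ t_i = \frac{\sum_{j=1}^{i-1} l_j}{\sum_{j=1}^{p} l_j}. $$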
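The boundary assignment can be sketched as an arc-length mapping of the ordered boundary loop onto the border of the unit square. The traversal direction (counter-clockwise) and the helper names are assumptions made for illustration.

```python
import numpy as np

def square_boundary(t):
    """Point on the unit-square border at parameter t in [0, 1]; t = 0 is the bottom center."""
    s = (4.0 * t + 0.5) % 4.0                   # distance travelled from the bottom-left corner
    side, r = int(s), s - int(s)
    return [(r, 0.0), (1.0, r), (1.0 - r, 1.0), (0.0, 1.0 - r)][side]

def map_boundary(boundary_xyz):
    """Map an ordered closed loop of (B, 3) boundary vertices, starting at the chin, to the square."""
    nxt = np.roll(boundary_xyz, -1, axis=0)
    lengths = np.linalg.norm(nxt - boundary_xyz, axis=1)        # edge i leaves vertex i
    t = np.concatenate([[0.0], np.cumsum(lengths)[:-1]]) / lengths.sum()
    return np.array([square_boundary(ti) for ti in t])

# Toy boundary: 8 evenly spaced points on a circle, starting at the bottom.
angles = -np.pi / 2 + 2 * np.pi * np.arange(8) / 8
loop = np.stack([np.cos(angles), np.sin(angles), np.zeros(8)], axis=1)
uv = map_boundary(loop)
```

Because positions are assigned by accumulated edge length, boundary vertices keep their relative spacing along the image border.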
Lastly, unlike (Slossberg et al., 2018), we propose to construct a symmetric mapping, in order to augment the data by mirroring the training samples. This could be done by ensuring that the template is intrinsically symmetric, as well as the choice of boundary vertices and edge weight. The resulting mapping and a visualization of the edge weights are shown in Figure 3. The rightmost part in Figure 3 shows that when mapping back the unwrapped texture to the facial geometry, a better resolution is obtained when using the proposed method.
6. Learning from corrupted data
The semi-automatic data acquisition pipeline described in section 5 is used to construct a dataset of 2D images that will be used to train the GAN. Naturally, some of the generated data samples contain corrupted parts due to errors in one or more of the pipeline stages. In the 3D scanning process, for example, facial textures that contain hair are often not captured well. Other reasons for incomplete texture are occlusions and a limited camera field of view. The geometry of the eyes is occasionally distorted due to their high specular reflection properties. In the landmark annotation stage, some landmarks can be inaccurate or even wrong, resulting in various distortions in the final output. Figure 4 provides several examples of such data corruptions.
One way to handle data corruption is to ignore imperfect images and keep only the valid ones. In our case, manual screening of the data reduced the number of samples to only the valid ones, thus eliminating roughly 20% of the data. Here, we propose a novel technique for training GANs using partially incomplete data that is able to exploit undamaged parts and robustly deal with corrupted data.
To that end, we propose to pair each training image with a binary valid mask that marks the areas in the image that should be ignored. Without loss of generality, black areas in the masks (zero values) correspond to corrupted regions in the image we would like to ignore, and white regions (values of one) correspond to valid parts we would like to exploit for training the network. We propose to multiply these valid masks by their corresponding images, as well as concatenate them as a fourth channel (R-G-B-mask). Recall that the discriminator receives as an input a batch of real images and a batch of fake images. To prevent the discriminator from discriminating real and fake images by the valid masks, the same masks are multiplied and concatenated to both real and fake batches. The generator, which does not get the masks as an input, must produce complete images, in-painting the masked regions. Otherwise, the discriminator would be able to easily identify masked parts that do not match the valid masks and conclude that the image is fake. The valid masks could be constructed either manually or using automatic image processing techniques for detection of the unwanted parts. The discriminator and generator of the proposed GAN model are demonstrated in Figure 5.
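The masking step can be sketched independently of any deep learning framework. The array layout (batch, height, width, channels) and the function name are assumptions; the key point shown is that the masks of the real batch are applied to the fake batch as well.

```python
import numpy as np

def discriminator_input(images, masks):
    """Zero out invalid pixels and append the mask as a fourth channel.

    images: (B, H, W, 3) array, masks: (B, H, W) binary array with 1 = valid, 0 = corrupted.
    Returns a (B, H, W, 4) array in R-G-B-mask order.
    """
    m = masks[..., None].astype(images.dtype)
    return np.concatenate([images * m, m], axis=-1)

rng = np.random.default_rng(0)
real = rng.uniform(size=(2, 4, 4, 3))            # toy real batch
fake = rng.uniform(size=(2, 4, 4, 3))            # toy generator output (complete images)
masks = np.ones((2, 4, 4))
masks[:, :2, :2] = 0                             # a corrupted region in each real image

d_real_in = discriminator_input(real, masks)
d_fake_in = discriminator_input(fake, masks)     # same masks, so masks carry no real/fake cue
```

Since both batches share identical mask channels and identical zeroed regions, the discriminator can only separate them by the content of the valid pixels.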
To demonstrate the performance of the proposed GAN we constructed a synthetic dataset of different colored shapes randomly located in images of size . In this simple experiment, we treat the red circles as corruptions that we would like our model to ignore. Figure 6 shows the data images, the valid masks, and the resulting GAN output. It is clearly seen that the proposed GAN model generated new data images without the unwanted red circles.
7. Facial Surface Generator
We propose to train a model which is able to generate realistic geometries and photometries (textures or colors) of human faces. The training data for our model is constructed according to section 5, and used to train a neural network according to a GAN loss. At inference, the trained model is used to produce random plausible facial textures which are mapped by our predefined parametrization described in subsection 5.3. In order to also generate corresponding facial geometries for each new texture, we propose two novel approaches. The first approach is based on training a similar model for geometries. This is done by mapping the training set geometry coordinates using the canonical parametrization into the unit rectangle. By treating each coordinate as a color channel, we form geometry images which we use to train our geometry generator model. The second approach relies on the classical 3DMM model. For both approaches we suggest a method to generate a geometry which is a plausible fit for a given texture. In the following sections we describe the two proposed approaches in detail.
7.1. Generating textures using GAN
Our texture generation model is based on a Convolutional Neural Network which is trained using a GAN loss. Due to this loss, we are able to train a model that satisfies the distribution of real data samples, by drawing new samples out of this distribution. By training our model on the dataset which we constructed according to section 5, we are able to generate new plausible textures which are all mapped to the unit rectangle plane according to the predefined parametrization described in subsection 5.3. As we will show in the following sections, the textures generated by the proposed model present novel yet realistic human faces. Since texture and geometry are inseparable attributes of the same geometric entity, it is necessary to take the relationship between them into account when generating the corresponding geometries. In section 8 and subsection 7.2 we describe in detail the proposed geometry generation pipeline, which takes as input a generated texture and produces a corresponding plausible geometry. Several outputs from the suggested texture generation model are depicted in Figure 7.
7.2. Assigning geometries to textures
Once novel textures have been generated, we would like to assign them plausible synthetic geometries in order to obtain realistic face models. One way to generate geometries is by exploiting the 3DMM model by which geometries can be recovered through proper selection of the coefficients. In what follows, we discuss and compare several methods for obtaining the 3DMM geometry coefficients.
The simplest way of synthesizing a geometry for a given texture is by picking random 3DMM geometry coefficients. We follow the formulation in Equation 2. The probability of a coefficient $\alpha_i$ is given by
$$P(\alpha_i) \sim \exp\left(-\frac{\alpha_i^2}{2\sigma_i^2}\right),$$
where $\sigma_i^2$ is the $i$-th eigenvalue of the covariance of the centered training geometries. $\sigma_i^2$ can be computed more efficiently as $\sigma_i^2 = \frac{1}{n} s_i^2$, where $s_i$ is the $i$-th singular value of the centered geometry matrix and $n$ is the number of samples. To fit a geometry to a given texture, we randomize a vector of coefficients from the above probability distribution and reconstruct the geometry using the 3DMM formulation.
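The random assignment described here can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this work: the function names, the toy data, and the $1/n$ scaling of the covariance are assumptions, and geometries are assumed to be stored as flattened column vectors.

```python
import numpy as np


def fit_geometry_pca(geometries, k):
    """Fit a k-dimensional PCA basis to geometries stored as columns (d x n)."""
    n = geometries.shape[1]
    mean = geometries.mean(axis=1, keepdims=True)
    centered = geometries - mean
    # Left singular vectors of the centered data are the PCA basis vectors.
    basis, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
    # Covariance eigenvalues from singular values (the 1/n scaling is an assumption).
    variances = singular_values[:k] ** 2 / n
    return mean[:, 0], basis[:, :k], variances


def sample_random_geometry(mean, basis, variances, rng):
    """Draw coefficients alpha_i ~ N(0, sigma_i^2) and reconstruct a geometry."""
    coeffs = rng.normal(loc=0.0, scale=np.sqrt(variances))
    return mean + basis @ coeffs, coeffs


rng = np.random.default_rng(0)
toy_geometries = rng.normal(size=(30, 12))  # toy data: 10 vertices x 3 coordinates, 12 faces
mean, basis, variances = fit_geometry_pca(toy_geometries, k=5)
geometry, coeffs = sample_random_geometry(mean, basis, variances, rng)
```

Computing the variances from the singular values avoids forming the full covariance matrix, which is the efficiency argument made in the text.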
Random geometries are simple to generate. Yet, not every geometry can actually fit any texture. To illustrate this, we computed the canonical correlation (Hotelling, 1936) between the 3DMM texture and geometry coefficients, $\alpha_t$ and $\alpha_g$, of the facial scans. Figure 8 plots the first two canonical variables of the correlation against each other. In what follows, we attempt to generate geometries which are suited for their textures.
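For readers who wish to reproduce this kind of analysis, canonical correlation can be computed with a QR decomposition followed by an SVD. The sketch below is illustrative only: the function name and toy data are hypothetical, coefficient vectors are assumed to be stored as columns, and both centered coefficient sets are assumed to have full rank.

```python
import numpy as np


def canonical_correlation(tex_coeffs, geo_coeffs):
    """CCA between coefficient sets stored as columns (k_t x n and k_g x n)."""
    x = tex_coeffs.T - tex_coeffs.T.mean(axis=0)
    y = geo_coeffs.T - geo_coeffs.T.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of qx^T qy are the canonical correlations.
    u, corr, vt = np.linalg.svd(qx.T @ qy)
    tex_vars = qx @ u      # texture canonical variables, one per column
    geo_vars = qy @ vt.T   # geometry canonical variables, one per column
    return np.clip(corr, 0.0, 1.0), tex_vars, geo_vars


rng = np.random.default_rng(0)
tex = rng.normal(size=(3, 50))
mix = rng.normal(size=(2, 3))
geo = mix @ tex + 0.5 * rng.normal(size=(2, 50))  # geometry partly explained by texture
corr, tex_vars, geo_vars = canonical_correlation(tex, geo)
```

Plotting `geo_vars[:, 0]` against `tex_vars[:, 0]` gives a scatter plot of the first pair of canonical variables.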
7.2.2. Nearest neighbour
Given a new texture, a simple way to fit a geometry that is likely to match it is to find the data sample with the nearest texture and project its geometry onto the 3DMM subspace. For that task, we define the distance between two textures as the $L_2$ norm of the difference between their 3DMM texture coefficients. Only the 3DMM texture and geometry coefficients of the data need to be stored. Nearest neighbor geometries are simple to obtain; however, they are restricted to the training data geometries alone.
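A minimal sketch of this lookup is given below. The function name and toy data are hypothetical, the Euclidean distance between coefficient vectors is an assumption, and the training coefficients are assumed to be stored with one sample per column.

```python
import numpy as np


def nearest_neighbor_geometry(new_tex_coeffs, train_tex_coeffs, train_geo_coeffs):
    """Return the geometry coefficients of the training face with the closest texture.

    train_tex_coeffs is (k_t x n) and train_geo_coeffs is (k_g x n), same sample order.
    """
    dists = np.linalg.norm(train_tex_coeffs - new_tex_coeffs[:, None], axis=0)
    idx = int(np.argmin(dists))
    return train_geo_coeffs[:, idx], idx


rng = np.random.default_rng(0)
train_tex = rng.normal(size=(4, 6))
train_geo = rng.normal(size=(3, 6))
query = train_tex[:, 3] + 1e-3  # a texture very close to training sample 3
geo_coeffs, idx = nearest_neighbor_geometry(query, train_tex, train_geo)
```

The returned coefficients are then passed through the 3DMM geometry basis to obtain the actual vertices.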
7.2.3. Maximum likelihood
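The maximum likelihood (ML) estimator is typically used when one can formulate assumptions about the data distribution. In our case, given input facial textures, ML could be used to obtain the most likely geometries under a set of assumptions. We first construct a mutual 3DMM basis by concatenating textures and geometries. Define a vertical concatenation of geometries and textures as the matrix
$$M = \begin{pmatrix} G \\ T \end{pmatrix},$$
such that $G$ and $T$ are the geometry and texture matrices defined in section 3. Define $\bar{M} = M - \mu \mathbf{1}^T$, where $\mu$ holds the average of the rows of $M$. Denote by $\hat{V}$ the matrix that contains the first $k$ basis vectors of $\bar{M}$, i.e., corresponding to the largest magnitude eigenvalues. These vectors can be computed either as eigenvectors of $\bar{M}\bar{M}^T$ or, more efficiently, as the left singular vectors of $\bar{M}$. Denote by $\hat{V}_g$ and $\mu_g$ the upper halves of $\hat{V}$ and $\mu$, and denote by $\hat{V}_t$ and $\mu_t$ the lower halves of $\hat{V}$ and $\mu$, respectively, such that
$$\hat{V} = \begin{pmatrix} \hat{V}_g \\ \hat{V}_t \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_g \\ \mu_t \end{pmatrix}.$$
Note that $\hat{V}_g$ and $\hat{V}_t$, unlike the separate geometry and texture bases that were defined in section 3, are not orthogonal. Nevertheless, any geometry $g$ and texture $t$ of a given face in $M$ can be represented as a linear combination
$$\begin{pmatrix} g \\ t \end{pmatrix} \approx \begin{pmatrix} \mu_g \\ \mu_t \end{pmatrix} + \begin{pmatrix} \hat{V}_g \\ \hat{V}_t \end{pmatrix} \beta,$$
where the coefficient vector $\beta$ is mutual to the geometry and texture. Using the notations and definitions above, any new facial texture $t$ could be approximated through a coefficient vector $\beta$ as
$$t \approx \mu_t + \hat{V}_t \beta.$$
The maximum likelihood assumption is that $\beta$, the centered texture $t - \mu_t$, and the residual $t - \mu_t - \hat{V}_t\beta$ follow multivariate normal distributions with zero mean. Given a facial texture $t$, our goal is to compute the most likely coefficient vector $\beta^*$ under this assumption, and then obtain the most likely geometry as
$$g^* = \mu_g + \hat{V}_g \beta^*.$$
Following Bayes' rule, one could formulate the most likely coefficient vector as
$$\beta^* = \arg\max_{\beta} P(\beta \mid t) = \arg\max_{\beta} \frac{P(t \mid \beta)\,P(\beta)}{P(t)} = \arg\max_{\beta} P(t \mid \beta)\,P(\beta).$$
Since $t \mid \beta$ and $\beta$ follow multivariate normal distributions, denote their covariance matrices by $\Sigma_{t|\beta}$ and $\Sigma_{\beta}$, and their mean vectors by $\mu_{t|\beta} = \mu_t + \hat{V}_t\beta$ and $\mu_{\beta} = 0$, respectively. Thus,
$$\beta^* = \arg\min_{\beta} \left[ (t - \mu_{t|\beta})^T \Sigma_{t|\beta}^{-1} (t - \mu_{t|\beta}) + (\beta - \mu_{\beta})^T \Sigma_{\beta}^{-1} (\beta - \mu_{\beta}) \right].$$
One could obtain a closed form solution for $\beta^*$ by vanishing its gradient, which yields
$$\beta^* = \left( \hat{V}_t^T \Sigma_{t|\beta}^{-1} \hat{V}_t + \Sigma_{\beta}^{-1} \right)^{-1} \hat{V}_t^T \Sigma_{t|\beta}^{-1} (t - \mu_t).$$
We estimate the covariance matrices $\Sigma_{\beta}$ and $\Sigma_{t|\beta}$ empirically from the data. Since the mean of each coefficient in $\beta$ with respect to all samples is zero, the covariance $\Sigma_{\beta}$ can be estimated by
$$\Sigma_{\beta} = \frac{1}{n} \sum_{i=1}^{n} \beta_i \beta_i^T,$$
where $\beta_i$ is the coefficient vector for face sample $i$ and $n$ is the number of samples. The covariance matrix $\Sigma_{t|\beta}$ is very large, impractical to estimate from a few thousand samples or to invert once estimated. Hence, for simplicity, we approximate it as a diagonal matrix that does not depend on $\beta$. One can verify that the mean of each element in $t - \mu_{t|\beta}$ with respect to all samples is zero. Hence, we estimate its $j$-th diagonal value as
$$\left( \Sigma_{t|\beta} \right)_{jj} = \frac{1}{n} \sum_{i=1}^{n} \left( t_i - \mu_t - \hat{V}_t \beta_i \right)_j^2.$$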
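The estimator of this subsection can be sketched numerically as follows. This is an illustration under stated assumptions, not the authors' code: all names and the toy data are hypothetical, samples are stored as columns, covariances use a $1/n$ scaling, and the texture noise covariance is taken to be diagonal as described in the text. The closed form solves $(V_t^T \Sigma^{-1} V_t + \Sigma_\beta^{-1})\beta = V_t^T \Sigma^{-1}(t - \mu_t)$, where $V_t$ is the texture half of the joint basis.

```python
import numpy as np


def fit_joint_model(G, T, k):
    """Joint PCA of stacked geometries G (d_g x n) and textures T (d_t x n)."""
    d_g, n = G.shape
    M = np.vstack([G, T])
    mu = M.mean(axis=1, keepdims=True)
    centered = M - mu
    V = np.linalg.svd(centered, full_matrices=False)[0][:, :k]
    betas = V.T @ centered                      # joint coefficients of training faces (k x n)
    V_g, V_t = V[:d_g], V[d_g:]
    mu_g, mu_t = mu[:d_g, 0], mu[d_g:, 0]
    cov_beta = betas @ betas.T / n              # prior covariance (1/n scaling assumed)
    residual = T - mu_t[:, None] - V_t @ betas
    # Diagonal approximation of the texture noise covariance, floored for stability.
    noise_var = np.maximum((residual ** 2).mean(axis=1), 1e-12)
    return dict(V_g=V_g, V_t=V_t, mu_g=mu_g, mu_t=mu_t,
                cov_beta=cov_beta, noise_var=noise_var)


def ml_geometry(t, model):
    """Closed-form most likely coefficients and geometry for a texture vector t."""
    V_t, noise_var = model["V_t"], model["noise_var"]
    weighted = V_t / noise_var[:, None]         # inverse noise covariance times V_t
    lhs = V_t.T @ weighted + np.linalg.inv(model["cov_beta"])
    beta = np.linalg.solve(lhs, weighted.T @ (t - model["mu_t"]))
    return model["mu_g"] + model["V_g"] @ beta, beta


rng = np.random.default_rng(0)
G = rng.normal(size=(9, 20))                    # toy geometries
T = rng.normal(size=(9, 20))                    # toy textures
model = fit_joint_model(G, T, k=4)
new_texture = rng.normal(size=9)
g_hat, beta = ml_geometry(new_texture, model)
```

Because the noise covariance is diagonal, its inverse is applied as an elementwise division, so the only matrix that is actually inverted has size $k \times k$.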
7.2.4. Least squares
Least squares (LS) minimization is a simple and very useful approach that should typically be used when the number of data samples is large enough. It can be thought of as training a multivariate linear regression with an $L_2$ loss. Assume a facial sample is represented by a texture vector $t$ and a geometry vector $g$. Denote by $\alpha_t$ and $\alpha_g$ the column vectors with the first $k_t$ and $k_g$ texture and geometry 3DMM coefficients of the face. These coefficients can be obtained by projecting $t$ and $g$ onto the 3DMM bases $V_t$ and $V_g$. Let the matrix $A_t$ hold the texture coefficient vectors of all samples in its columns, and let the matrix $A_g$ hold the geometry coefficient vectors of all samples in its columns in the same order as $A_t$. The correlation between $A_t$ and $A_g$ could be linearly approximated by
$$A_g \approx W A_t. \tag{22}$$
Note that we would not benefit from generalizing from a linear to an affine correlation. This is because the mean of each row in $A_t$ and $A_g$ is zero, as they hold the 3DMM coefficients of a centered set of samples. Following Equation 22, we would like to find a matrix $W$ that minimizes
$$\left\| W A_t - A_g \right\|_F^2.$$
A closed form solution is easily obtained to be
$$W = A_g A_t^T \left( A_t A_t^T \right)^{-1}.$$
Define $V_t$ and $V_g$ as holding the first $k_t$ and $k_g$ texture and geometry 3DMM basis vectors, and denote by $\mu_t$ and $\mu_g$ the mean texture and geometry. $W$ can be estimated using a set of training samples. Then, given a new texture $t$, one could fit a geometry $g$ by computing the texture coefficients as
$$\alpha_t = V_t^T (t - \mu_t),$$
computing the geometry coefficients as
$$\alpha_g = W \alpha_t,$$
and finally, computing the geometry as
$$g = \mu_g + V_g \alpha_g.$$
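The three steps above (project the texture, map the coefficients linearly, reconstruct the geometry) can be sketched as follows. The names and the synthetic data are hypothetical, samples are assumed to be stored as columns, and the map is fitted with `np.linalg.lstsq`, which gives the same minimizer as the normal-equation closed form when the texture coefficient matrix has full row rank.

```python
import numpy as np


def pca(samples, k):
    """Mean and first k PCA basis vectors of samples stored as columns."""
    mean = samples.mean(axis=1, keepdims=True)
    basis = np.linalg.svd(samples - mean, full_matrices=False)[0][:, :k]
    return mean[:, 0], basis


def fit_texture_to_geometry(T, G, k_t, k_g):
    """Fit the linear map W from texture coefficients to geometry coefficients."""
    mu_t, V_t = pca(T, k_t)
    mu_g, V_g = pca(G, k_g)
    A_t = V_t.T @ (T - mu_t[:, None])
    A_g = V_g.T @ (G - mu_g[:, None])
    # Minimize ||W A_t - A_g||_F by solving the transposed system A_t^T W^T = A_g^T.
    W = np.linalg.lstsq(A_t.T, A_g.T, rcond=None)[0].T
    return mu_t, V_t, mu_g, V_g, W


def predict_geometry(t, mu_t, V_t, mu_g, V_g, W):
    alpha_t = V_t.T @ (t - mu_t)   # texture coefficients
    alpha_g = W @ alpha_t          # geometry coefficients
    return mu_g + V_g @ alpha_g    # reconstructed geometry


# Toy data in which texture and geometry share a 3-dimensional latent code.
rng = np.random.default_rng(0)
z = rng.normal(size=(3, 41))
T_all = rng.normal(size=(12, 3)) @ z + 1.0
G_all = rng.normal(size=(15, 3)) @ z - 2.0
params = fit_texture_to_geometry(T_all[:, :40], G_all[:, :40], k_t=3, k_g=3)
predicted = predict_geometry(T_all[:, 40], *params)
```

In this idealized setting the held-out geometry is recovered exactly; on real scans the fit is only approximate, which is what the comparison in the next subsection measures.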
7.2.5. Geometry reconstruction method comparison
To evaluate how well the assigned geometries fit each texture, we use a test set of textures and their corresponding geometries obtained from scans that participated in neither the GAN training nor the geometry fitting procedures. Given these unseen textures as input, we estimate their geometries using the approaches presented above, and compare them to their ground truth. For this comparison, we computed the average $L_2$ norm between the vertices of the reconstructed and true geometry for each of the methods. In this experiment, we chose . Figure 9 shows examples of test textures mapped onto their assigned geometries that were obtained using each of the above methods.
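As a concrete reading of this error measure, the sketch below averages the per-vertex Euclidean distance between two geometries with identical vertex ordering. The function name is hypothetical and the Euclidean norm is an assumption.

```python
import numpy as np


def mean_vertex_error(reconstructed, ground_truth):
    """Average Euclidean distance between corresponding vertices of two (m x 3) geometries."""
    return float(np.linalg.norm(reconstructed - ground_truth, axis=1).mean())


reconstructed = np.zeros((4, 3))
ground_truth = np.tile([3.0, 4.0, 0.0], (4, 1))  # every vertex is displaced by a 3-4-5 offset
error = mean_vertex_error(reconstructed, ground_truth)
```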
It is clear that the LS approach obtains the best results on the test set. Since it is also simple and efficient, we choose to use the LS approach to approximate the geometries in the following sections. Note, however, that the other methods could be beneficial for other applications, depending on the case. Figure 10 visually compares the reconstructed geometries to the true ones for different textures from the test scans, using the LS approach. The $L_2$ norm between the reconstructed and true geometries is given below each example. The geometries, predicted solely from textures of identities that were never seen before, are surprisingly similar to the true ones. This validates the assumption of a strong correlation between textures and geometries.
8. Generating geometries using GAN
In subsection 7.2, we reconstructed geometries using the 3DMM model. Indeed, projecting geometries onto the subspace of 3DMM has almost no visual effect on the appearance of the faces. The 3DMM, however, is constrained to the subspace of training set geometries and cannot generalize to unseen examples.
In subsection 5.3, we mapped facial textures into 2D images, with the goal of producing new textures. The same methodology can be used for producing new geometries as well. To that end, we propose to construct a dataset of aligned facial geometries and train a GAN to generate new ones by repeating the texture mapping process while replacing the RGB texture values by the XYZ geometry values.
As for data augmentation, while the number of texture samples can be doubled by horizontally mirroring each of the images, we found that mirroring each one of the $X$, $Y$, and $Z$ values independently results in a valid facial geometry. Thus, the number of geometry samples can be augmented by a factor of $2^3 = 8$. Note, however, that when mirroring values, one should perform , where is a constant that could be set, for example, to the maximal value in all training samples. The training data and resulting generated geometries are shown in Figure 11.
9. Adding expressions
In previous sections we proposed a method for generating new facial textures and corresponding geometries. In order to complete the model we must also take expressions into consideration. We follow (Chu et al., 2014), who define a linear basis for expressions by taking the difference $\Delta g_i = g_i^{\mathrm{expr}} - g_i^{\mathrm{neutral}}$ between the expressive and neutral geometries for every face $i$ in the set. We then remove the mean difference vector to obtain the centered differences. We compute the principal components of the centered differences to obtain our geometric expression model. The expression difference model can be used by randomizing the expression coefficients $\alpha_e$ and adding the linear combination of the difference vectors to a generated neutral face as $g = g^{\mathrm{neutral}} + V_e \alpha_e$, where the columns of $V_e$ are the principal difference vectors.
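A minimal sketch of such a difference-based expression model is shown below. The names and toy data are hypothetical. Two modeling choices are assumptions: the expression coefficients are drawn from a zero-mean Gaussian with the PCA standard deviations, and the mean difference vector is not added back when an expression is applied.

```python
import numpy as np


def fit_expression_basis(expressive, neutral, k):
    """PCA of expression differences; column i of both (d x n) inputs is the same identity."""
    diffs = expressive - neutral
    centered = diffs - diffs.mean(axis=1, keepdims=True)
    n = diffs.shape[1]
    basis, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
    # Principal difference vectors and their standard deviations (1/n scaling assumed).
    return basis[:, :k], singular_values[:k] / np.sqrt(n)


def apply_expression(neutral_face, basis, coeffs):
    """Add a linear combination of the difference vectors to a neutral geometry."""
    return neutral_face + basis @ coeffs


rng = np.random.default_rng(0)
neutral = rng.normal(size=(30, 15))
expressive = neutral + 0.1 * rng.normal(size=(30, 15))
basis, stds = fit_expression_basis(expressive, neutral, k=4)
coeffs = rng.normal(scale=stds)                      # random expression coefficients
new_face = apply_expression(neutral[:, 0], basis, coeffs)
```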
Since the expression model must be applied to neutral faces, we should define a model for neutral faces. In order to span only the space of neutral, expressionless faces, we suggest replacing all the geometries in our training set with their neutral counterparts. By following this course, the texture model still benefits from all the textures available to us, while our geometry model learns to predict only neutral models for any texture, with or without expression. This method can be applied to either the 3DMM based geometry model from subsection 7.2 or the GAN based geometry model described in section 8.
10. Experimental results
In order to demonstrate the ability of our model to generate new realistic identities, we perform several quantitative as well as qualitative experiments. As our first result, we generate several random textures and obtain their corresponding geometries according to subsection 7.2. We are able to vary the expression by applying a linear expression model as described in section 9. According to this model, each expression can be represented by a variation from the mean face which leads to a specific facial movement. By combining various movements of the face one can generate any expression desired. The faces are then rendered under various poses and lighting conditions. The rendered faces are depicted in Figure 13.
Our next qualitative experiment demonstrates the ability of our model to generate completely new textures by combining facial features from different training examples. To this end, we search for the nearest neighbor to the generated texture from within the training data. It can be seen in Figure 14 that the demonstrated examples have nearest neighbors that are significantly different from them and cannot be considered as the same identity. Within the following section we will analyze both the generative ability of our model to produce new faces as well as its realism and similarity to realistic examples. In addition, we also search for generated texture samples which are nearest to several validation set examples. By finding nearby textures we demonstrate the generalization capability of our model to unseen textures. This is demonstrated in Figure 15.
The previous qualitative assessment is complemented by a more in-depth examination of the nearest neighbors across generated faces. For the following experiments we have freedom to choose our distance metric. We aim to find a natural metric which coincides with human perception of faces. We therefore choose to render each generated and real face from a frontal perspective and process the rendered images via a pre-trained facial recognition network. Using a model based on (Amos et al., 2016), we extract the last feature from within the network as our facial descriptor. The distance is calculated as where are the descriptors corresponding to the first and second face respectively. By analyzing the distribution of such distances we can assess the spread of identities which exists within each dataset as well as the relation between different datasets.
We use the distribution of distances between generated faces and the training and validation sets of real faces in order to assess the quality of our generative model. In Figure 14 we plot the distribution of distances between generated samples and their nearest real training samples. The plot suggests that these distances follow a shifted Gaussian, which implies that on average new identities are located in between the real samples and not directly near them. Our analysis of the distances to the neighbors of the validation set, also depicted in Figure 14, shows that our model is able to produce identities of subjects similar to ones found in the validation set, which were not used to train the model. This validates our claim that our model is producing new identities by combining features from the training set faces, and that these identities are not too similar to the training set yet can still generalize to the unseen validation set.
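The nearest-neighbor distance statistics discussed here can be gathered with a few lines of NumPy once descriptors are available. In this sketch the descriptors are assumed to be precomputed and stored one per row, the function name is hypothetical, and the Euclidean distance between descriptors is an assumption.

```python
import numpy as np


def nearest_descriptor_distances(generated, reference):
    """For each generated descriptor, distance to and index of its nearest reference descriptor."""
    diffs = generated[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)   # (n_generated x n_reference) distance matrix
    return dists.min(axis=1), dists.argmin(axis=1)


generated = np.array([[0.0, 0.0], [10.0, 0.0]])
reference = np.array([[3.0, 4.0], [10.0, 1.0]])
nearest_dist, nearest_idx = nearest_descriptor_distances(generated, reference)
```

A histogram of `nearest_dist` computed against the training set and against the validation set gives the two distributions compared in the text.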
Following (Karras et al., 2017) we perform an analysis of sliced Wasserstein distance (Rabin et al., 2011) on our generated textures and geometries. By assessing the distance between the distribution of patches taken from textures generated by our model relative to patches taken from faces generated by 3DMM we can analyze the multi-resolution similarity between the generated and real examples. Table 3 and Table 3 show the SWD for each resolution relative to the training data. In both experiments it is clear that the SWD is lower for our model at every resolution indicating that at every level of detail the patches produced by our model are more similar to patches in the training data.
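For reference, a sliced Wasserstein distance between two equally sized sets of descriptors can be approximated by projecting both sets onto random unit directions and comparing the sorted projections. The sketch below shows only this core computation; the extraction of patches at each resolution is omitted, and the function name, the number of projections, and the fixed default seed are assumptions.

```python
import numpy as np


def sliced_wasserstein(a, b, n_projections=128, rng=None):
    """Approximate SWD between two (n x d) sample sets with the same number of rows."""
    rng = np.random.default_rng(0) if rng is None else rng
    directions = rng.normal(size=(a.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # Sorting each 1D projection gives the optimal 1D transport matching.
    proj_a = np.sort(a @ directions, axis=0)
    proj_b = np.sort(b @ directions, axis=0)
    return float(np.mean(np.abs(proj_a - proj_b)))


rng = np.random.default_rng(1)
patches = rng.normal(size=(64, 9))            # toy data: 64 flattened 3x3 patches
swd_same = sliced_wasserstein(patches, patches)
swd_near = sliced_wasserstein(patches, patches + 0.1)
swd_far = sliced_wasserstein(patches, patches + 5.0)
```

A lower value indicates that the two patch distributions are closer, which is how the per-resolution numbers in the tables are read.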
In addition to assessing the patch level feature resemblance, we wish to uncover the distances between the distributions of identities. To this end we conduct two more experiments which gauge the similarity between the distribution of generated identities and that of the real ones. In order to qualitatively assess these distributions we depict our identities using the common dimensionality reduction scheme t-SNE (Maaten and Hinton, 2008). Figure 16 depicts the low dimensional representation of the embedding proposed by our model and the 3DMM overlaid on top of the real data embedding. In addition, Figure 16 also depicts the clustering of different ethnic groups as well as gender as data points of different colors. By assigning each generated sample to its nearest cluster, we obtain an automatic annotation of our generated data. In addition, we perform a quantitative analysis of the difference between identity distributions using SWD. The results of this experiment are depicted in Table 3.
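One simple way to realize the nearest-cluster annotation is to represent each labeled group of real samples by its centroid in the embedding space and give every generated sample the label of the closest centroid. This is only a sketch: the centroid-based definition of a cluster, the function name, and the toy labels are assumptions.

```python
import numpy as np


def annotate_by_nearest_cluster(generated, real, real_labels):
    """Label each generated embedding with the label of the nearest real-cluster centroid."""
    labels = np.unique(real_labels)
    centroids = np.stack([real[real_labels == label].mean(axis=0) for label in labels])
    dists = np.linalg.norm(generated[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


real = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 9.0]])
real_labels = np.array(["a", "a", "b", "b"])
generated = np.array([[1.0, 0.0], [9.0, 11.0]])
annotations = annotate_by_nearest_cluster(generated, real, real_labels)
```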
In this paper we present a new model for generating high detail textures and corresponding geometries of human faces. Our claim is that an effective method for processing geometric surfaces via CNNs is to first align the geometric dataset to a template model, and then map each geometry to a 2D image using a predefined mapping. Once in image form, a GAN loss can be employed to train a generator model which aims to imitate the distribution of the training data images. We further show that by training a generator for both textures and geometries it is possible to synthesize high-detail textures and geometries which are linked to each other by a single canonical mapping.
In addition, we describe in subsection 7.2 several methods for fitting 3DMM geometries by learning the underlying relation between texture and geometry, a relation that has been largely neglected in previous work. In subsection 7.2 we also provide a quantitative and qualitative evaluation of each geometry reconstruction method. Our proposed face generation pipeline therefore consists of a high resolution texture generator combined with a geometry that was either produced by a similar geometric generation model or by employing a learning scheme which produces the most likely corresponding 3DMM coefficients.
Besides the main pipeline, we propose two extra data processing steps which improve sample quality. In subsection 5.3 we describe the design and construction of our canonical mapping. Our mapping is designed to reduce distortion in important high detail areas while spreading the flattening distortion to non-essential areas. It is also designed to take maximal advantage of the available area in each image. In subsection 5.3 we also show that our improved mapping, compared to (Slossberg et al., 2018), indeed preserves delicate texture details in our predefined high importance regions. In section 6 we also present a new technique for dealing with partially corrupted data. This is especially important when the data acquisition is expensive and prone to errors. By adding a corruption mask to the data at train time, the network is able to ignore the affected areas while still learning from the mostly unaffected ones. In the case of our dataset this increases the amount of usable data by roughly .
In order to evaluate our proposed model we performed a quantitative as well as qualitative analysis of several aspects of our model. Our main objective was to create a realistic model, a requirement which we break down into several factors. Our model should produce high quality plausible facial textures which look as much like the training data as possible, but also compose new faces not seen during training rather than repeat previously seen faces. To that end we use an efficient approximation of the Wasserstein distance between distributions in order to evaluate the local and global scale features of the produced textures and geometries as well as the distance between distributions of real and generated identities. Our results show that in both identity distribution and image feature resemblance we outperform the 3DMM model, which is the most widely used model to date.
Acknowledgements. This research was partially supported by the Israel Ministry of Science, grant number 3-14719 and the Technion Hiroshi Fujiwara Cyber Security Research Center and the Israel Cyber Bureau. We would also like to thank Intel RealSense group for sharing their data and computational resources with us.
- Amos et al. (2016) Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. 2016. OpenFace: A general-purpose face recognition library with mobile applications. Technical Report. CMU-CS-16-118, CMU School of Computer Science.
- Bas et al. (2017) Anil Bas, Patrik Huber, William AP Smith, Muhammad Awais, and Josef Kittler. 2017. 3d morphable models as spatial transformer networks. In Proc. ICCV Workshop on Geometry Meets Deep Learning. 904–912.
- Blanz and Vetter (1999) Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 187–194.
- Blender Online Community (2017) Blender Online Community. 2017. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam. http://www.blender.org.
- Booth et al. (2017) James Booth, Epameinondas Antonakos, Stylianos Ploumpis, George Trigeorgis, Yannis Panagakis, Stefanos Zafeiriou, et al. 2017. 3D Face Morphable Models “In-the-Wild”. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Booth et al. (2018) James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. 2018. Large scale 3d morphable models. International Journal of Computer Vision 126, 2-4 (2018), 233–254.
- Booth et al. (2016) James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. 2016. A 3d morphable model learnt from 10,000 faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5543–5552.
- Bronstein et al. (2017) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. 2017. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34, 4 (2017), 18–42.
- Chu et al. (2014) Baptiste Chu, Sami Romdhani, and Liming Chen. 2014. 3D-aided face recognition robust to expression and pose variations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1899–1906.
- Deng et al. (2018) Jiankang Deng, Shiyang Cheng, Niannan Xue, Yuxiang Zhou, and Stefanos Zafeiriou. 2018. UV-GAN: Adversarial Facial UV Map Completion for Pose-Invariant Face Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Floater (1997) Michael S Floater. 1997. Parametrization and smooth approximation of surface triangulations. Computer aided geometric design 14, 3 (1997), 231–250.
- Gecer et al. (2018) Baris Gecer, Binod Bhattarai, Josef Kittler, and Tae-Kyun Kim. 2018. Semi-supervised Adversarial Learning to Generate Photorealistic Face Images of New Identities from 3D Morphable Model. In The European Conference on Computer Vision (ECCV).
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
- Hotelling (1936) Harold Hotelling. 1936. Relations between two sets of variates. Biometrika 28, 3/4 (1936), 321–377.
- Huang et al. (2017) Rui Huang, Shu Zhang, Tianyu Li, and Ran He. 2017. Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis. In Proceedings of the IEEE International Conference on Computer Vision. 2439–2448.
- Jolliffe (1986) Ian T Jolliffe. 1986. Principal component analysis and factor analysis. In Principal component analysis. Springer, 115–128.
- Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations (ICLR) (2017).
- King (2009) Davis E. King. 2009. Dlib-ml: A Machine Learning Toolkit. Journal of Machine Learning Research 10 (2009), 1755–1758.
- Litany et al. (2018) Or Litany, Alex Bronstein, Michael Bronstein, and Ameesh Makadia. 2018. Deformable Shape Completion With Graph Convolutional Autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
- Masi et al. (2016) Iacopo Masi, Anh Tuấn Trần, Tal Hassner, Jatuporn Toy Leksut, and Gérard Medioni. 2016. Do we really need to collect millions of faces for effective face recognition?. In European Conference on Computer Vision. Springer, 579–596.
- Rabin et al. (2011) Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. 2011. Wasserstein barycenter and its application to texture mixing. In International Conference on Scale Space and Variational Methods in Computer Vision. Springer, 435–446.
- Ranjan et al. (2018) Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. 2018. Generating 3D Faces Using Convolutional Mesh Autoencoders. In European Conference on Computer Vision. Springer, 725–741.
- Richardson et al. (2016) Elad Richardson, Matan Sela, and Ron Kimmel. 2016. 3D face reconstruction by learning from synthetic data. In 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 460–469.
- Richardson et al. (2017) Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. 2017. Learning detailed face reconstruction from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 5553–5562.
- Riza Alp Güler (2016) Riza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. 2016. DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild. arXiv:1612.01202 (2016).
- Saito et al. (2017) Shunsuke Saito, Lingyu Wei, Liwen Hu, Koki Nagano, and Hao Li. 2017. Photorealistic facial texture inference using deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Vol. 3.
- Sela et al. (2017) Matan Sela, Elad Richardson, and Ron Kimmel. 2017. Unrestricted facial geometry reconstruction using image-to-image translation. In 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 1585–1594.
- Shamai et al. (2018) Gil Shamai, Michael Zibulevsky, and Ron Kimmel. 2018. Efficient Inter-Geodesic Distance Computation and Fast Classical Scaling. IEEE transactions on pattern analysis and machine intelligence (2018).
- Shrivastava et al. (2017) Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russ Webb. 2017. Learning from simulated and unsupervised images through adversarial training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 3. 6.
- Slossberg et al. (2018) Ron Slossberg, Gil Shamai, and Ron Kimmel. 2018. High Quality Facial Surface and Texture Synthesis via Generative Adversarial Networks. arXiv preprint arXiv:1808.08281 (2018).
- Tewari et al. (2018) Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. 2018. Self-Supervised Multi-Level Face Model Learning for Monocular Reconstruction at Over 250 Hz. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Tran et al. (2017a) Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gérard Medioni. 2017a. Regressing robust and discriminative 3D morphable models with a very deep neural network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1493–1502.
- Tran and Liu (2018) Luan Tran and Xiaoming Liu. 2018. Nonlinear 3D Face Morphable Model. arXiv preprint arXiv:1804.03786 (2018).
- Tran et al. (2017b) Luan Tran, Xi Yin, and Xiaoming Liu. 2017b. Disentangled Representation Learning GAN for Pose-Invariant Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1415–1424.
- Weise et al. (2009) Thibaut Weise, Hao Li, Luc Van Gool, and Mark Pauly. 2009. Face/off: Live facial puppetry. In Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer animation. ACM, 7–16.
- Wu et al. (2016) Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems. 82–90.