The reconstruction of 3D face geometry and texture is one of the most popular and well-studied fields at the intersection of computer vision, graphics and machine learning. Apart from its countless applications, it demonstrates the power of recent developments in scanning, learning and synthesizing 3D objects [blanz1999morphable, zhou2019dense]. Recently, mainly due to the advent of deep learning, tremendous progress has been made in the reconstruction of smooth 3D face geometry, even from images captured in arbitrary recording conditions (also referred to as “in-the-wild”) [gecer2019synthesizing, gecer2019ganfit, sela2017unrestricted, tewari2018self, tran2019learning]. Nevertheless, even though the geometry can be inferred somewhat accurately, rendering a reconstructed face in arbitrary virtual environments requires much more information than a smooth 3D geometry, namely skin reflectance as well as high-frequency normals. In this paper, we propose a meticulously designed pipeline for the reconstruction of high-resolution render-ready faces from “in-the-wild” images captured in arbitrary poses, lighting conditions and occlusions. A result from our pipeline is showcased in Fig. 1.
The seminal work in the field is the 3D Morphable Model (3DMM) fitting algorithm [blanz1999morphable]. The facial texture and shape reconstructed by the 3DMM algorithm always lie in a space spanned by a linear basis learned by Principal Component Analysis (PCA). The linear basis, even though remarkable in representing the basic characteristics of the reconstructed face, fails to reconstruct high-frequency details in texture and geometry. Furthermore, the PCA model fails to represent the complex structure of facial texture captured “in-the-wild”. Therefore, 3DMM fitting usually fails on “in-the-wild” images. Recently, 3DMM fitting has been extended to use a PCA model on robust features, i.e., Histograms of Oriented Gradients (HoGs) [dalal2005histograms], for representing facial texture [booth20173d]. The method has shown remarkable results in reconstructing the 3D facial geometry from “in-the-wild” images. Nevertheless, it cannot reconstruct facial texture as accurately.
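To make the limitation of a linear basis concrete, the following is a minimal sketch of how a PCA model represents a sample: it is projected onto a few learned directions and mapped back, so any detail outside the span of the basis is lost. The data here are synthetic stand-ins for flattened shapes or textures, and the function names are hypothetical; this is not the fitting procedure of [blanz1999morphable].

```python
import numpy as np

def fit_pca_basis(samples, n_components):
    """Learn a mean and an orthonormal linear basis from row-vector samples."""
    mean = samples.mean(axis=0)
    # SVD of the centred data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_components]

def project_and_reconstruct(x, mean, basis):
    """Project x onto the basis, then map the coefficients back."""
    coeffs = basis @ (x - mean)
    return mean + basis.T @ coeffs, coeffs

rng = np.random.default_rng(0)
# Synthetic data: low-rank structure plus a small amount of fine detail.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 60))
samples = latent @ mixing + 0.01 * rng.normal(size=(200, 60))

mean, basis = fit_pca_basis(samples, n_components=5)
x = samples[0]
recon, coeffs = project_and_reconstruct(x, mean, basis)
# The residual is the part of x that the linear basis cannot express.
residual = np.linalg.norm(x - recon)
```

The residual plays the role of the high-frequency detail that a low-dimensional linear model discards.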
With the advent of deep learning, many regression methods using an encoder-decoder structure have been proposed to infer 3d geometry, reflectance and illumination [chen2019photo, gecer2019ganfit, sela2017unrestricted, shu2017neural, tewari2018self, tran2019learning, wang2019adversarial, zhou2019dense]. Some of the methods demonstrate that it is possible to reconstruct shape and texture, even in real-time on a cpu [zhou2019dense]. Nevertheless, due to various factors, such as the use of basic reflectance models (e.g., the Lambertian reflectance model), the use of synthetic data or mesh-convolutions on colored meshes, the methods [sela2017unrestricted, shu2017neural, tewari2018self, tran2019learning, wang2019adversarial, zhou2019dense] fail to reconstruct highly-detailed texture and shape that is render-ready. Furthermore, in many of the above methods the reconstructed texture and shape lose many of the identity characteristics of the original image.
Arguably, the first generic method that demonstrated that it is possible to reconstruct high-quality texture and shape from single “in-the-wild” images is the recently proposed GANFIT method [gecer2019ganfit]. GANFIT can be described as an extension of the original 3DMM fitting strategy with the following differences: (a) instead of a PCA texture model, it uses a Generative Adversarial Network (GAN) [karras2017progressive] trained on large-scale high-resolution UV maps, and (b) in order to preserve the identity in the reconstructed texture and shape, it uses features from a state-of-the-art face recognition network [deng2019arcface]. However, the reconstructed texture and shape are not render-ready because (a) the texture contains baked illumination, and (b) the method cannot reconstruct high-frequency normals or specular reflectance.
Early attempts to infer photorealistic render-ready information from single “in-the-wild” images have been made in the line of research of [chen2019photo, huynh2018mesoscopic, saito2017photorealistic, yamaguchi2018high]. Arguably, some of the results showcased in the above papers are of high quality. Nevertheless, the methods do not generalize since: (a) they directly manipulate and augment the low-quality and potentially occluded input facial texture instead of reconstructing it, so the quality of the final reconstruction always depends on the input image, (b) the employed 3D model is not very representative, and (c) only a very small number of subjects (e.g., [yamaguchi2018high]) were available for training the high-frequency details of the face. Thus, while closest to our work, these approaches focus on easily creating a digital avatar rather than on high-quality render-ready face reconstruction from “in-the-wild” images, which is the goal of our work.
In this paper, we propose the first, to the best of our knowledge, methodology that produces high-quality render-ready face reconstructions from arbitrary images. In particular, our method builds upon recent reconstruction methods (e.g., GANFIT [gecer2019ganfit]) and, contrary to [chen2019photo, yamaguchi2018high], does not apply algorithms for high-frequency estimation to the original input, which could be of very low quality, but to a GAN-generated high-quality texture. Using a light stage, we have collected a large-scale dataset with samples of the reflectance and geometry of over 200 subjects, and we train image translation networks that can perform estimation of (a) diffuse and specular albedo, and (b) diffuse and specular normals. We demonstrate that it is possible to produce render-ready faces from arbitrary face images (pose, occlusion, etc.), including portraits and face sketches, which can be realistically relit in any environment.
2 Related Work
2.1 Facial Geometry and Reflectance Capture
Debevec et al. [Debevec:2000] first proposed employing a specialized light stage setup to acquire a reflectance field of a human face for photo-realistic image-based relighting applications. They also employed the acquired data to estimate a few view-dependent reflectance maps for rendering. Weyrich et al. [Weyrich:2006] employed an LED sphere and 16 cameras to densely record facial reflectance and computed view-independent estimates of facial reflectance from the acquired data including per-pixel diffuse and specular albedos, and per-region specular roughness parameters. These initial works employed dense capture of facial reflectance which is somewhat cumbersome and impractical.
Ma et al. [ma2007rapid] introduced polarized spherical gradient illumination (using an LED sphere) for efficient acquisition of separated diffuse and specular albedos and photometric normals of a face using just eight photographs, and demonstrated high-quality facial geometry, including skin mesostructure, as well as realistic rendering with the acquired data. The method was, however, restricted to a frontal viewpoint of acquisition due to its use of a view-dependent polarization pattern on the LED sphere. Subsequently, Ghosh et al. [ghosh2011multiview] extended polarized spherical gradient illumination to multi-view facial acquisition by employing two orthogonal spherical polarization patterns. Their method allows capture of separated diffuse and specular reflectance and photometric normals from any viewpoint around the equator of the LED sphere and can be considered the state of the art in terms of high-quality facial capture.
Recently, Kampouris et al. [kampouris2018diffuse] demonstrated how to employ unpolarized binary spherical gradient illumination for estimating separated diffuse and specular albedo and photometric normals using color-space analysis. The method has the advantage of not requiring polarization and hence requires half the number of photographs compared to polarized spherical gradients and enables completely view-independent reflectance separation, making it faster and more robust for high quality facial capture [lattas2019multi].
Passive multiview facial capture has also made significant progress in recent years, from high quality facial geometry capture [Beeler:2010] to even detailed facial appearance estimation [Gotardo:2018]. However, the quality of the acquired data with such passive capture methods is somewhat lower compared to active illumination techniques.
In this work, we employ two state-of-the-art active illumination based multiview facial capture methods [ghosh2011multiview, lattas2019multi] for acquiring high quality facial reflectance data in order to build our training data.
2.2 Image-to-Image Translation
Image-to-image translation refers to the task of translating an input image to a designated target domain (e.g., turning sketches into images, or day into night scenes). With the introduction of gans [goodfellow2014generative], image-to-image translation improved dramatically [isola2017image, zhu2017unpaired]. Recently, with the increasing capabilities in the hardware, image-to-image translation has also been successfully attempted in high-resolution data [wang2018high]. In this work we utilize variations of pix2pixHD [wang2018high] to carry out tasks such as de-lighting and the extraction of reflectance maps in very high-resolution.
2.3 Facial Geometry Estimation
Over the years, numerous methods have been introduced in the literature that tackle the problem of 3D facial reconstruction from a single input image. Early methods required a statistical 3DMM both for shape and appearance, usually encoded in a low-dimensional space constructed by PCA [blanz1999morphable, booth20173d]. Lately, many approaches have tried to leverage the power of Convolutional Neural Networks (CNNs) to either regress the latent parameters of a PCA model [tuan2017regressing, cole2017synthesizing] or utilize a 3DMM to synthesize images and formulate an image-to-image translation problem using CNNs [guo2018cnn, richardson20163d].
2.4 Photorealistic 3D faces with Deep Learning
Many approaches have been successful in acquiring the reflectance of materials from a single image, using deep networks with an encoder-decoder architecture [deschaintre2018single, li2017modeling, li2018materials]. However, they only explore 2d surfaces and in a constrained environment, usually assuming a single point-light source.
Early applications on human faces [sengupta2018sfsnet, shu2017neural] used image translation networks to infer facial reflectance from an “in-the-wild” image, producing low-resolution results. Recent approaches attempt to incorporate additional facial normal and displacement mappings, resulting in representations with high-frequency details [chen2019photo]. Although this method demonstrates impressive results in geometry inference, it tends to fail in conditions with harsh illumination and extreme head poses, and does not produce re-lightable results. Saito et al. [saito2017photorealistic] proposed a deep learning approach for data-driven inference of a high-resolution facial texture map of an entire face for realistic rendering, using as input a single low-resolution face image with partial facial coverage. This has been extended to inference of facial mesostructure, given a diffuse albedo [huynh2018mesoscopic], and even complete facial reflectance and displacement maps besides albedo texture, given a partial facial image as input [yamaguchi2018high]. While closest to our work, these approaches achieve the creation of digital avatars, rather than high-quality facial appearance estimation from “in-the-wild” images. In this work, we try to overcome these limitations by employing an iterative optimization framework as proposed in [gecer2019ganfit]. This optimization strategy incorporates a deep face recognition network and GANs into a conventional fitting method in order to estimate high-quality geometry and texture with fine identity characteristics, which can then be used to produce high-quality reflectance maps.
3 Training Data
3.1 Ground Truth Acquisition
We employ the state-of-the-art method of [ghosh2011multiview] for capturing high-resolution pore-level reflectance maps of faces using a polarized LED sphere with 168 lights (partitioned into two polarization banks) and 9 DSLR cameras. Half the LEDs on the sphere are vertically polarized (for parallel polarization), and the other half are horizontally polarized (for cross-polarization) in an interleaved pattern.
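As a rough illustration of why two polarization banks are useful, the sketch below applies the common idealization of polarization-difference imaging: the cross-polarized image is assumed to contain only diffuse reflection, while the parallel-polarized image contains diffuse plus specular reflection. The function name is hypothetical and the model ignores calibration, view dependence and noise; it is not the full procedure of [ghosh2011multiview].

```python
import numpy as np

def separate_diffuse_specular(parallel, cross):
    """Idealized polarization-difference separation.

    Assumes the cross-polarized image holds only the diffuse component and
    the parallel-polarized image holds diffuse + specular with the same
    diffuse intensity.
    """
    parallel = np.asarray(parallel, dtype=np.float64)
    cross = np.asarray(cross, dtype=np.float64)
    diffuse = cross
    # Clamp at zero: noise can make the difference slightly negative.
    specular = np.clip(parallel - cross, 0.0, None)
    return diffuse, specular
```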
Using the LED sphere, we can also employ the color-space analysis from unpolarized LEDs [kampouris2018diffuse] for diffuse-specular separation and the multi-view facial capture method of [lattas2019multi] to acquire unwrapped textures of similar quality (Fig. 3). This method requires less than half of the captured data (hence reduced capture time) and a simpler setup (no polarizers), enabling the acquisition of larger datasets.
3.2 Data Collection
In this work, we capture faces of over 200 individuals of different ages and characteristics under 7 different expressions. The geometry reconstructions are registered to a standard topology, as in [booth20163d], with unwrapped textures as shown in Fig. 3. We name the dataset RealFaceDB. It is currently the largest dataset of this type and we intend to make it publicly available to the scientific community (for the dataset and other materials, we refer the reader to the project’s page: https://github.com/lattas/avatarme).
To achieve photorealistic rendering of human skin, we separately model the diffuse and specular albedo and normals of the desired geometry. Therefore, given a single unconstrained face image as input, we infer the facial geometry as well as the diffuse albedo ($\mathbf{A}_D$), diffuse normals ($\mathbf{N}_D$), specular albedo ($\mathbf{A}_S$), and specular normals ($\mathbf{N}_S$). (The diffuse normals $\mathbf{N}_D$ are not usually used in commercial rendering systems. By inferring $\mathbf{N}_D$ we can model the reflection as in the state-of-the-art specular-diffuse separation techniques [ghosh2011multiview, lattas2019multi].)
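To illustrate how the four maps can enter a shading computation, the sketch below evaluates a Lambertian term with the diffuse albedo and diffuse normals, and a Blinn-Phong term with the specular albedo and specular normals. The Blinn-Phong lobe, the `shininess` value and the function names are assumptions chosen for brevity; the text does not prescribe this particular reflectance model.

```python
import numpy as np

def normalize(v):
    """Scale vectors along the last axis to unit length."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def shade(a_d, a_s, n_d, n_s, light_dir, view_dir, shininess=32.0):
    """Shade texels with separate diffuse and specular normals.

    a_d: (H, W, 3) diffuse albedo, a_s: (H, W) specular intensity,
    n_d, n_s: (H, W, 3) diffuse and specular normals.
    """
    l = normalize(light_dir)
    v = normalize(view_dir)
    h = normalize(l + v)  # Blinn-Phong half vector
    n_d, n_s = normalize(n_d), normalize(n_s)
    diffuse = a_d * np.clip(n_d @ l, 0.0, None)[..., None]
    lobe = np.clip(n_s @ h, 0.0, None) ** shininess
    specular = (a_s * lobe)[..., None]
    return diffuse + specular
```

Using a sharper normal map for the specular term than for the diffuse term is what lets fine skin detail appear in highlights while the diffuse shading stays soft.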
As seen in Fig. 2, we first reconstruct a 3D face (base geometry with texture) from a single image at a low resolution using an existing 3DMM algorithm [booth20163d]. Then, the reconstructed texture map, which contains baked illumination, is enhanced by a super-resolution network, followed by a de-lighting network to obtain a high-resolution diffuse albedo. Finally, we infer the other three components (specular albedo, diffuse normals and specular normals) from the diffuse albedo in conjunction with the base geometry. The following sections explain these steps in detail.
4.1 Initial Geometry and Texture Estimation
Our method requires a low-resolution 3D reconstruction of a given face image $\mathbf{I}$. Therefore, we begin with the estimation of the facial shape $\mathbf{S}$ with $n$ vertices and texture $\mathbf{T}$ by borrowing any state-of-the-art 3D face reconstruction approach (we use GANFIT [gecer2019ganfit]). Apart from the usage of deep identity features, GANFIT synthesizes realistic texture UV maps using a GAN as a statistical representation of the facial texture. We reconstruct the initial base shape and texture of the input image as follows and refer the reader to [gecer2019ganfit] for further details:
$$\mathbf{S}, \mathbf{T} = \mathcal{G}(\mathbf{I})$$
where $\mathcal{G}$ denotes the GANFIT reconstruction method for an arbitrary sized image, and $n$ the number of vertices on a fixed topology.
Having acquired the prerequisites, we procedurally improve on them: from the reconstructed geometry, we acquire the shape normals and enhance the facial texture resolution, before using them to estimate the components for physically based rendering, such as the diffuse and specular albedo and normals.
Although the texture from GANFIT [gecer2019ganfit] has reasonably good quality, it is below par compared to artist-made render-ready 3D faces. To remedy that, we employ a state-of-the-art super-resolution network, RCAN [zhang2018image], to increase the resolution of the UV maps, which are then retopologized and up-sampled further. Specifically, we train a super-resolution network $\zeta$ on texture patches of the acquired low-resolution texture. At test time, the whole texture $\mathbf{T}$ from GANFIT is upscaled to a high-resolution texture $\hat{\mathbf{T}}$ by the following:
$$\hat{\mathbf{T}} = \zeta(\mathbf{T})$$
4.3 Diffuse Albedo Extraction by De-lighting
A significant issue with the textures produced by 3DMMs is that the models are trained on data with baked illumination (i.e., reflections and shadows), which they reproduce. GANFIT-produced textures contain sharp highlights and shadows, made by strong point-light sources, as well as baked environment illumination, which prevent photorealistic rendering. In order to alleviate this problem, we first model the illumination conditions of the dataset used in [gecer2019ganfit] and then synthesize UV maps with the same illumination in order to train an image-to-image translation network from texture with baked illumination to unlit diffuse albedo $\mathbf{A}_D$. Further details are explained in the following sections.
4.3.1 Simulating Baked Illumination
Firstly, we acquire random texture and mesh outputs from GANFIT. Using a cornea model [nishino2004eyes], we estimate the average direction of the 3 apparent point light sources used, with respect to the subject, and an environment map for the textures. The environment map produces a good estimation of the environment illumination of GANFIT’s data, while the 3 light sources help to simulate the highlights and shadows. Thus, we render our acquired 200 subjects (Section 3) as if they were samples from the dataset used in the training of [gecer2019ganfit], while also having accurate ground truth of their albedo and normals. We compute a physically-based rendering for each subject from all viewpoints, using the predicted environment map and the predicted light sources with a random variation of their position, creating an illuminated texture map. We denote this whole simulation process by $\xi$, which translates the diffuse albedo $\mathbf{A}_D$ to the distribution of the textures with baked illumination, $\mathbf{A}_D^{T}$, as shown in the following:
$$\mathbf{A}_D^{T} = \xi(\mathbf{A}_D)$$
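A heavily simplified sketch of this kind of simulation is given below, only to show how paired training data (lit texture, albedo) can be synthesized. It assumes purely Lambertian shading in UV space, replaces the estimated environment map with a constant ambient term, and ignores cast shadows and specular highlights; the function names, the `ambient` and `jitter` values and the light directions are all hypothetical.

```python
import numpy as np

def normalize(v):
    """Scale vectors along the last axis to unit length."""
    v = np.asarray(v, dtype=np.float64)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def bake_illumination(albedo, normals, light_dirs, ambient=0.3,
                      jitter=0.1, rng=None):
    """Return an albedo map with simple shading baked in.

    albedo: (H, W, 3) in [0, 1]; normals: (H, W, 3);
    light_dirs: (K, 3) average light directions, randomly perturbed.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = normalize(normals)
    light_dirs = np.asarray(light_dirs, dtype=np.float64)
    # Constant ambient term as a stand-in for the environment map.
    shading = np.full(albedo.shape[:2], ambient, dtype=np.float64)
    for d in light_dirs:
        d = normalize(d + jitter * rng.normal(size=3))  # random variation
        shading += np.clip(n @ d, 0.0, None) / len(light_dirs)
    return np.clip(albedo * shading[..., None], 0.0, 1.0)
```

Each call with a different random perturbation yields a new lit texture for the same ground-truth albedo, which is the pairing an image-to-image de-lighting network needs.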
4.3.2 Training the De-lighting Network
Given the simulated illumination as explained in Sec. 4.3.1, we now have access to a version of RealFaceDB with [gecer2019ganfit]-like illumination, $\mathbf{A}_D^{T}$, and with the corresponding diffuse albedo $\mathbf{A}_D$. We formulate de-lighting as a domain adaptation problem and train an image-to-image translation network. To do this, we follow two strategies that differ from the standard image translation approaches.
Firstly, we find that the occlusion of illumination on the skin surface is geometry-dependent, and thus the resulting albedo improves in quality when feeding the network with both the texture and the geometry of the 3DMM. To do so, we normalize the texture channels to a fixed range and concatenate them with the depth of the mesh in object space ($\mathbf{D}_O$), normalized to the same range. The depth $\mathbf{D}_O$ is taken from the depth coordinate of the vertices of the acquired and aligned geometries, stored as a UV map. We feed the network with the resulting 4-channel tensor and predict the 3-channel albedo $\mathbf{A}_D$. Alternatively, we can also use as input the texture concatenated with the normals in object space ($\mathbf{N}_O$). We found that feeding the network only with the texture map causes artifacts in the inference. Secondly, we split the original high-resolution data into overlapping patches in order to augment the number of data samples and avoid overfitting.
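The input construction described above can be sketched as follows. The target range of [-1, 1], the 8-bit texture input and the min-max normalization of the depth map are assumptions made for this example, and `build_network_input` is a hypothetical helper rather than part of any released code.

```python
import numpy as np

def build_network_input(texture_u8, depth):
    """Stack a normalized RGB texture and a normalized depth map.

    texture_u8: (H, W, 3) uint8 UV texture.
    depth: (H, W) object-space depth stored in the same UV layout
           (must not be constant, otherwise the range is degenerate).
    Returns an (H, W, 4) float32 array with all channels in [-1, 1].
    """
    tex = texture_u8.astype(np.float32) / 127.5 - 1.0
    d = depth.astype(np.float32)
    d = 2.0 * (d - d.min()) / (d.max() - d.min()) - 1.0
    # Channel-wise concatenation: 3 texture channels + 1 depth channel.
    return np.concatenate([tex, d[..., None]], axis=-1)
```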
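In order to remove the existing illumination from the super-resolved texture $\hat{\mathbf{T}}$, we train an image-to-image translation network $\delta$ on patches and then extract the diffuse albedo $\mathbf{A}_D$ by the following:
$$\mathbf{A}_D = \delta(\hat{\mathbf{T}}, \mathbf{D}_O)$$
where $\mathbf{D}_O$ denotes the depth of the mesh in object space.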
4.4 Specular Albedo Extraction
Predicting the entire specular BRDF and the per-pixel specular roughness from the illuminated texture or the inferred diffuse albedo poses an unnecessary challenge. As shown in [ghosh2011multiview, kampouris2018diffuse], a subject can be realistically rendered using only the intensity of the specular reflection (the specular albedo $\mathbf{A}_S$), which is consistent on a face due to the skin’s refractive index. The spatial variation is correlated with facial skin structures such as skin pores, wrinkles or hair, which act as reflection occlusions reducing the specular intensity.
In principle, the specular albedo can also be computed from the texture with the baked illumination, since the texture includes baked specular reflection. However, we empirically found that the specular component is strongly biased due to the environment illumination and occlusion. Having computed a high-quality diffuse albedo $\mathbf{A}_D$ in the previous step, we infer the specular albedo $\mathbf{A}_S$ with a similar patch-based image-to-image translation network $\psi$, applied to the diffuse albedo and trained on RealFaceDB:
$$\mathbf{A}_S = \psi(\mathbf{A}_D)$$
4.5 Specular Normals Extraction
The specular normals exhibit sharp surface details, such as fine wrinkles and skin pores, and are challenging to estimate, as the appearance of some high-frequency details is dependent on the lighting conditions and viewpoint of the texture. Previous works fail to predict high-frequency details [chen2019photo], or rely on separating the mid- and high-frequency information in two separate maps, as a generator network may discard the high-frequency as noise [yamaguchi2018high]. Instead, we show that it is possible to employ an image-to-image translation network with feature matching loss on a large high-resolution training dataset, which produces more detailed and accurate results.
Similarly to the process for the specular albedo, we prefer the diffuse albedo over the reconstructed texture map, as the latter includes sharp highlights that get wrongly interpreted as facial features by the network. Moreover, we found that even though the diffuse albedo is stripped of specular reflection, it contains the facial skin structures that define mid- and high-frequency details, such as pores and wrinkles. Finally, since the facial features are similarly distributed across the color channels, we found that instead of the diffuse albedo $\mathbf{A}_D$, we can use the luma-transformed (in sRGB) grayscale diffuse albedo ($\mathbf{A}_D^{\mathrm{gray}}$).
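A luma transform of this kind can be sketched as a weighted sum of the color channels. The weights below are the Rec. 709 luma coefficients, which correspond to the sRGB primaries; whether these exact weights were used is an assumption, and `luma_grayscale` is a hypothetical helper.

```python
import numpy as np

# Rec. 709 luma coefficients (assumed here; they match the sRGB primaries).
REC709_LUMA = np.array([0.2126, 0.7152, 0.0722])

def luma_grayscale(rgb):
    """Collapse an (..., 3) RGB array to a single luma channel."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ REC709_LUMA
```

The resulting single channel can then be concatenated with a 3-channel normal map, keeping the network input at four channels.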
Again, we found that the network successfully generates both the mid- and high-frequency details when it receives as input the detailed diffuse albedo together with the lower-resolution geometry information (in this case, the shape normals). Moreover, the resulting high-frequency details are more accentuated when using normals in tangent space ($\mathbf{N}_T$), which also serve as a better output, since most commercial applications require the normals in tangent space.
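We train a translation network $\rho$ to map the concatenation of the grayscale diffuse albedo $\mathbf{A}_D^{\mathrm{gray}}$ and the shape normals in tangent space $\mathbf{N}_T$ to the specular normals $\mathbf{N}_S$. The specular normals are extracted by the following:
$$\mathbf{N}_S = \rho(\mathbf{A}_D^{\mathrm{gray}}, \mathbf{N}_T)$$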
4.6 Diffuse Normals Extraction
The diffuse normals are highly correlated with the shape normals, as diffusion is scattered uniformly across the skin. However, scars and wrinkles alter the distribution of the diffusion, and some non-skin features such as hair do not exhibit significant diffusion.
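Similarly to the previous section, we train a network $\sigma$ to map the concatenation of the grayscale diffuse albedo $\mathbf{A}_D^{\mathrm{gray}}$ and the shape normals in object space $\mathbf{N}_O$ to the diffuse normals $\mathbf{N}_D$. The diffuse normals are extracted as:
$$\mathbf{N}_D = \sigma(\mathbf{A}_D^{\mathrm{gray}}, \mathbf{N}_O)$$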
Finally, the inferred normals can be used to enhance the reconstructed geometry, by refining its features and adding plausible details. We integrate over the specular normals in tangent space and produce a displacement map which can then be embossed on a subdivided base geometry.
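One standard way to integrate a normal map into a height (displacement) map is the Frankot-Chellappa method, sketched below. The text does not state which integration scheme is used, so this choice is an assumption; the sketch also assumes periodic boundaries and the convention that a tangent-space normal is proportional to (-dz/dx, -dz/dy, 1), which varies between tools.

```python
import numpy as np

def normals_to_gradients(normals):
    """Convert (H, W, 3) normals to surface slopes dz/dx and dz/dy."""
    nz = np.clip(normals[..., 2], 1e-6, None)  # avoid division by zero
    return -normals[..., 0] / nz, -normals[..., 1] / nz

def integrate_gradients(p, q):
    """Frankot-Chellappa integration of slopes p = dz/dx, q = dz/dy.

    Returns the zero-mean height map whose gradients best match (p, q)
    in the least-squares sense, assuming periodic boundaries.
    """
    h, w = p.shape
    u, v = np.meshgrid(2 * np.pi * np.fft.fftfreq(w),
                       2 * np.pi * np.fft.fftfreq(h))
    denom = u ** 2 + v ** 2
    denom[0, 0] = 1.0  # the mean height is undetermined; handled below
    z_hat = (-1j * u * np.fft.fft2(p) - 1j * v * np.fft.fft2(q)) / denom
    z_hat[0, 0] = 0.0  # fix the mean height to zero
    return np.real(np.fft.ifft2(z_hat))
```

The returned height map is defined up to scale by the pixel spacing, so in practice it would be multiplied by a displacement strength before being applied to the subdivided mesh.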
5.1 Implementation Details
5.1.1 Patch-Based Image-to-image translation
The tasks of de-lighting, as well as inferring the diffuse and specular components from a given input image (uv) can be formulated as domain adaptation problems. As a result, to carry out the aforementioned tasks the model of our choice is pix2pixHD [wang2018high], which has shown impressive results in image-to-image translation on high-resolution data.
Nevertheless, as discussed previously: (a) our captured data are of very high resolution (more than 4K) and thus cannot be used for training “as-is” with pix2pixHD due to hardware limitations (note that even a 32 GB GPU cannot fit such high-resolution data in their original format), and (b) pix2pixHD [wang2018high] takes into account only the texture information, and thus geometric details, in the form of the shape normals and depth, cannot be exploited to improve the quality of the generated diffuse and specular components.
To alleviate the aforementioned shortcomings, we: (a) split the original high-resolution data into smaller, fixed-size patches. More specifically, using a fixed stride, we derive partially overlapping patches by passing through each original UV map horizontally as well as vertically, (b) for each translation task, we utilize the shape normals, concatenate them channel-wise with the corresponding grayscale texture input (e.g., in the case of translating the diffuse albedo to the specular normals, we concatenate the grayscale diffuse albedo with the shape normals channel-wise) and thus feed a 4-channel tensor to the network. This increases the level of detail in the derived outputs, as the shape normals act as a geometric “guide”. Note that during inference the patch size can be larger, since the network is fully convolutional.
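The sliding-window patch extraction can be sketched as below. The default `patch` and `stride` values are placeholders chosen for illustration, not values taken from the text, and patches that would extend past the image border are simply skipped.

```python
import numpy as np

def extract_patches(image, patch=512, stride=256):
    """Cut an (H, W, C) map into partially overlapping square patches.

    Windows are taken left-to-right, then top-to-bottom; with
    stride < patch, neighbouring patches overlap.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
    return np.stack(patches)

# Example on a tiny 4-channel map (e.g. grayscale albedo + shape normals).
demo = np.arange(8 * 12 * 4, dtype=np.float32).reshape(8, 12, 4)
demo_patches = extract_patches(demo, patch=4, stride=2)
```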
5.1.2 Training Setup
To train RCAN [zhang2018image], we use the default hyper-parameters. For the remaining translation models, we use a custom translation network as described earlier, which is based on pix2pixHD [wang2018high], with separate residual-block counts for the global and local generators, trained with the Adam optimizer. Moreover, we do not use the VGG feature matching loss, as this slightly deteriorated the performance. Finally, we use as inputs multi-channel tensors which include the shape normals or depth together with the RGB or grayscale values of the inputs. As mentioned earlier, this substantially improves the results by accentuating the details in the translated outputs.
[Figure caption] … is provided by the authors and [chen2019photo] from their open-sourced models. The last column is cropped to better show the details.
We conduct quantitative as well as qualitative comparisons against the state of the art. For the quantitative comparisons, we utilize the widely used PSNR metric [hore2010image], and report the results in Table 1. As can be seen, our method outperforms [chen2019photo] and [yamaguchi2018high] by a significant margin. Moreover, using a state-of-the-art face recognition algorithm [deng2019arcface], we also find the highest match of facial identity compared to the input images when using our method. The input images were compared against renderings of the faces with reconstructed geometry and reflectance, including eyes.
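For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below assumes images scaled to [0, 1] (so `max_value=1.0`); the exact evaluation protocol, such as masking or color space, is not specified here.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```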
For the qualitative comparisons, we perform 3d reconstructions of “in-the-wild” images. As shown in Figs. 8 and 9, our method does not produce any artifacts in the final renderings and successfully handles extreme poses and occlusions such as sunglasses. We infer the texture maps in a patch-based manner from high-resolution input, which produces higher-quality details than [chen2019photo, yamaguchi2018high], who train on high-quality scans but infer the maps for the whole face, in lower resolution. This is also apparent in Fig. 5, which shows our reconstruction after each step of our process. Moreover, we can successfully acquire each component from black-and-white images (Fig. 9) and even drawn portraits (Fig. 8).
Furthermore, we experiment with different environment conditions, in the input images and while rendering. As presented in Fig. 7, the extracted normals, diffuse and specular albedos are consistent, regardless of the illumination on the original input images. Finally, Fig. 6 shows different subjects rendered under different environments. We can realistically illuminate each subject in each scene and accurately reconstruct the environment reflectance, including detailed specular reflections and subsurface scattering.
In addition to the facial mesh, we are able to infer the entire head topology based on the Universal Head Model (UHM) [ploumpis2019towards, ploumpis2019combining]. We project our facial mesh to a subspace, regress the head latent parameters and finally derive the completed head model with completed textures. Some qualitative head completion results can be seen in Figs. 1 and 2.
While our dataset contains a relatively large number of subjects, it does not contain sufficient examples of subjects from certain ethnicities. Hence, our method currently does not perform as well when reconstructing the faces of, e.g., subjects with darker skin. Also, the reconstructed specular albedo and normals exhibit slight blurring of some high-frequency pore details due to minor alignment errors of the acquired data to the template 3DMM model. Finally, the accuracy of facial reconstruction is not completely independent of the quality of the input photograph, and well-lit, higher-resolution photographs produce more accurate results.
In this paper, we propose the first methodology that produces high-quality render-ready face reconstructions from arbitrary “in-the-wild” images. We build upon recently proposed 3D face reconstruction techniques and train image translation networks that can perform estimation of high-quality (a) diffuse and specular albedo, and (b) diffuse and specular normals. This is made possible by a large training dataset of 200 faces acquired with high-quality facial capture techniques. We demonstrate that it is possible to produce render-ready faces from arbitrary face images varying in pose, occlusions, etc., including black-and-white and drawn portraits. Our results exhibit an unprecedented level of detail and realism in the reconstructions, while preserving the identity of subjects in the input photographs.
AL was supported by EPSRC Project DEFORM (EP/S010203/1) and SM by an Imperial College FATA. AG acknowledges funding by the EPSRC Early Career Fellowship (EP/N006259/1) and SZ from a Google Faculty Fellowship and the EPSRC Fellowship DEFORM (EP/S010203/1).