1 Introduction
Until recently, the digitization of photorealistic faces has only been possible in professional studio settings, typically involving sophisticated appearance measurement devices [55, 34, 20, 4, 19] and carefully controlled lighting conditions. While such a complex acquisition process is acceptable for production purposes, the ability to build high-end 3D face models from a single unconstrained image could widely impact new forms of immersive communication, education, and consumer applications. With virtual and augmented reality becoming the next-generation platform for social interaction, compelling 3D avatars could be generated with minimal effort and puppeteered through facial performances [31, 39]. Within the context of cultural heritage, iconic and historical personalities could be restored to life in captivating 3D digital forms from archival photographs. For example:
can we use Computer Vision to bring back our favorite boxing legend, Muhammad Ali, and relive his greatest moments in 3D?

Capturing accurate and complete facial appearance properties from images in the wild is a fundamentally ill-posed problem. Often the input pictures have limited resolution, only a partial view of the subject is available, and the lighting conditions are unknown. Most state-of-the-art monocular facial capture frameworks [52, 45] rely on linear PCA models [6], so important appearance details for photorealistic rendering, such as complex skin pigmentation variations and mesoscopic-level texture details (freckles, pores, stubble hair, etc.), cannot be modeled. Despite recent efforts in hallucinating details using data-driven techniques [32, 37] and deep learning inference [13], it is still not possible to reconstruct high-resolution textures while preserving the likeness of the original subject and ensuring photorealism.

From a single unconstrained image (potentially low resolution), our goal is to infer a high-fidelity textured 3D model which can be rendered in any virtual environment. The high-resolution albedo texture map should match the resemblance of the subject while reproducing mesoscopic facial details. Without capturing advanced appearance properties (bump maps, specularity maps, BRDFs, etc.), we want to show that photorealistic renderings are possible using a reasonable shape estimation, a production-level rendering framework [49], and, most crucially, a high-fidelity albedo texture. The core challenge consists of developing a facial texture inference framework that can capture the immense appearance variations of faces and synthesize realistic high-resolution details, while maintaining fidelity to the target.
Inspired by the latest advancements in neural synthesis algorithms for style transfer [18, 17], we adopt a factorized representation of low-frequency and high-frequency albedo as illustrated in Figure 2. While the low-frequency map is simply represented by a linear PCA model (Section 3), we characterize high-frequency texture details for mesoscopic structures as mid-layer feature correlations of a deep convolutional neural network for general image recognition [48]. Partial feature correlations are first analyzed on an incomplete texture map extracted from the unconstrained input image. We then infer complete feature correlations using a convex combination of feature correlations obtained from a large database of high-resolution face textures [33] (Section 4). A high-resolution albedo texture map can then be synthesized by iteratively optimizing an initial low-frequency albedo texture to match these feature correlations via backpropagation and quasi-Newton optimization (Section 5). Our high-frequency detail representation with feature correlations captures high-level facial appearance information at multiple scales and ensures plausible mesoscopic-level structures in their corresponding regions. The blending technique with convex combinations in feature correlation space not only handles the large variation and non-linearity of facial appearances, but also generates high-resolution texture maps, which is not possible with existing end-to-end deep learning frameworks [13]. Furthermore, our method uses the publicly available, pre-trained deep convolutional neural network VGG-19 [48] and requires no further training. We make the following contributions:

We introduce an inference method that can generate high-resolution albedo texture maps with plausible mesoscopic details from a single unconstrained image.

We show that semantically plausible fine-scale details can be synthesized by blending high-resolution textures using convex combinations of feature correlations obtained from mid-layer deep neural network filters.

We introduce a new dataset of 3D face models with high-fidelity texture maps based on high-resolution photographs of the Chicago Face Database [33], which will be made publicly available to the research community.
2 Related Work
Facial Appearance Capture.
Specialized hardware for facial capture, such as the Light Stage, has been introduced by Debevec et al. [10] and improved over the years [34, 19, 20], with full spheres of LEDs and multiple cameras to measure an accurate reflectance field. Though restricted to studio environments, production-level relighting and appearance measurements (bump maps, specular maps, subsurface scattering, etc.) are possible. Weyrich et al. [55] adopted a similar system to develop a photorealistic skin reflectance model for statistical appearance analysis and mesoscale texture synthesis. A contact-based apparatus for patch-based microstructure-scale measurement using silicone mold material has been proposed by Haro et al. [23]. Optical acquisition methods have also been suggested to produce full-facial microstructure details [22] and skin microstructure deformations [38]. As an effort to make facial digitization more deployable, monocular systems [50, 24, 9, 47, 52] that record multiple views have recently been introduced to generate seamlessly integrated texture maps for virtual avatars. When only a single input image is available, Kemelmacher-Shlizerman and Basri [27] proposed a shape-from-shading framework that produces an albedo map using a Lambertian reflectance model. Barron and Malik [3] introduced a statistical approach to estimate shape, illumination, and reflectance from arbitrary objects. Li et al. [29] later presented an intrinsic image decomposition technique to separate diffuse and specular components for faces. For all these methods, only textures from the visible regions can be computed, and the resolution is limited by the input.
Linear Face Models.
Turk and Pentland [53] introduced the concept of Eigenfaces for face recognition and were among the first to represent facial appearances with linear models. In the context of facial tracking, Edwards et al. [14] developed the widely used active appearance models (AAMs) based on linear combinations of shape and appearance, which resulted in several important subsequent works [1, 12, 36]. The seminal work on morphable face models by Blanz and Vetter [6] put forward an analysis-by-synthesis framework for textured 3D face modeling and lighting estimation. Since their Principal Component Analysis (PCA)-based face model is built from a database of 3D face scans, a complete albedo texture map can be estimated robustly from a single image. Several extensions have been proposed leveraging internet images [26] and large-scale 3D facial scans [7]. PCA-based models are fundamentally limited by their linear assumption and fail to capture mesoscopic details as well as large variations in facial appearances (e.g., hair texture).

Texture Synthesis.
Non-parametric synthesis algorithms [16, 54, 15, 28] have been developed to synthesize repeating structures using samples from small patches while ensuring local consistency. These general techniques only work for stochastic textures such as micro-scale skin structures [23], but are not directly applicable to mesoscopic face details due to the lack of high-level visual cues about facial configurations. The super-resolution technique of Liu et al. [32] hallucinates high-frequency content using a local patch-based Markov network, but the results remain relatively blurry and cannot predict missing regions. Mohammed et al. [37] introduced a statistical framework for generating novel faces based on randomized patches. While the generated faces look realistic, noisy artifacts appear in high-resolution images. Facial detail enhancement techniques based on statistical models [21] have been introduced to synthesize pores and wrinkles, but have only been demonstrated in the geometric domain.

Deep Learning Inference.
Leveraging the vast learning capacity of deep neural networks and their ability to capture high-level representations, Duong et al. [13] introduced an inference framework based on Deep Boltzmann Machines that can handle the large variation and non-linearity of facial appearances effectively. A different approach consists of predicting non-visible regions based on context information. Pathak et al. [40] adopted an encoder-decoder architecture trained with a Generative Adversarial Network (GAN) for general inpainting tasks. However, due to fundamental limitations of existing end-to-end deep neural networks, only images with very small resolutions can be processed. Gatys et al. [18, 17] recently proposed a style-transfer technique using deep neural networks that can seamlessly blend the content of one high-resolution image with the style of another while preserving consistent structures of low- and high-level visual features. They describe style as mid-layer feature correlations of a convolutional neural network. We show in this work that these feature correlations are particularly effective in representing high-frequency, multi-scale appearance components, including mesoscopic facial details.

3 Initial Face Model Fitting
We begin with an initial joint estimation of facial shape and low-frequency albedo, and produce a complete texture map using a PCA-based morphable face model [6] (Figure 2). Given a single unconstrained input image, we compute the face shape, the albedo map, the rigid head pose, and the perspective transformation with the camera parameters. A partial high-frequency albedo map is then extracted from the visible area and represented in the UV space of the shape model. This partial high-frequency map is later used to extract feature correlations in the texture analysis stage (Section 4), and the complete low-frequency albedo map is used as initialization for the texture synthesis step (Section 5). Our initial PCA model fitting framework is built upon previous work [52]. Here we briefly describe the main ideas and highlight key differences.
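As a toy illustration of the linear texture model used in the fitting below, the following NumPy sketch builds a PCA-style albedo model and synthesizes a texture from coefficients. All shapes and names here are hypothetical stand-ins for the Basel Face Model bases, not the authors' implementation.

```python
import numpy as np

# Toy linear (PCA) morphable model: a texture is the mean plus a linear
# combination of orthonormal basis vectors, each scaled by its per-mode
# standard deviation. Dimensions are tiny and the basis is random; a real
# system would load the Basel Face Model instead.
rng = np.random.default_rng(0)
n_pixels, n_modes = 64, 8

mean_albedo = rng.random(n_pixels)                                  # flattened mean texture
basis = np.linalg.qr(rng.standard_normal((n_pixels, n_modes)))[0]   # orthonormal basis
sigma = np.linspace(1.0, 0.1, n_modes)                              # per-mode std deviations

def synthesize_albedo(alpha):
    """Low-frequency albedo for coefficients alpha (one per mode)."""
    return mean_albedo + basis @ (sigma * alpha)

albedo = synthesize_albedo(np.zeros(n_modes))  # zero coefficients -> the mean face
```

Model fitting then amounts to searching for the coefficients (together with pose and lighting) that best explain the input photograph.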
PCA Model Fitting.
The low-frequency facial albedo $\mathbf{T}$ and the shape $\mathbf{S}$ are represented as a multilinear PCA model with 53k vertices and 106k faces:

$$\mathbf{S} = \bar{\mathbf{S}} + A_{id}\,\alpha_{id} + A_{exp}\,\alpha_{exp}, \qquad \mathbf{T} = \bar{\mathbf{T}} + A_{al}\,\alpha_{al},$$

where the identity, expression, and albedo coefficients $\alpha_{id}$, $\alpha_{exp}$, and $\alpha_{al}$ are modeled as multivariate normal distributions with the corresponding bases $A_{id}$, $A_{exp}$, and $A_{al}$, the means $\bar{\mathbf{S}}$ and $\bar{\mathbf{T}}$, and the corresponding standard deviations $\sigma_{id}$, $\sigma_{exp}$, and $\sigma_{al}$. We assume Lambertian surface reflectance and model the illumination using second-order Spherical Harmonics (SH) [42], denoting the illumination $L$. We use the Basel Face Model dataset [41] for the identity and albedo bases with their means and standard deviations, and FaceWarehouse [8] for the expression basis, as provided by [56]. Following the implementation in [52], we compute all the unknowns with the following objective function:

$$E = w_c\,E_c + w_{lan}\,E_{lan} + w_{reg}\,E_{reg} \tag{1}$$
with energy term weights $w_c$, $w_{lan}$, and $w_{reg}$. The photo-consistency term $E_c$ minimizes the distance between the synthetic face and the input image, the landmark term $E_{lan}$ minimizes the distance between the facial features of the shape and the detected landmarks, and the regularization term $E_{reg}$ penalizes the deviation of the face from the normal distribution. We augment the $E_c$ term of [52] with a visibility component:

$$E_c = \frac{1}{|\mathcal{M}|} \sum_{p \in \mathcal{M}} \left\lVert C_{input}(p) - C_{synth}(p) \right\rVert_2,$$

where $C_{input}$ is the input image, $C_{synth}$ the synthesized image, and $\mathcal{M}$ the set of visible face pixels computed from a semantic facial segmentation estimated using the two-stream deconvolution network introduced by Saito et al. [46]. The segmentation mask ensures that the objective function is evaluated only on valid face pixels, for improved robustness in the presence of occlusions. The landmark fitting term and the regularization term are defined as:

$$E_{lan} = \frac{1}{|\mathcal{F}|} \sum_{f_i \in \mathcal{F}} \left\lVert f_i - \Pi(R\,v_i + t) \right\rVert_2^2, \qquad E_{reg} = \sum_i \left(\frac{\alpha_{id,i}}{\sigma_{id,i}}\right)^2 + \left(\frac{\alpha_{al,i}}{\sigma_{al,i}}\right)^2 + \left(\frac{\alpha_{exp,i}}{\sigma_{exp,i}}\right)^2,$$

where $f_i$ is a 2D facial feature obtained with the method of Kazemi et al. [25]. The objective function is optimized using a Gauss-Newton solver based on iteratively reweighted least squares with three levels of image pyramids (see [52] for details). In our experiments, the optimization converges within 30, 10, and 3 Gauss-Newton steps from the coarsest to the finest level.
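The iteratively reweighted least squares (IRLS) strategy used by the solver can be illustrated on a toy robust line-fitting problem. This is only a sketch of the solver family, with hypothetical names; the actual system optimizes the full objective of Equation 1 over shape, albedo, pose, and lighting unknowns.

```python
import numpy as np

# Toy IRLS: fit a line y = a*x + b robustly in the presence of a gross
# outlier. Each iteration solves a weighted least-squares problem whose
# weights downweight points with large residuals (an L1-like robustifier).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.5
y[10] += 25.0                                  # inject one gross outlier

A = np.stack([x, np.ones_like(x)], axis=1)
params = np.linalg.lstsq(A, y, rcond=None)[0]  # plain least-squares init (biased)
for _ in range(20):
    r = A @ params - y                         # residuals under current estimate
    w = 1.0 / np.maximum(np.abs(r), 1e-2)      # reweighting, floored for stability
    sw = np.sqrt(w)
    params = np.linalg.lstsq(sw[:, None] * A, sw * y, rcond=None)[0]

a_fit, b_fit = params                          # converges near a=2.0, b=0.5
```

The outlier receives a tiny weight after the first iteration, so the final fit essentially ignores it.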
Partial HighFrequency Albedo.
While our PCA-based albedo estimation provides a complete texture map, it only captures low frequencies. To enable the analysis of fine-scale skin details from the single-view image, we need to extract a partial high-frequency albedo map from the input. We factor out the shading component from the input RGB image by estimating the illumination, the surface normals, and an optimized partial face geometry using the method presented in [6, 44]. To extract the partial high-frequency albedo map from the visible face regions, we use an automatic facial segmentation technique [46].
4 Texture Analysis
As shown in Figure 3, we wish to extract multi-scale details from the high-frequency partial albedo map obtained in Section 3. These fine-scale details are represented by mid-layer feature correlations of a deep convolutional neural network, as explained in this section. We first extract partial feature correlations from the partially visible albedo map, then estimate the coefficients of a convex combination of partial feature correlations from a face database with high-resolution texture maps. We use these coefficients to evaluate feature correlations that correspond to convex combinations of full high-frequency texture maps. These complete feature correlations represent the target detail distribution for the texture synthesis step in Section 5. Note that all processing is applied to the intensity channel Y of the YIQ color space to preserve the overall color, as in [17].
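A minimal sketch of the color handling, assuming the standard NTSC RGB-to-YIQ matrix: processing would operate on the luma channel Y only, so the chroma channels (I, Q) of the low-frequency albedo are preserved.

```python
import numpy as np

# Standard NTSC RGB <-> YIQ conversion. Only the Y (luma) channel would be
# fed to the texture analysis/synthesis; I and Q carry the overall color.
RGB_TO_YIQ = np.array([[0.299,  0.587,  0.114],
                       [0.596, -0.274, -0.322],
                       [0.211, -0.523,  0.312]])

def rgb_to_yiq(img):
    """img: (..., 3) array in [0, 1] -> YIQ array of the same shape."""
    return img @ RGB_TO_YIQ.T

def yiq_to_rgb(img):
    return img @ np.linalg.inv(RGB_TO_YIQ).T

white = np.array([1.0, 1.0, 1.0])
y, i, q = rgb_to_yiq(white)   # pure white: full luma, (near-)zero chroma
```

The conversion is linear and invertible, so the synthesized Y channel can be recombined with the original chroma at the end.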
For an input image $I$, let $F^l(I) \in \mathbb{R}^{N_l \times M_l}$ be the filter response of $I$ on layer $l$, where $N_l$ is the number of channels/filters and $M_l$ is the number of pixels (width × height) of the feature map. The correlation of the local structures can be represented as the normalized Gramian matrix $G^l(I) \in \mathbb{R}^{N_l \times N_l}$:

$$G^l(I) = \frac{1}{N_l\,M_l}\, F^l(I)\, F^l(I)^{\mathsf{T}}.$$
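The normalized Gramian computation can be sketched in NumPy as follows; the random feature map stands in for a mid-layer VGG-19 response, and the $1/(N_l M_l)$ normalization is one plausible choice.

```python
import numpy as np

# Normalized Gram (feature-correlation) matrix of a convolutional feature
# map with N channels and M spatial positions. In the paper the feature
# map comes from mid-layers of a pre-trained VGG-19; here it is random.
rng = np.random.default_rng(2)
n_channels, n_pixels = 16, 32 * 32
F = rng.standard_normal((n_channels, n_pixels))   # layer response, shape (N_l, M_l)

def gram(F):
    """Normalized Gramian G = F F^T / (N * M), shape (N_l, N_l)."""
    n, m = F.shape
    return (F @ F.T) / (n * m)

G = gram(F)   # symmetric, channel-by-channel correlation statistics
```

Because spatial positions are summed out, the Gram matrix captures which filters co-activate (texture statistics) while discarding where they activate, which is exactly what makes it a useful detail descriptor.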
We show that, for a face texture, the feature responses from the later layers and the correlation matrices from the earlier ones sufficiently characterize the facial details to ensure photorealism and perceptually identical images. A complete and photorealistic face texture can then be inferred from this information using the partially visible face in the input image.
As only the low-frequency appearance is encoded in the last few layers, exploiting the feature response of the complete low-frequency albedo map optimized in Section 3 gives us an estimate of the desired feature response on those layers.
The remaining problem is to extract such a feature correlation (for the complete face) from a partially visible face as illustrated in Figure 3.
Feature Correlation Extraction.
A key observation is that correlation matrices obtained from images of different faces can be linearly blended, and the combined matrices still produce realistic results; see Figure 4 for an example of matrices blended from 4 up to 256 images. Hence, we conjecture that the desired correlation matrix can be linearly combined from such matrices, given a sufficiently large database.
However, the input image, denoted $I_0$, often contains only a partially visible face, so we can only obtain the correlations in a partial region. To eliminate the change of correlation due to differing visibility, the complete textures in the database are masked out and their correlation matrices are recomputed to simulate the same visibility as the input. We define a mask-out function $\mathcal{M}$ that removes all non-visible pixels:

$$\mathcal{M}(I)(p) = \begin{cases} I(p) & \text{if } p \text{ is visible}, \\ 0.5 & \text{otherwise}, \end{cases}$$

where $p$ is an arbitrary pixel; we choose 0.5 as the constant intensity for non-visible regions. The new correlation matrix of layer $l$ for each image $I_k$ in the dataset is then $\tilde{G}^l_k = G^l(\mathcal{M}(I_k))$.
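A sketch of this mask-out step, with a hypothetical visibility mask; for simplicity the masked texture itself stands in for a network feature map when the correlation is computed.

```python
import numpy as np

# Mask-out: non-visible pixels of every database texture are set to a
# constant (0.5) so that their Gram matrices are computed under the same
# visibility as the partially observed input.
rng = np.random.default_rng(3)

def maskout(texture, visible):
    """texture: (H, W) array, visible: boolean (H, W) -> masked copy."""
    out = np.full_like(texture, 0.5)
    out[visible] = texture[visible]
    return out

def gram(F):
    n, m = F.shape
    return (F @ F.T) / (n * m)

texture = rng.random((8, 8))
visible = np.zeros((8, 8), dtype=bool)
visible[:, :4] = True                     # e.g. only the left half is visible

masked = maskout(texture, visible)
# Correlation of a stand-in "feature map" extracted from the masked texture;
# a real system would run the masked texture through VGG-19 first.
G_masked = gram(masked.reshape(1, -1))
```

Applying the same mask to the input and to every database texture makes their correlation matrices directly comparable.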
MultiScale Detail Reconstruction.
Given the correlation matrices derived from our database, we can find optimal blending weights $w$ to linearly combine them so as to minimize the difference from the correlations observed in the input $I_0$:

$$w^\ast = \operatorname*{arg\,min}_{w}\; \sum_l \Big\lVert \sum_k w_k\, \tilde{G}^l_k - G^l(I_0) \Big\rVert_F^2 \quad \text{s.t.} \quad w_k \geq 0,\; \sum_k w_k = 1. \tag{2}$$

Here, the Frobenius norms of the correlation matrix differences on the different layers are accumulated. Note that we add extra constraints to the blending weights so that the blended correlation matrix lies within the convex hull of the matrices derived from the database. While a simple least-squares optimization without constraints can fit the observed correlation matrix well, artifacts can occur if the observed region of the input is of poor quality. Enforcing convexity reduces such artifacts, as shown in Figure 5.
After obtaining the blending weights, we can simply compute the correlation matrix for the whole image as $\hat{G}^l = \sum_k w^\ast_k\, G^l(I_k)$, using the unmasked database textures.
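The constrained weight fit of Equation 2 can be sketched with SciPy's SLSQP solver. The database Gram matrices below are random symmetric stand-ins, and the single-layer objective is a simplification of the multi-layer sum.

```python
import numpy as np
from scipy.optimize import minimize

# Fit convex blending weights w (w_k >= 0, sum w_k = 1) so that the convex
# combination of database Gram matrices matches an observed Gram matrix.
rng = np.random.default_rng(4)
K, N = 6, 8                                  # database size, Gram dimension
db = [rng.random((N, N)) for _ in range(K)]
db = [0.5 * (G + G.T) for G in db]           # symmetrize, like real Gram matrices

w_true = np.array([0.5, 0.3, 0.2, 0.0, 0.0, 0.0])
target = sum(w * G for w, G in zip(w_true, db))  # synthetic "observed" correlation

def objective(w):
    blended = sum(wk * Gk for wk, Gk in zip(w, db))
    return np.sum((blended - target) ** 2)   # squared Frobenius norm

w0 = np.full(K, 1.0 / K)                     # feasible uniform start
res = minimize(objective, w0, method="SLSQP",
               bounds=[(0.0, 1.0)] * K,
               constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}])
w_fit = res.x
```

Because the target was itself built as a convex combination, the solver recovers the generating weights; with real data the constraints instead keep the blend inside the convex hull of plausible face statistics.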
5 Texture Synthesis
After obtaining the estimated feature responses and correlations from $I_0$, the final step is to synthesize a complete albedo matching both. More specifically, we select a set of high-frequency-preserving layers $\mathcal{L}_h$ and a set of low-frequency-preserving layers $\mathcal{L}_f$, and match the correlations $\hat{G}^l$ and the feature responses $\hat{F}^l$ on the layers in these sets, respectively.
The desired albedo $A$ is computed via the following optimization:

$$A^\ast = \operatorname*{arg\,min}_{A}\; \sum_{l \in \mathcal{L}_f} \big\lVert F^l(A) - \hat{F}^l \big\rVert_F^2 \;+\; \beta \sum_{l \in \mathcal{L}_h} \big\lVert G^l(A) - \hat{G}^l \big\rVert_F^2, \tag{3}$$

where $\beta$ is a weight balancing the effect of high- and low-frequency details. As illustrated in Figure 6, we choose a fixed $\beta$ for all our experiments that preserves the details.
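A toy version of this Gram-matching synthesis, assuming a fixed linear "feature extractor" in place of VGG-19: we minimize the squared Frobenius distance between Gram matrices with L-BFGS and an analytic gradient. All dimensions and names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Optimize an image x so that the Gram matrix of a toy linear feature
# extractor matches a target Gram, using L-BFGS with the analytic gradient.
# The real system matches Gram matrices of several VGG-19 layers plus a
# feature-response (low-frequency) term.
rng = np.random.default_rng(5)
C_in, C_out, P = 3, 8, 16 * 16               # input channels, filters, pixels
W = rng.standard_normal((C_out, C_in))       # fixed "filter bank"

def features(x):
    return W @ x.reshape(C_in, P)            # feature map, shape (C_out, P)

def gram(F):
    n, m = F.shape
    return (F @ F.T) / (n * m)

target = gram(features(rng.random(C_in * P)))  # Gram of a "reference" texture

def loss_and_grad(x):
    F = features(x)
    n, m = F.shape
    diff = gram(F) - target
    loss = np.sum(diff ** 2)                 # squared Frobenius norm
    dF = (4.0 / (n * m)) * diff @ F          # analytic gradient w.r.t. F
    return loss, (W.T @ dF).ravel()          # chain rule back to the image

x0 = rng.random(C_in * P)                    # "low-frequency" initialization
res = minimize(loss_and_grad, x0, jac=True, method="L-BFGS-B")
```

With the analytic gradient, L-BFGS drives the Gram mismatch close to zero from the initialization, mirroring the role of the low-frequency albedo as a starting point.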
While this is a non-convex optimization problem, the gradient of the objective can be computed easily: $G^l$ can be considered an extra layer of the neural network after layer $l$, and the optimization above is similar to the process of training a neural network with a Frobenius-norm loss, except that our goal is to modify the input rather than to solve for the network parameters. For the Frobenius loss $\mathcal{L}(F) = \lVert F - C \rVert_F^2$, where $C$ is a constant matrix, and for the Gramian matrix $G = \frac{1}{N M} F F^{\mathsf{T}}$, the gradients can be computed analytically as follows:

$$\frac{\partial \mathcal{L}(F)}{\partial F} = 2\,(F - C), \qquad \frac{\partial \mathcal{L}(G)}{\partial F} = \frac{4}{N M}\,(G - C)\,F.$$

As the derivative of every high-frequency and low-frequency layer can be computed, we can apply the chain rule on this multi-layer neural network to backpropagate the gradient through the preceding layers, all the way to the first one, to obtain the gradient with respect to the input. Given the size of the variables in this optimization problem and the limitations of GPU memory, we follow the choice of Gatys et al. [18] and use an L-BFGS solver. We use the low-frequency albedo from Section 3 to initialize the optimization.

6 Results
We processed a wide variety of input images with subjects of different races, ages, and genders, including celebrities and people from the publicly available annotated faces-in-the-wild (AFW) dataset [43]. We cover challenging examples of scenes with complex illumination as well as non-frontal faces. As showcased in Figures 1 and 20, our inference technique produces high-resolution texture maps with complex skin tones and mesoscopic-scale details (pores, stubble hair), even from very low-resolution input images. Consequently, we are able to effortlessly produce high-fidelity digitizations of iconic personalities who have passed away, such as Muhammad Ali, or bring back their younger selves (e.g., a young Hillary Clinton) from a single archival photograph. Until recently, such results were only possible with high-end capture devices [55, 34, 20, 19] or intensive effort from digital artists. We also show photorealistic renderings of our reconstructed face models from the widely used AFW database, which reveal high-frequency pore structures, skin moles, and short facial hair. We clearly observe that low-frequency albedo maps obtained from a linear PCA model [6] are unable to capture these details. Figure 7 illustrates the estimated shape and also compares renderings of the low-frequency albedo against our final results. For the renderings, we use Arnold [49], a Monte Carlo ray tracer, with generic subsurface scattering, image-based lighting, procedural roughness and specularity, and a bump map derived from the synthesized texture.
Face Texture Database.
For our texture analysis method (Section 4) and to evaluate our approach, we built a large database of high-quality facial skin textures from the recently released Chicago Face Database [33], originally designed for psychological studies. The collection contains a balanced set of standardized high-resolution photographs of individuals of different ethnicities, ages, and genders. While the images were taken in a consistent environment, the shape and lighting conditions need to be estimated in order to recover a diffuse albedo map for each subject. We extend the method described in Section 3 to fit a PCA face model to all the subjects while solving for globally consistent lighting. Before applying the inverse illumination, we remove specularities in SUV color space [35] by filtering the specular peak in the S channel, since the faces were shot with a flash.
Evaluation.
We evaluate the performance of our texture synthesis with three widely used convolutional neural networks for image recognition (CaffeNet, VGG-16, and VGG-19) [5, 48]. While different models can be used, deeper architectures tend to produce fewer artifacts and higher-quality textures. To validate our use of all mid-layers of VGG-19 for the multi-scale representation of details, we show that if fewer layers are used, the synthesized textures become blurrier, as shown in Figure 8. While the texture synthesis formulation in Equation 3 suggests a blend between the low-frequency albedo and the multi-scale facial details, we aim to maximize the amount of detail and only use the low-frequency PCA model estimation for initialization.
As depicted in Figure 9, we also demonstrate that our method produces consistent high-fidelity texture maps of a subject captured from different views. Even for extreme profiles, highly detailed freckles are synthesized properly in the reconstructed textures. Please refer to our additional materials for more evaluations.
Comparison.
We compare our method with the state-of-the-art facial image generation technique visiolization [37] and the widely used morphable face models of Blanz and Vetter [6] in Figure 11. Both our method and visiolization produce higher-fidelity texture maps than a linear PCA model solution [6]. At increased resolutions, we can clearly see that our inference approach outperforms the statistical framework of Mohammed et al. [37] on mesoscopic-scale features such as pores and stubble hair, while their method suffers from random noise patterns.
Performance.
All our experiments are performed on an Intel Core i7-5930K CPU equipped with a GeForce GTX Titan X GPU with 12 GB of memory. Following the pipeline in Figures 2 and 3, our initial face model fitting takes less than a second; the texture analysis consists of the partial feature correlation extraction followed by the fitting with convex combinations, and the final synthesis optimization runs for 1000 iterations.
User Study A: Photorealism and Likeness.
To assess the photorealism and the likeness of our reconstructed faces, we propose a crowdsourced experiment on Amazon Mechanical Turk (AMT). We compare ground-truth photographs from the Chicago Face Database with renderings of textures generated with different techniques. These synthesized textures are composited onto the original images using the estimated lighting and shading parameters. We randomly select 11 images from the database and blur them with Gaussian filtering until the details are gone. We then synthesize high-frequency textures from these blurred images using (1) a PCA model, (2) visiolization, (3) our method using the closest feature correlation, (4) our method using unconstrained linear combinations, and (5) our method using convex combinations. We show the turkers the left and right halves of a face and inform them that the left half is always the ground truth. The right half has a 50% chance of being computer generated. The task consists of deciding whether the right half is “real” and identical to the ground truth, or “fake”. We summarize our analysis with the box plot in Figure 12. Overall, (5) outperforms all other solutions, and the different variants of our method have similar means and medians, which indicates that non-technical turkers have a hard time distinguishing between them.
User Study B: Our method vs. Light Stage Capture.
We also compare the photorealism of renderings produced with our method against those from the Light Stage [19]. We use an AMT interface that allows turkers to rank renderings from realistic to unrealistic. We show side-by-side renderings of 3D face models, as in Figure 13, using (1) our synthesized textures, (2) textures from the Light Stage, and (3) textures obtained using PCA model fitting [6]. We asked 100 turkers to each sort 3 randomly shuffled sets of pre-rendered images. We used three subjects and perturbed their head rotations to produce more samples. We found that our synthetically generated details can confuse the turkers for subjects that have smoother skin, which resulted in 56% thinking that the results from (1) are more realistic than those from (2). Also, 74% of the turkers found faces from (2) more realistic than those from (3), and 72% think that method (1) is superior to (3). Our experiments indicate that our results are visually comparable to those from the Light Stage and that the level of photorealism is challenging for a non-technical audience to judge.
7 Discussion
We have shown that digitizing high-fidelity albedo texture maps is possible from a single unconstrained image. Despite challenging illumination conditions, non-frontal faces, and low-resolution input, we can synthesize plausible appearances and realistic mesoscopic details. Our user study indicates that the resulting high-resolution textures can yield photorealistic renderings that are visually comparable to those obtained with a state-of-the-art Light Stage system. Mid-layer feature correlations are highly effective in capturing high-frequency details and the general appearance of a person. Our proposed neural synthesis approach can handle high-resolution textures, which is not possible with existing deep learning frameworks [13]. We also found that convex combinations are crucial when blending feature correlations in order to ensure consistent fine-scale details.
Limitations.
Our multi-scale detail representation currently does not allow us to control the exact placement of high-frequency details after synthesis. For instance, a mole could be generated in an arbitrary place even if it does not actually exist. The final optimization step of our synthesis is non-convex and requires a good initialization. As shown in Figure 7, the PCA-based albedo estimation can fail to capture the goatee of the subject, resulting in a synthesized texture without facial hair.
Future Work.
To extend our automatic characterization of high-frequency details, we wish to develop new ways of specifying the appearance of mesoscopic distributions using high-level controls. Next, we would like to explore the generation of fine-scale geometry, such as wrinkles, using a similar texture inference approach.
Acknowledgements
We would like to thank Jaewoo Seo and Matt Furniss for the renderings. We also thank Joseph J. Lim, Kyle Olszewski, Zimo Li, and Ronald Yu for the fruitful discussions and the proofreading. This research is supported in part by Adobe, Oculus & Facebook, Huawei, the Google Faculty Research Award, the Okawa Foundation Research Grant, the Office of Naval Research (ONR) / U.S. Navy, under award number N000141512639, the Office of the Director of National Intelligence (ODNI) and Intelligence Advanced Research Projects Activity (IARPA), under contract number 201414071600010, and the U.S. Army Research Laboratory (ARL) under contract W911NF14D0005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, ARL, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Appendix I. Additional Results
Our main results in the paper demonstrate successful inference of high-fidelity texture maps from unconstrained images. The input images mostly have low resolutions and non-frontal faces, and the subjects are often captured in challenging lighting conditions. We provide additional results with pictures from the annotated faces-in-the-wild (AFW) dataset [43] to further demonstrate how photorealistic pore-level details can be synthesized using our deep learning approach. We visualize in Figure 20 the input, the intermediate low-frequency albedo map obtained using a linear PCA model, and the synthesized high-frequency albedo texture map. We also show several views of the final renderings using the Arnold renderer [49]. We refer to the accompanying video for additional rotating views of the resulting textured 3D face models.
Evaluation.
As Figure 14 indicates, other deep convolutional neural networks can be used to extract mid-layer feature correlations that characterize multi-scale details, but deeper architectures appear to produce fewer artifacts and higher-quality textures. All three convolutional neural networks are pre-trained for classification tasks on images from the ImageNet object recognition dataset [11]. The results of the 8-layer CaffeNet [5] show noticeable blocky artifacts in the synthesized textures, and those of the 16-layer VGG [48] are slightly noisy around boundaries, while the 19-layer VGG network performs best.

We also evaluate the robustness of our inference framework for downsized image resolutions in Figure 15. We crop a diffusely lit face from a Light Stage capture [19] and decrease its resolution. In addition to complex skin pigmentation, even the tiny mole on the lower left cheek is properly reconstructed from the reduced input image using our synthesis approach.
Comparison.
We provide in Figure 16 additional visualizations of our method when using the closest feature correlation, unconstrained linear combinations, and convex combinations. We also compare against a PCA-based model fitting approach [6] and the state-of-the-art visiolization framework [37]. We notice that only our proposed technique using convex combinations is effective in generating mesoscopic-scale texture details. Both visiolization and the PCA-based model result in lower-frequency textures and faces less similar to the ground truth. Since our inference also fills holes, we compare our synthesis technique with a general inpainting solution for predicting unseen face regions. We test against the widely used PatchMatch [2] technique, as illustrated in Figure 17. Unsurprisingly, we observe unwanted repeating structures and semantically implausible fillings, since this method is based on low-level vision cues.
Appendix II. User Study Details
This section gives further details and discussions about the two user studies presented in the paper. Figures 18 and 19 also show the user interfaces that we deployed on Amazon Mechanical Turk (AMT).
User Study A: Photorealism and Likeness.
We recall that method (1) is obtained using PCA model fitting, (2) is visiolization, (3) is our method using the closest feature correlation, (4) our method using unconstrained linear combinations, and (5) our method using convex combinations. We use downsized and cropped photographs from the Chicago Face Database [33] for this evaluation, and then apply one iteration of Gaussian filtering with a kernel size of 5 to remove all the facial details. Only 65.6% of the real images shown on the right have been correctly marked as “real”. This is likely due to the fact that the turkers know that only 50% are real, which affects their confidence in distinguishing real faces from digital reconstructions. Results based on PCA model fittings have few occurrences of false positives, which indicates that turkers can reliably identify them. The faces generated using visiolization also appear less realistic and less similar than those obtained using the variants of our method. For our variants, (3), (4), and (5), we measure similar means and medians, which indicates that non-technical turkers have a hard time distinguishing between them. However, method (4) has a higher chance than variant (3) of being marked as “real”, and the convex combination method (5) achieves the best results, as turkers occasionally notice artifacts in (4). Note that the left and right halves of the face are swapped in the AMT interface to prevent users from comparing texture transitions.
User Study B: Our method vs. Light Stage Capture.
We used three subjects (due to limited availability) and randomly perturbed their head rotations to produce more rendering samples. To obtain a consistent geometry for the Light Stage data, we warped our mesh to fit their raw scans using non-rigid registration [30]. All examples are rendered using full-on diffuse lighting, and our input image to the inference framework has a resolution of pixels. We asked 100 turkers to sort 3 sets of renderings, one for each of the three subjects. Surprisingly, we found that 56% think that ours are more realistic than those obtained from the Light Stage, 74% of the turkers found the results of (2) to be more realistic than (3), and 72% think that ours is superior to (3). We believe that the over 20% of turkers who rated (3) better than the two other methods are outliers. After removing these outliers, we still have 57% who believe that our results are more photorealistic than those from the Light Stage. We believe that our synthetically generated fine-scale details confuse the turkers for subjects that have smoother skin in reality. Overall, our experiments indicate that the performance of our method is visually comparable to ground truth data obtained from a high-end facial capture device. For a non-technical audience, it is hard to tell which of the two methods produces more photorealistic results.
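The outlier-removal step can be reproduced with a short sketch. Here method 1 denotes ours, 2 the Light Stage renderings, and 3 the third method; the per-turker rankings below are synthetic counts chosen only to illustrate the computation, not the raw study data.

```python
def preference_after_outlier_removal(rankings):
    """Drop turkers who ranked method 3 first (treated as outliers),
    then report the share of the remainder ranking ours (1) above
    the Light Stage (2)."""
    kept = [r for r in rankings if r[0] != 3]
    ours_preferred = sum(1 for r in kept if r.index(1) < r.index(2))
    return ours_preferred / len(kept)

# Illustrative rankings: 45 turkers rank ours first, 34 the Light
# Stage first, and 21 rank method 3 first (treated as outliers).
rankings = [[1, 2, 3]] * 45 + [[2, 1, 3]] * 34 + [[3, 1, 2]] * 21
```

With these illustrative counts, 45 of the 79 retained turkers prefer ours, i.e., roughly 57%.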
References
 [1] B. Amberg, A. Blake, and T. Vetter. On compositional image alignment, with an application to active appearance models. In IEEE CVPR, pages 1714–1721, 2009.
 [2] C. Barnes, E. Shechtman, A. Finkelstein, and D. Goldman. PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):24, 2009.
 [3] J. T. Barron and J. Malik. Shape, albedo, and illumination from a single image of an unknown object. CVPR, 2012.
 [4] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, and M. Gross. High-quality single-shot capture of facial geometry. ACM Trans. on Graphics (Proc. SIGGRAPH), 29(3):40:1–40:9, 2010.
 [5] Berkeley Vision and Learning Center. CaffeNet, 2014. https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.
 [6] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194, 1999.
 [7] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. A 3d morphable model learnt from 10,000 faces. In IEEE CVPR, 2016.
 [8] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE TVCG, 20(3):413–425, 2014.
 [9] C. Cao, H. Wu, Y. Weng, T. Shao, and K. Zhou. Real-time facial animation with image-based dynamic avatars. ACM Transactions on Graphics (TOG), 35(4):126, 2016.
 [10] P. Debevec, T. Hawkins, C. Tchou, H.-P. Duiker, and W. Sarokin. Acquiring the Reflectance Field of a Human Face. In SIGGRAPH, 2000.
 [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
 [12] R. Donner, M. Reiter, G. Langs, P. Peloschek, and H. Bischof. Fast active appearance model search using canonical correlation analysis. IEEE TPAMI, 28(10):1690–1694, 2006.
 [13] C. N. Duong, K. Luu, K. G. Quach, and T. D. Bui. Beyond principal components: Deep boltzmann machines for face modeling. In IEEE CVPR, pages 4786–4794, 2015.
 [14] G. J. Edwards, C. J. Taylor, and T. F. Cootes. Interpreting face images using active appearance models. In Proceedings of the 3rd. International Conference on Face and Gesture Recognition, FG ’98, pages 300–. IEEE Computer Society, 1998.
 [15] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’01, pages 341–346. ACM, 2001.
 [16] A. A. Efros and T. K. Leung. Texture synthesis by nonparametric sampling. In IEEE ICCV, pages 1033–, 1999.
 [17] L. A. Gatys, M. Bethge, A. Hertzmann, and E. Shechtman. Preserving color in neural artistic style transfer. CoRR, abs/1606.05897, 2016.
 [18] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In IEEE CVPR, pages 2414–2423, 2016.
 [19] A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, and P. Debevec. Multiview face capture using polarized spherical gradient illumination. ACM Trans. Graph., 30(6):129:1–129:10, 2011.
 [20] A. Ghosh, T. Hawkins, P. Peers, S. Frederiksen, and P. Debevec. Practical modeling and acquisition of layered facial reflectance. In ACM Transactions on Graphics (TOG), volume 27, page 139. ACM, 2008.
 [21] A. Golovinskiy, W. Matusik, H. Pfister, S. Rusinkiewicz, and T. Funkhouser. A statistical model for synthesis of detailed facial geometry. ACM Trans. Graph., 25(3):1025–1034, 2006.
 [22] P. Graham, B. Tunwattanapong, J. Busch, X. Yu, A. Jones, P. Debevec, and A. Ghosh. Measurement-based Synthesis of Facial Microgeometry. In EUROGRAPHICS, 2013.
 [23] A. Haro, B. Guenter, and I. Essa. Real-time, Photorealistic, Physically Based Rendering of Fine Scale Human Skin Structure. In S. J. Gortler and K. Myszkowski, editors, Eurographics Workshop on Rendering, 2001.
 [24] A. E. Ichim, S. Bouaziz, and M. Pauly. Dynamic 3d avatar creation from handheld video input. ACM Trans. Graph., 34(4):45:1–45:14, 2015.
 [25] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE CVPR, pages 1867–1874, 2014.
 [26] I. Kemelmacher-Shlizerman. Internet-based morphable model. IEEE ICCV, 2013.
 [27] I. Kemelmacher-Shlizerman and R. Basri. 3d face reconstruction from a single image using a single reference face shape. IEEE TPAMI, 33(2):394–405, 2011.
 [28] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. In ACM SIGGRAPH 2003 Papers, SIGGRAPH ’03, pages 277–286. ACM, 2003.
 [29] C. Li, K. Zhou, and S. Lin. Intrinsic face image decomposition with human face priors. In ECCV (5)’14, pages 218–233, 2014.
 [30] H. Li, B. Adams, L. J. Guibas, and M. Pauly. Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2009), 28(5), 2009.
 [31] H. Li, L. Trutoiu, K. Olszewski, L. Wei, T. Trutna, P.-L. Hsieh, A. Nicholls, and C. Ma. Facial performance sensing head-mounted display. ACM Transactions on Graphics (Proceedings SIGGRAPH 2015), 34(4), July 2015.
 [32] C. Liu, H.-Y. Shum, and W. T. Freeman. Face hallucination: Theory and practice. Int. J. Comput. Vision, 75(1):115–134, 2007.
 [33] D. S. Ma, J. Correll, and B. Wittenbrink. The chicago face database: A free stimulus set of faces and norming data. Behavior Research Methods, 47(4):1122–1135, 2015.
 [34] W.-C. Ma, T. Hawkins, P. Peers, C.-F. Chabert, M. Weiss, and P. Debevec. Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination. In Eurographics Symposium on Rendering, 2007.
 [35] S. P. Mallick, T. E. Zickler, D. J. Kriegman, and P. N. Belhumeur. Beyond lambert: Reconstructing specular surfaces using color. In IEEE CVPR, pages 619–626, 2005.
 [36] I. Matthews and S. Baker. Active appearance models revisited. Int. J. Comput. Vision, 60(2):135–164, 2004.
 [37] U. Mohammed, S. J. D. Prince, and J. Kautz. Visiolization: Generating novel facial images. In ACM SIGGRAPH 2009 Papers, pages 57:1–57:8. ACM, 2009.
 [38] K. Nagano, G. Fyffe, O. Alexander, J. Barbič, H. Li, A. Ghosh, and P. Debevec. Skin microstructure deformation with displacement map convolution. ACM Transactions on Graphics (Proceedings SIGGRAPH 2015), 34(4), 2015.
 [39] K. Olszewski, J. J. Lim, S. Saito, and H. Li. High-fidelity facial and speech animation for VR HMDs. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2016), 35(6), December 2016.
 [40] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In IEEE CVPR, 2016.
 [41] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In Advanced video and signal based surveillance, 2009. AVSS’09. Sixth IEEE International Conference on, pages 296–301. IEEE, 2009.
 [42] R. Ramamoorthi and P. Hanrahan. An efficient representation for irradiance environment maps. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 497–500. ACM, 2001.
 [43] D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE CVPR, pages 2879–2886, 2012.
 [44] S. Romdhani. Face image analysis using a multiple features fitting strategy. PhD thesis, University of Basel, 2005.
 [45] S. Romdhani and T. Vetter. Estimating 3d shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In CVPR (2), pages 986–993, 2005.
 [46] S. Saito, T. Li, and H. Li. Real-time facial segmentation and performance capture from RGB input. In ECCV, 2016.
 [47] F. Shi, H.-T. Wu, X. Tong, and J. Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. Graph., 33(6):222:1–222:13, 2014.
 [48] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [49] Solid Angle, 2016. http://www.solidangle.com/arnold/.
 [50] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz. Total moving face reconstruction. In ECCV, pages 796–812. Springer, 2014.
 [51] The Digital Human League. Digital Emily 2.0, 2015. http://gl.ict.usc.edu/Research/DigitalEmily2/.
 [52] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Real-time face capture and reenactment of RGB videos. In IEEE CVPR, 2016.
 [53] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cognitive Neuroscience, 3(1):71–86, 1991.

 [54] L.-Y. Wei and M. Levoy. Fast texture synthesis using tree-structured vector quantization. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 479–488, 2000.
 [55] T. Weyrich, W. Matusik, H. Pfister, B. Bickel, C. Donner, C. Tu, J. McAndless, J. Lee, A. Ngan, H. W. Jensen, and M. Gross. Analysis of human faces using a measurement-based skin reflectance model. ACM Trans. on Graphics (Proc. SIGGRAPH 2006), 25(3):1013–1024, 2006.
 [56] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. CoRR, abs/1511.07212, 2015.