Hair is a critical component of human subjects. Rendering virtual 3D hair models into realistic images has long been studied in computer graphics, due to the extremely complicated geometry and material of human hair. The traditional graphics rendering pipeline simulates every aspect of natural hair appearance, including surface shading, light scattering, semi-transparent occlusions, and soft shadowing. It does so by leveraging physics-based shading models of hair fibers, global illumination algorithms that capture mutual interactions between fibers and the environment, and artistically designed material parameters. Given the extreme complexity of the geometry and the associated lighting effects of hair, such a direct approximation of physical hair appearance requires a highly detailed 3D model, carefully tuned material parameters, and a huge amount of rendering computation. These costs are often unaffordable for interactive scenarios that require efficient feedback and user-friendly interaction, such as games and real-time photo editing software.
With recent advances in generative adversarial networks, it becomes natural to formulate hair rendering as a special case of the conditional image generation problem, with the hair structure controlled by the 3D model while the realistic appearance is synthesized by a neural network. In the context of image-to-image translation, one of the major challenges is how to bridge the source and target domains for proper translation. Most existing hair generation methods fall into the supervised category, which demands enough training image pairs to provide direct supervision. For example, sketch-based hair generation methods [lee2019maskgan, jo2019sc, qiu2019two] construct training pairs by synthesizing user sketches from real images. While a number of such methods were introduced, rendering 3D hair models with the help of neural networks did not receive similar treatment. The existing work on the topic [wei2018real] requires that the real and fake domains overlap considerably, such that a common structure is present in both. This is achieved at the cost of a very complicated, high-quality strand-level model, which allows extracting edge and orientation maps from rendered hair strands to serve as a common representation of hair structure between real photos and fake models. However, preparing such a high-quality strand-level hair model is itself an expensive and non-trivial task even for a professional artist, which significantly restricts the application scope of this method.
In this paper, we propose a generic neural-network-based hair rendering pipeline that provides efficient and realistic rendering of a generic low-quality 3D hair model, borrowing material features extracted from an arbitrary reference hair image. Instead of using complicated strand-level models to match real-world hair as in [wei2018real], we allow users to use any type of hair model, requiring only that the anisotropic structure of hair strands be properly represented. In particular, we adopt sparse polygon strip meshes, which are much more widely used in interactive applications [ward2007survey]. Given the dramatic difference between such coarse geometry and real hair, we cannot design common structure representations at the model level; supervised image translation methods are therefore infeasible due to the lack of paired data.
To bridge the domains of real hair images and low-quality virtual hair models in an unsupervised manner, we propose constructing a shared latent space between the real and fake domains, which encodes a common structural representation even though the inputs of the two domains are entirely different, and rendering a realistic hair image from this latent space with the appearance conditioned on an extra input. This is achieved by: 1) feeding domain-specific structure encodings as the network inputs, to pre-disentangle geometric structure from chromatic appearance for both real hair images and 3D models; 2) adopting a UNIT [liu2017unsupervised]-like architecture, which enables a common latent space by partially sharing encoder weights between two auto-encoder branches trained with in-domain supervision; 3) introducing a structure discriminator to further match the distributions of the encoded structure features; 4) enforcing supervised reconstruction on both branches to guarantee that all necessary structure information is kept in the shared feature space. In addition, to enable temporally smooth animation rendering, we introduce a simple yet effective temporal conditioning method that is trained with single-image data only, utilizing the exact motion fields of the hair model. We demonstrate the effectiveness of the pipeline and each key component through extensive testing on a large number of diverse human portraits and various hair models. We also compare our method with general unsupervised image translation methods, and show that, due to the limited sampling ability on the synthetic hair domain, all existing methods fail to produce convincing results.
2 Related Work
Conditional neural hair rendering belongs to a wide range of problems tackling portrait manipulation and editing. A number of methods in the literature address this cross-domain generation problem, including paired and unpaired image-to-image translation and style transfer.
Image-to-image translation aims at converting images from one domain to another while keeping the structure of the source image unchanged. The literature contains various methods performing this task in a variety of settings. Paired image-to-image translation methods [isola2017image, wang2018high] operate when pairs of images in the source and the target domains are available. Such methods, for example, translate from semantic labels to scenes [wang2018high, park2019semantic, chen2017photographic], from edges to objects [sangkloy2017scribbler], and perform image super-resolution [ledig2017photo, johnson2016perceptual]. However, paired data are not available in many practical applications. Unsupervised image-to-image translation tackles the setting in which paired data are not available, while sampling from two domains is possible [liu2019few, taigman2016unsupervised, zhu2017unpaired, dundar2018domain, shrivastava2017learning, liu2017unsupervised, huang2018multimodal]. Clearly, unpaired image-to-image translation is an ill-posed problem, for there are numerous ways an image can be transformed to a different domain. Hence, recently proposed methods introduce constraints to limit the number of possible transformations. Some studies enforce certain domain properties [bousmalis2017unsupervised, shrivastava2017learning], while other concurrently introduced works apply cycle consistency to transform images between different domains, such as horse-to-zebra and day-to-night [yi2017dualgan, zhu2017unpaired, kim2017learning]. Our work differs from existing studies in that we focus on a specific and challenging problem, realistic hair generation, where we want to translate manually designed hair models from the domain of rendered images to the domain of real hair. For controllable hair generation, we leverage the rendered hair structure and an arbitrary hair appearance to synthesize diverse realistic hairstyles. A further difference of our work from the image-to-image translation literature is the unbalanced data: the domain of images containing real hair is far more diverse than that of rendered hair, making it challenging for classical image-to-image translation methods to address the problem.
Neural style transfer and manipulation is related to image-to-image translation in that the image style is changed while the content is maintained [chen2016fast, gatys2016image, huang2017arbitrary, li2016precomputed, li2017diversified, li2017demystifying, ulyanov2016texture, hertzmann2001image]. Style in this case represents the unique style of an artist [gatys2016image, ulyanov2016texture] or is copied from an example image provided by the user. Our work follows the idea of example-guided style transfer, in that the hair style is obtained from a real reference image. However, instead of changing the style of the whole image, we aim to keep the appearance of the human face and background unchanged while having full control over the hair region. Therefore, instead of following existing works that inject style features into image generation networks directly [huang2017arbitrary, park2019semantic], we propose a new architecture that combines only the hair appearance features with latent features that encode the image content and the adapted hair structure. This way, only the style of the hair region is manipulated according to the provided exemplar image.
Domain Adaptation addresses the domain-shift problem that widely exists between source and target domains [saenko2010adapting]. Various feature-based methods have been proposed to tackle the problem [kulis2011you, gong2012geodesic, gopalan2011domain, fernando2013unsupervised, tzeng2014deep]. Recent works on adversarial learning for aligning embedded features between source and target domains achieve better results than previous studies [ganin2014unsupervised, ganin2016domain, liu2016coupled, tsai2018learning, hoffman2017cycada, tzeng2017adversarial]. Efforts using domain adaptation for both classification and pixel-level prediction tasks have made significant progress [bousmalis2017unsupervised, chen2017no, tsai2018learning]. In this work, we follow the challenging setting of unsupervised domain adaptation, in which there is no corresponding annotation between source and target domains. We aim to learn an embedding space that contains only hair structure information for both the rendered and real domains. Considering the domain gap, instead of using the original images as input, we feed rendered and real structure maps to the encoders, which contain both domain-specific layers and shared layers, to obtain latent features. The adaptation is achieved by adversarial training and image reconstruction.
Hair Rendering and Generation shares a similar goal with our paper: synthesizing photo-realistic hair images. Traditional graphics hair rendering methods focus on improving rendering quality and performance, either by more accurately modeling the special material and lighting behaviors of hair [marschner2003light, moon2006simulating, deon2011energy, yan2015physically], or by approximating certain aspects of the rendering pipeline to reduce the computational complexity [zinke2008dual, moon2008efficient, sadeghi2010artist, ren2010interactive, xu2011interactive]. However, the extremely large computational cost of realistic hair rendering usually prevents these methods from being applied directly in real-time applications. Utilizing the latest advances in GANs, recent works [lee2019maskgan, jo2019sc, qiu2019two] achieved impressive progress on conditioned hair image generation framed as supervised image-to-image translation. A GAN-based hair rendering method [wei2018real] proposes to perform conditioned 3D hair rendering by starting with a common structure representation and progressively enforcing various conditions. However, it requires the hair model to produce a representation (a strand orientation map) consistent with real images, which is challenging for low-quality mesh-based models, and it cannot achieve temporally smooth results.
Despite recent progress on conditional hair generation and image-to-image translation, the problem of realistic spatio-temporal rendering of low-quality 3D hair models remains largely unaddressed. In this paper, we provide the necessary treatment of the problem, achieving photo-realistic renderings as well as temporal stability of the rendered hair.
Problem Formulation. Let h be the target 3D hair model; with camera parameters c and hair material parameters m, we formulate the traditional graphics rendering pipeline as I = R(h, c, m). Likewise, our neural-network-based rendering pipeline is defined as I = R_n(h', c, f), with a low-quality hair model h' and material features f extracted from an arbitrary reference hair image I_ref.
3.1 Overview of Network Architecture
The overall system pipeline is shown in Fig.1. It consists of two parallel branches, one for each domain: real photos (i.e., real) and synthetic renderings (i.e., fake).
On the encoding side, the structure adaptation subnetwork, which includes a real encoder E_r and a fake encoder E_f, achieves cross-domain structure embedding. Similar to UNIT [liu2017unsupervised], we share the weights of the last few ResNet layers in E_r and E_f to extract a consistent structural representation from the two domains. In addition, a structure discriminator D_s is introduced to match the high-level feature distributions between the two domains, further enforcing the shared latent space to be domain invariant.
On the decoding side, the appearance rendering subnetwork, which consists of the decoders G_r and G_f for the real and fake domains respectively, is attached after the shared latent space to reconstruct images in the corresponding domain. Besides the reconstruction losses, each decoder owns an exclusive domain discriminator to ensure that its reconstruction matches the domain distribution. Hair appearance is conditioned in an asymmetric way: G_r accepts an extra condition of material features extracted from a reference image by the material encoder E_m, while the unconditional decoder G_f is asked to memorize the appearance, which is done on purpose to ease training data generation (Sec.4.1).
At the training stage, all these networks are jointly trained using two sets of image pairs (s, I), one for each of the real and fake domains, where s represents a domain-specific structure representation of the corresponding hair image I in that domain. Both the real and fake branches try to reconstruct the image I from its paired structure image s independently through their own encoder-decoder networks, while the shared structural features are enforced to match each other consistently by the structure discriminator D_s. In the real branch, we set the appearance reference to the target image itself, so that it is fully reconstructed in a paired manner.
At the inference stage, only the fake branch encoder E_f and the real branch decoder G_r are activated. G_r generates the final realistic rendering using the structural features encoded by E_f on the hair model. The final rendering equation can be formulated as:

I = G_r(E_f(S(h', c)), E_m(I_ref)),

where the function S renders the structure-encoded image of the low-quality model h' under camera setting c, and E_m extracts material features from the reference image I_ref.
3.2 Structure Adaptation
The goal of the structure adaptation subnetwork, formed by the encoding parts of both branches, is to encode cross-domain structural features to support the final rendering. Since the inputs to both encoders are manually disentangled structure representations (Sec.4.1), the encoded features contain only structural information about the target hair. Moreover, the appearance information is either conditioned through an extra decoder input in a way that leaks no spatially-varying structural information (the real branch), or simple enough to be memorized by the decoder (the fake branch) (Sec.3.3); hence the encoded features should also include all the structural information necessary to reconstruct the image.
The encoders E_r and E_f share a similar network structure: five downsampling convolution layers followed by six ResBlks. The last two ResBlks are weight-sharing to enforce the shared latent space. The structure discriminator D_s follows PatchGAN [isola2017image] to distinguish between the latent feature maps from the two domains.
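The weight-sharing scheme above can be sketched in PyTorch as follows. Channel widths and the discriminator layout are illustrative assumptions; only the five downsampling layers, six ResBlks, and two shared tail blocks follow the text:

```python
import torch
import torch.nn as nn

class ResBlk(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class StructureEncoder(nn.Module):
    """One encoder branch: five stride-2 convolutions, then four private
    ResBlks followed by two ResBlks shared across branches."""
    def __init__(self, shared, in_ch=3, ch=64):
        super().__init__()
        downs, c = [], in_ch
        for _ in range(5):
            downs += [nn.Conv2d(c, ch, 4, stride=2, padding=1),
                      nn.ReLU(inplace=True)]
            c = ch
        self.down = nn.Sequential(*downs)
        self.private = nn.Sequential(*[ResBlk(ch) for _ in range(4)])
        self.shared = shared  # the SAME module object in both encoders
    def forward(self, x):
        return self.shared(self.private(self.down(x)))

shared_tail = nn.Sequential(ResBlk(64), ResBlk(64))
enc_real = StructureEncoder(shared_tail)
enc_fake = StructureEncoder(shared_tail)

# PatchGAN-style structure discriminator over the latent feature maps:
# it outputs a grid of per-patch logits rather than a single score.
struct_disc = nn.Sequential(
    nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 3, padding=1))
```

Passing the same `shared_tail` instance to both encoders makes the last two ResBlks literally shared parameters, so gradients from either branch update them.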
3.3 Appearance Rendering
The hair appearance rendering subnetwork decodes the shared cross-domain hair features into images. The decoders G_r and G_f have different network structures and do not share weights, since neural hair rendering is a unidirectional translation that maps the rendered 3D model in the fake domain to real images in the real domain. Therefore, G_f is only required to make sure the latent features encode all necessary information from the input 3D model, instead of learning to render various appearances. On the other hand, G_r is designed to accept arbitrary appearance inputs for realistic image generation.
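The conditional decoder relies on SPADE-style conditioning, where normalization parameters are predicted from a condition feature map. A condensed sketch of such a block is below; hidden widths are illustrative, and this is a simplified stand-in for the published SPADE formulation of [park2019semantic], not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-adaptive normalization: per-pixel scale and shift maps are
    predicted from a condition feature map instead of being learned
    constants."""
    def __init__(self, ch, cond_ch, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(ch, affine=False)
        self.mlp = nn.Sequential(nn.Conv2d(cond_ch, hidden, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, ch, 3, padding=1)
        self.beta = nn.Conv2d(hidden, ch, 3, padding=1)
    def forward(self, x, cond):
        # resample the condition map to the current feature resolution
        cond = F.interpolate(cond, size=x.shape[2:], mode='nearest')
        h = self.mlp(cond)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

class SPADEResBlk(nn.Module):
    """Residual block whose normalizations are both SPADE-conditioned."""
    def __init__(self, ch, cond_ch):
        super().__init__()
        self.n1, self.n2 = SPADE(ch, cond_ch), SPADE(ch, cond_ch)
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x, cond):
        h = self.c1(F.relu(self.n1(x, cond)))
        h = self.c2(F.relu(self.n2(h, cond)))
        return x + h
```

Because the condition map is resampled inside every block, the same appearance feature map can condition the generator at all decoder scales.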
Specifically, the unconditional decoder G_f starts with two ResBlks, followed by five consecutive upsampling transposed convolutional layers and one final convolutional layer. G_r adopts a similar structure, with each transposed convolutional layer replaced by a SPADE [park2019semantic] ResBlk, which uses appearance feature maps at different scales to condition the generation. Let the binary hair masks of the reference and target images be m_ref and m_tgt. The appearance encoder E_m extracts the appearance feature vector on the masked reference hair region with five downsampling convolutional layers and an average pooling. This feature vector is then used to construct the appearance feature map by duplicating it spatially within the target hair mask m_tgt, with zeros outside.
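The masked pooling and spatial duplication step can be sketched as follows (a minimal numpy version; in the paper the pooled vector comes from the learned encoder E_m, which is stubbed out here by an arbitrary feature map):

```python
import numpy as np

def appearance_feature_map(feat_ref, mask_ref, mask_tgt):
    """Masked average-pool over the reference hair region, then broadcast the
    pooled vector into the target hair mask (zero outside the hair).
    feat_ref: (C, H, W) feature map of the reference image;
    mask_ref, mask_tgt: (H, W) binary hair masks."""
    # average the features over reference hair pixels only
    vec = (feat_ref * mask_ref).sum(axis=(1, 2)) / max(mask_ref.sum(), 1)
    # duplicate the vector at every target hair pixel
    fmap = np.zeros_like(feat_ref)
    fmap[:, mask_tgt > 0] = vec[:, None]
    return fmap
```

This makes the condition spatially uniform inside the hair region, so the decoder receives appearance information without any leaked spatial structure from the reference.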
To make sure the reconstructed real image and the reconstructed fake image belong to their respective distributions, we apply two domain-specific discriminators, D_r and D_f, for the real and fake domains respectively, trained with standard adversarial losses.
We also adopt perceptual losses to measure the high-level feature distance on the paired data:

L_perc = Σ_l λ_l d(Φ_l(Î), Φ_l(I)),

where d is a feature-space distance, Î is the reconstructed image, I is its target, and Φ_l computes the activation feature map of the input image at the l-th selected layer of VGG-19 [simonyan2015very] pre-trained on ImageNet [russakovsky2015imagenet].
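The perceptual term above amounts to a weighted sum of per-layer feature distances. A minimal sketch, with a toy stand-in for the pretrained VGG-19 extractor and an L1 distance chosen for illustration:

```python
import numpy as np

def perceptual_loss(features, pred, target, weights):
    """Weighted feature-space distance between a reconstruction and its
    target. `features` stands in for the pretrained VGG-19 activations used
    in the paper: any function mapping an image to a list of feature maps."""
    return sum(w * np.abs(fp - ft).mean()
               for w, fp, ft in zip(weights, features(pred), features(target)))

# Toy stand-in extractor: the image itself plus a 2x-downsampled copy.
def toy_features(img):
    return [img, img[::2, ::2]]
```

In practice `features` would run the frozen VGG-19 and return the activations at the selected layers, with one weight per layer.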
Finally, the overall training objective (Eq.7) is the weighted sum of the adversarial losses of the structure and domain discriminators, together with the reconstruction and perceptual losses of both branches.
3.4 Temporal Conditioning
The aforementioned rendering network is able to generate plausible single-frame results. However, although the hair structure is controlled by smoothly varying structure inputs and the appearance is conditioned on a fixed feature map, the spatially-varying appearance details are still generated in a somewhat arbitrary manner, which tends to flicker in time (Fig.5). Fortunately, with the availability of the 3D model, we can calculate the exact hair motion flow w_t between each pair of frames t−1 and t, which can be used to warp an image generated at frame t−1 to frame t. We utilize these dense correspondences to enforce temporal smoothness.
Let {I_t} be the generated result sequence. We achieve temporal conditioning by simply using the warped result of the previous frame, w_t(I_{t−1}), as an additional condition, stacked with the appearance feature map, for the real branch decoder when generating the current frame I_t.
We make the network temporally consistent by changing the real branch decoder only; specifically, we fine-tune it temporally. During temporal training, we fix all other networks and use the same objective as Eq.7, but randomly concatenate the warped previous frame w_t(I_{t−1}) into the condition inputs of the SPADE ResBlks of the real branch decoder. The network thus learns to preserve temporal consistency when the previous frame is given as the temporal condition, and to generate from scratch when the temporal condition is set to zero.
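The warping step applied to the previous frame can be sketched as a backward warp driven by the model's motion flow. Nearest-neighbor sampling is used here for brevity (a bilinear sampler would be the usual choice); the flow convention below is an assumption for illustration:

```python
import numpy as np

def warp_backward(prev_img, flow):
    """Backward-warp the previous frame into the current one with
    nearest-neighbor sampling. prev_img: (H, W, C); flow: (H, W, 2) giving,
    for each current-frame pixel, the (dy, dx) offset of its source pixel in
    the previous frame."""
    h, w = prev_img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # clamp sampled coordinates to the image bounds
    sy = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_img[sy, sx]
```

Because the flow comes from the known 3D hair model rather than being estimated from images, the correspondence is exact wherever the hair is visible in both frames.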
Finally, the rendering equation for sequential generation extends the single-frame one by adding the warped previous result, w_t(I_{t−1}), as an extra condition to the real branch decoder.
In this section, we show the experimental results of our proposed neural hair rendering method and demonstrate its superiority over existing state-of-the-art works.
[Fig.2: example data pairs; fake domain: 3D hair, input, fake hair, structure; real domain: real hair, structure.]
4.1 Data Preparation
To train the proposed framework, we generate a dataset that includes image pairs (s, I) for both the real and fake domains. In each domain, a pair (s, I) provides the mapping from structure to image, where s encodes only the structure information and I is the corresponding image that conforms to the structure condition.
Real Domain. We adopt the widely used FFHQ [karras2019style] portrait dataset to generate the training pairs for the real branch, given that it contains hairstyles diverse in both shape and appearance. To prepare real data pairs, we use the original portrait photos from FFHQ as the images I, and generate the structure maps s to encode only the structural information of the hair. However, obtaining s is a non-trivial process, since a hair image contains material information besides structural knowledge. To fully disentangle structure from material, and to construct a universal structural representation of all real hair, we compute a dense pixel-level orientation map in the hair region, calculated with oriented filter kernels [paris2008hair]. Thus we obtain a structure map that consists only of local hair strand flow structures. Example generated pairs are presented in Fig.2b.
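An oriented-filter orientation map of this kind can be sketched as below. The kernel design is a simplified stand-in for the filters of [paris2008hair]; the angle count, kernel size, and sigma are illustrative:

```python
import numpy as np

def orientation_map(gray, mask, n_angles=8, ksize=7, sigma=1.0):
    """Dense strand-orientation estimate: correlate the image with a small
    bank of oriented ridge kernels and keep, per pixel, the angle of the
    strongest response (in degrees, [0, 180)), zeroed outside the hair
    mask."""
    h, w = gray.shape
    r = ksize // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    pad = np.pad(gray, r, mode='edge')
    resp = np.zeros((n_angles, h, w))
    for k in range(n_angles):
        theta = np.pi * k / n_angles
        # zero-mean ridge kernel aligned with direction theta
        dist = -np.sin(theta) * xx + np.cos(theta) * yy
        kern = np.exp(-(dist / sigma) ** 2)
        kern -= kern.mean()
        for dy in range(ksize):          # naive correlation, fine for a demo
            for dx in range(ksize):
                resp[k] += kern[dy, dx] * pad[dy:dy + h, dx:dx + w]
    angles = (180.0 / n_angles) * np.abs(resp).argmax(axis=0)
    return angles * (mask > 0)
```

Quantizing the per-pixel argmax over a filter bank keeps only local strand direction, discarding color and shading, which is exactly the disentanglement the structure map requires.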
For training and validation, we randomly split the FFHQ images into a training set and a held-out test set. For each image, we perform hair segmentation using an off-the-shelf model and calculate the orientation map for the hair region.
Fake Domain. There are multiple ways to model and render virtual hair. From coarse to fine, typical virtual hair models range from a single rigid shape, through coarse polygon strips representing detached hair wisps, to large numbers of thin hair fibers that mimic real-world hair behaviors. Due to the varying granularity of the geometry, no structural representation is readily shared across these model types or with real hair images. In our experiments, all hair models are based on polygon strips, since this type of model is widely adopted in real-time scenarios for being efficient to render and flexible to animate. To generate the structure map s for a given hair model and specified camera parameters, we use a smoothly varying color gradient as texture to render the model into a color image that embeds the structural information of the hair geometry. As for the paired image I, we use the traditional graphics rendering pipeline to render the model with a uniform appearance color and simple diffuse shading, so that the final synthetic renderings have a consistent appearance that can easily be disentangled without any extra condition, while keeping all structural information necessary to verify the effectiveness of the encoding step. Example pairs are shown in Fig.2a.
For the 3D hair used in the fake data pairs, we create five models (leftmost column in Fig.2): a middle-length hairstyle, a short punky hairstyle, a short hairstyle with two buns, a long hairstyle, and a long hairstyle with twin tails. The first four models are used for training, and the twin-tail model is used to evaluate the generalization capability of the network, since the network has never seen it. All these models consist of sparse sets of polygon strips, few enough for real-time applications. We use the same training set from the real domain to form the training pairs: each image is overlaid with one of the four 3D hair models according to the head position and pose. The image with the fake hair model is then used to generate I by rendering the hair model with simple diffuse shading, and s by exporting color textures that encode the surface tangents of the mesh. We strictly use the same shading parameters, including lighting and color, to enforce a uniform hair appearance that can easily be disentangled by the networks.
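Encoding surface tangents as a smooth color texture can be done with the usual vector-to-color packing; the exact mapping in the paper is not specified, so the linear [-1, 1] → [0, 1] scheme below is an assumption:

```python
import numpy as np

def encode_tangents(tangents):
    """Pack unit surface-tangent vectors (..., 3) into RGB colors in [0, 1],
    giving a smoothly varying gradient texture usable as the structure
    image."""
    t = tangents / np.linalg.norm(tangents, axis=-1, keepdims=True)
    return 0.5 * (t + 1.0)

def decode_tangents(colors):
    """Recover unit tangent vectors from the rendered structure colors."""
    t = 2.0 * colors - 1.0
    return t / np.linalg.norm(t, axis=-1, keepdims=True)
```

Because the mapping is invertible, the rendered color image retains the full per-pixel strand direction of the mesh.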
4.2 Implementation Details
We apply a two-stage learning strategy. During the first stage, all networks are trained jointly following Eq.7 for the single-image renderer. After that, we temporally fine-tune the real branch decoder to obtain the temporally smooth renderer, by introducing the additional temporal condition as detailed in Sec.3.4. To keep the networks of both stages consistent, we use the same condition input dimensions, including appearance and temporal conditions, but set the temporal condition to zero during the first stage; during the second stage, we randomly set it to zero. The network architecture discussed in Sec.3 is implemented using PyTorch. We adopt the Adam solver with stage-specific learning rates. All images share the same training resolution and mini-batch size, with fixed loss weights balancing the adversarial, reconstruction, and perceptual terms. All experiments are conducted on a workstation with Nvidia Tesla P100 GPUs.
4.3 Qualitative Results
We present visual hair rendering results for two settings in Fig.3. In the left three columns of Fig.3, the reference image is the input image itself: by applying a hair model, we can modify the hair shape while keeping the original hair appearance and orientation. In the right four columns, the reference image differs from the input; therefore both the structure and the appearance of the hair can be changed at the same time to render the hair in a new style. These flexible applications demonstrate that our method can be easily applied to modify hair and generate novel high-quality hair images.
4.4 Comparison Results
To the best of our knowledge, there is no previous work that tackles the problem of neural hair rendering, so a direct comparison is not feasible. However, since our method aims to bridge two different domains without ground-truth image pairs, which is closely related to unsupervised image translation, we compare our network with state-of-the-art unpaired image translation studies. It is important to stress that although hair rendering translation falls within the range of image translation problems, there exist fundamental differences from the generic unpaired image translation formulation, for the following two reasons.
First and foremost, translation is usually performed between two domains, such as painting styles or seasons/times of day, that contain roughly the same number of images and from which enough representative training images can be sampled to provide nearly uniform domain coverage. Our real and fake domains, in contrast, have dramatically different sizes: it is easy to collect a huge number of real human portrait photos with diverse hairstyles to form the real domain, but it is impossible to reach the same variety in the fake domain, since that would require manually designing every possible hair shape and appearance to cover the whole distribution of rendered fake hair. Therefore, we adopt the realistic assumption that only a limited set of such models is available for training and testing: we use four 3D models for training and one for testing, which is far from covering the variety of the fake domain.
Second, as a deterministic process, hair rendering should be conditioned strictly on both geometric shape and chromatic appearance, which can hardly be achieved with unconditioned image translation frameworks.
With those differences borne in mind, we compare our method against three unpaired image translation studies: CycleGAN [zhu2017unpaired], DRIT [lee2018diverse], and UNIT [liu2017unsupervised]. For the training of these methods, we use the same sets of images for both real and fake domains, and the default hyperparameters reported in the original papers. Additionally, we compare with images generated by the traditional graphics rendering pipeline, denoted as Graphic Renderer. Finally, we report two ablation studies to validate the soundness of the network and the importance of each step: 1) we first remove the structure discriminator (termed w/o SD); 2) we then additionally remove the shared latent space (termed w/o SL and SD).
[Tab.1: FID, IoU, and Accuracy for all methods; excerpt row — w/o SL and SD: 94.25, 80.10, 93.89.]
4.4.1 Quantitative comparison.
For quantitative evaluation, we adopt the FID (Fréchet Inception Distance) [heusel2017gans] to measure the distribution distance between the two domains. Moreover, inspired by the evaluation protocol of existing work [chen2017photographic, wang2018high], we apply a pre-trained hair segmentation model [svanera2016figaro] to the generated images to obtain a hair mask and compare it with the ground-truth hair mask. Intuitively, for realistic synthesized images, the segmentation model should predict a hair mask similar to the ground truth. To measure segmentation accuracy, we use both Intersection-over-Union (IoU) and pixel accuracy (Accuracy).
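The two segmentation metrics reduce to a few lines over binary masks; a minimal sketch:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union of two binary masks."""
    pred, gt = pred > 0, gt > 0
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def pixel_accuracy(pred, gt):
    """Fraction of pixels on which the two binary masks agree."""
    return ((pred > 0) == (gt > 0)).mean()
```

IoU penalizes both missed hair pixels and spurious ones, while pixel accuracy also credits the (typically large) non-hair background, which is why the two numbers differ in scale.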
The quantitative results are reported in Tab.1. Our method significantly outperforms the state-of-the-art unpaired image translation works and the graphics rendering approach on all three evaluation metrics. The low FID score proves that our method can generate high-fidelity hair images whose appearance distribution is similar to that of images from the real domain. The high IoU and Accuracy demonstrate the ability of the network to minimize the structure gap between the real and fake domains, so that the synthesized images follow the manually designed structure. Furthermore, the ablation analysis in Tab.1 shows that both the shared encoder layers and the structure discriminator are essential parts of the network: the shared encoder layers help the network find a common latent space that embeds hair structural knowledge, while the structure discriminator forces the hair structure features to be domain invariant.
4.4.2 Qualitative comparison.
The qualitative comparison of different methods is shown in Fig.4. Our generated images have much higher quality than the images synthesized by other state-of-the-art unpaired image translation methods: they have clearer hair masks, follow the hair appearance of the reference images, maintain the structure of the hair models, and look like natural hair. Compared with the ablation variants (Fig.4c and d), our full method (Fig.4b) follows the appearance of the reference images (Fig.4a) by generating hair with similar orientation.
We also show the importance of temporal conditioning (Sec.3.4) in Fig.5. Temporal conditioning helps us generate consistent and smooth video results, in which hair appearance and orientation are similar between consecutive frames. Without temporal conditioning, the hair texture can differ between frames, as indicated by the blue and green boxes, which may result in flickering in the synthesized video. Please refer to the supplementary video for more examples.
We propose a neural-based rendering pipeline for general virtual 3D hair models. The key idea of our method is that, instead of enforcing model-level representation consistency to enable supervised paired training, we relax the strict requirements on the model and adopt an unsupervised image translation framework. To bridge the gap between the real and fake domains, we construct a shared latent space that encodes a common structure feature space for both domains, even though their inputs are dramatically different. In this way, we can encode a virtual hair model into such a structure feature and switch it to the real generator to produce realistic renderings. The conditional real generator not only allows flexible conditioning on hair appearance, but can also accept an extra temporal condition to generate smooth sequential results.