This paper introduces DCT-Net, a novel image translation architecture for few-shot portrait stylization. Given limited style exemplars (∼100), the new architecture can produce high-quality style transfer results with advanced ability to synthesize high-fidelity contents and strong generality to handle complicated scenes (e.g., occlusions and accessories). Moreover, it enables full-body image translation via one elegant evaluation network trained by partial observations (i.e., stylized heads). Few-shot learning based style transfer is challenging since the learned model can easily become overfitted in the target domain, due to the biased distribution formed by only a few training examples. This paper aims to handle the challenge by adopting the key idea of "calibration first, translation later" and exploring the augmented global structure with locally-focused translation. Specifically, the proposed DCT-Net consists of three modules: a content adapter borrowing the powerful prior from source photos to calibrate the content distribution of target samples; a geometry expansion module using affine transformations to release spatially semantic constraints; and a texture translation module leveraging samples produced by the calibrated distribution to learn a fine-grained conversion. Experimental results demonstrate the proposed method's superiority over the state of the art in head stylization and its effectiveness on full image translation with adaptive deformations.
Portrait stylization, an essential part of digital art, aims to transform natural persons’ appearances into more creative interpretations in desired visual styles while maintaining personal identity. It renders source portraits with beautified or exaggerated effects and has enormous potential applications, including art creation, animation making, and virtual avatar generation. However, creating artistic portraiture demands specialized skills and substantial human labor for image creation and arrangement.
With the rapid development of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Mirza and Osindero, 2014), image-to-image translation methods (Isola et al., 2017; Wang et al., 2018b) have been introduced to automatically learn a function that maps images from one domain to another. Due to the unavailability of paired data, existing methods (Zhu et al., 2017; Chen et al., 2018, 2019; Kim et al., 2020) mainly utilize cycle consistency to learn a translation from source photos to cartoonized results. However, these methods still require a large amount of unpaired data and easily suffer from notable texture artifacts in complex scenes. Recently, stylizing faces by leveraging the pre-trained StyleGAN (Karras et al., 2019, 2020) has gained intensive attention (Pinkney and Adler, 2020; Richardson et al., 2021; Song et al., 2021). Compared to previous conditional generative models, these approaches make full use of the powerful generative capability of the unconditional StyleGAN, thus producing high-quality portraits with limited style exemplars. Due to the nature of unconditional generation, they typically learn a cartoon generator that maps random noise to cartoon images, and combine it with inversion algorithms based on optimization (Abdal et al., 2019, 2020; Creswell and Bharath, 2018) or learning (Perarnau et al., 2016; Bau et al., 2019a) to project real photos into latent codes in the StyleGAN space, thus achieving photo-to-cartoon translation. Despite the high-quality results, these methods often suffer from the content missing problem owing to their limited ability to generalize to arbitrary out-of-domain faces. Moreover, all these existing methods are tailored for heads and cannot handle full-body images.
The aim of this paper is to propose a new and effective method for portrait stylization that simultaneously achieves an advanced ability to synthesize high-fidelity contents, strong generality to handle complicated real-world scenes, and high scalability to transfer various styles. As depicted in Figure 1, given a small number of style exemplars (∼100), our method can translate arbitrary real faces to artistic portraits in the corresponding styles (e.g., 3D-cartoon, anime, and hand-drawn). Even full-body images can be properly processed with adaptive deformations (e.g., exaggerated facial features and faithful body textures). Due to the insufficient and partial observation (only head regions) of style exemplars as well as the diversity of real-world scenes, it is challenging to achieve this goal. In essence, the task is to learn cross-domain correspondences from the diverse source distribution to the biased target distribution formed by only a few training examples, as illustrated in Figure 2. The learned model can easily become overfitted in the target domain and thus generate unsatisfactory style transfer results.
The key insights of this paper are threefold. First, the “calibration first, translation later” strategy makes it easier to learn stable cross-domain translation and produce high-fidelity results. Second, the balanced source distribution can be used as a prior to calibrate the biased content distribution of the target domain. Third, releasing spatially semantic constraints via geometry expansion leads to more flexible and wider-range inference. To this end, we propose a simple yet effective solution for “domain-calibrated translation”, which first calibrates the content features of the target distribution by adapting the learned source generator to the target domain (i.e., borrowing the powerful content prior from the source). Then, the domain features are further enriched using a geometry expansion module. With these calibrated distributions, an adequate number of diverse examples depicting non-local correlation can be produced, and we train a U-net, a network with strong local behavior, to perform the cross-domain translation. This design makes our method capable of learning the augmented global structure with locally-focused translation and brings all-around improvements. Our trained model excels not only in preserving detailed contents (e.g., identity, accessories, and backgrounds), but also in handling complex scenes (e.g., heavy occlusions and rare poses). It also greatly increases the translation’s generalization capabilities, allowing out-of-domain translations such as full-body image translation. This brand-new task requires adaptive deformations even though the model is trained only on raw head collections. To the best of our knowledge, this is the first approach to propose the structure of “domain-calibrated translation” and show its superiority in the above aspects.
Style transfer is a kind of non-realistic rendering technique (Kyprianidis et al., 2012). Inspired by the power of CNNs, (Gatys et al., 2015, 2016) opened up a new field named Neural Style Transfer (NST) by presenting an optimization-based method for transferring the style of a given artwork to an image. Several subsequent works that specifically target portraits achieved impressive results. (Selim et al., 2016) proposed a head portrait painting method that locally transfers the color distributions of the example painting to other images. (Kaur et al., 2019) devised a method to transfer face texture from a style face image to a content face image in a photo-realistic manner. However, these methods are closely related to texture synthesis and fail to handle geometric transformations.
Image-to-image translation aims to learn a mapping between images in two different domains. (Isola et al., 2017) first proposed a supervised image translation model with conditional GANs (Mirza and Osindero, 2014), which was later extended to synthesize high-resolution images (Wang et al., 2018b). To alleviate the difficulty of acquiring paired data, (Zhu et al., 2017) proposed a cycle-consistency loss to use unpaired data for the translation task. A number of variants (Liu et al., 2017; Huang et al., 2018; Choi et al., 2020) have been developed thereafter to adapt this framework to different scenarios. Despite the dedicated architectures designed in these methods, their ability to generalize to discrepant domains is restricted. Some works (Cao et al., 2018; Shi et al., 2019; Gong et al., 2020) apply this framework to learn both texture and geometric styles for caricature generation. Exaggerations learned in this task rely on local warping features, which restricts them to specific styles. Recently, (Kim et al., 2020) incorporated a new attention module and a new learnable normalization function for unsupervised image translation tasks, which enables translations that require both holistic and large shape changes. Nevertheless, it still requires extensive unpaired training data and easily generates unstable results.
To support real image editing with pretrained GANs, a specific task known as GAN inversion is used to learn the natural image manifold and map images back into the latent space of a GAN model. Generally, there are three main techniques of GAN inversion (Xia et al., 2021), i.e., projecting an image into the corresponding latent space based on learning (Zhu et al., 2016; Perarnau et al., 2016; Bau et al., 2019a), optimization (Abdal et al., 2020; Ma et al., 2019), and hybrid formulations (Bau et al., 2019b; Zhu et al., 2020). With a novel style-based architecture, StyleGAN (Karras et al., 2019, 2020) has been shown to contain a semantically rich latent space that can be used for inversion tasks. Recently, (Viazovetskyi et al., 2020) distilled StyleGAN2 into an image-to-image network in a paired way for face editing. (Richardson et al., 2021) proposed a generic Pixel2Style2Pixel (PSP) encoder to extract the learned styles from the corresponding feature map, which can further be used to solve image-to-image translation tasks such as inpainting, super resolution, and portrait stylization. (Tov et al., 2021) designed a new encoder to facilitate higher editing quality on real images. (Pinkney and Adler, 2020) proposed a GAN interpolation framework for controllable cross-domain image synthesis, which can generate a “Toonified” version of the original image. More closely related to our approach is AgileGAN (Song et al., 2021), which introduces an inversion-consistent transfer learning framework for high-quality stylistic portraits. A later work (Ojha et al., 2021) tried to generate stylized paintings using few-shot exemplars via cross-domain correspondence. However, all these works suffer from the content missing problem and cannot tackle hard cases in real images (e.g., accessories and occlusions), due to their weak out-of-distribution generalization ability. In contrast, we present a novel domain-calibrated translation framework that adapts well to the original training distribution.
Given a small set of target stylistic exemplars, our goal is to learn a function that maps images from the source domain to the target domain. The output image should be rendered in a texture style similar to that of the target exemplar, while preserving the content details (e.g., structure and identity) of the source image.
An overview of the proposed framework is shown in Figure 3. We build a sequential pipeline with the following three modules: the content calibration network (CCN), the geometry expansion module (GEM), and the texture translation network (TTN). The first module is responsible for calibrating the target distribution in the content dimension by adapting the target style from a pre-trained source generator with transfer learning. The second module further expands the geometry dimension of both source and target distributions, and provides geometry-symmetry features with different scales and rotations for the later translation. With data sampled from the calibrated distribution, our texture translation network is employed to learn cross-domain correspondences with multi-representation constraints and the local perception loss. CCN and TTN are trained independently, and only TTN is used for the final inference. In the following, we will give a detailed description for each module of our framework.
In this module, we calibrate the biased distribution of a few target samples by transferring network parameters learned from sufficient examples. Different from previous works (Pinkney and Adler, 2020; Richardson et al., 2021; Song et al., 2021) that combine StyleGAN2 (Karras et al., 2020) with inversion methods for image translation, we leverage the powerful prior from the pre-trained StyleGAN2 to reconstruct the target domain with enhanced content symmetry. Starting from a StyleGAN2-based source generator trained on real faces (e.g., the FFHQ dataset), a copy of this generator is used as the initialization, and we adapt the copy to generate images in the target domain. During the training phase of CCN, we fine-tune the adapted generator with a discriminator to ensure that its outputs fall in the target domain, and with an existing face recognition model (Deng et al., 2019) to preserve the person identity between the outputs of the adapted and source generators. During the inference phase of CCN, we blend the first layers of the adapted generator with the corresponding layers of the source generator, which has been proven to be effective in preserving more contents of the original source domain (Pinkney and Adler, 2020). In this way, we can produce relatively content-symmetric image pairs in the source and target domains. The flowchart is displayed in Figure 4.
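The layer blending used at CCN inference can be illustrated with a minimal sketch. It assumes each generator is represented as a list of per-layer weight dictionaries; the function name `blend_layers`, the parameter `k` (number of blended layers), and the interpolation factor `alpha` are hypothetical names introduced here, not taken from the paper. With `alpha = 1.0` the sketch reduces to plain layer swapping in the spirit of (Pinkney and Adler, 2020); whether the authors swap or interpolate is not stated in this text.

```python
from typing import Dict, List

LayerWeights = Dict[str, List[float]]


def blend_layers(
    source_layers: List[LayerWeights],
    adapted_layers: List[LayerWeights],
    k: int,
    alpha: float = 1.0,
) -> List[LayerWeights]:
    """Blend the first k layers of the adapted generator with the source generator.

    alpha = 1.0 copies the source weights outright (layer swapping);
    smaller values interpolate linearly between the two generators.
    Both k and alpha are illustrative assumptions.
    """
    if len(source_layers) != len(adapted_layers):
        raise ValueError("both generators must have the same number of layers")
    if not 0 <= k <= len(source_layers):
        raise ValueError("k must lie between 0 and the number of layers")

    blended: List[LayerWeights] = []
    for index, (src, tgt) in enumerate(zip(source_layers, adapted_layers)):
        if index < k:
            # Early (coarse) layers: pull weights toward the source generator.
            layer = {
                name: [alpha * s + (1.0 - alpha) * t
                       for s, t in zip(src[name], tgt[name])]
                for name in tgt
            }
        else:
            # Later layers: keep the adapted (stylized) weights unchanged.
            layer = {name: list(values) for name, values in tgt.items()}
        blended.append(layer)
    return blended
```

Feeding the same latent code to the source generator and to the blended generator would then yield one content-symmetric photo/stylized pair.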
It is worth noting that we directly sample from the latent space and reconstruct the source and target domains in a content-symmetric way (i.e., the same latent code is used for the two decoding pathways). No real faces are used, since they would require inversion embedding and lead to accumulated errors. Owing to the abundance of real-world photos, the reconstructed source distribution closely approximates the real source distribution. Thus, the calibrated target distribution is relatively symmetric with the real source distribution, making it easier to learn cross-domain correspondences between the source and target in the later stage. In contrast, previous methods (Pinkney and Adler, 2020; Richardson et al., 2021; Song et al., 2021) typically combine StyleGAN2 with inversion methods (Abdal et al., 2020; Tov et al., 2021), mapping source images to a latent space of StyleGAN2 and leveraging this unconditional generator to synthesize the corresponding results. However, it is hard to ensure that arbitrary portraits (i.e., out-of-domain images) can be embedded in the low-dimensional space or the style-disentangled space, due to the “distortion-editability trade-off” illustrated in (Tov et al., 2021; Roich et al., 2021). This inversion process causes additional loss of identity and structure details in image translation tasks.
The previous module uses the source distribution as the ground-truth distribution to calibrate the target distribution. However, all images in the source domain (FFHQ) have been aligned to the standard facial position, which makes the network rely heavily on positional semantics for synthesis and further limits its capability to process real-world images. To release these constraints and support the full-image inference described in Section 3.5, we apply a geometry transformation to both source samples and target samples, thus producing geometry-expanded samples. The transformation is performed with a random scale ratio and a random rotation angle.
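A minimal sketch of such a geometry transformation is given below, assuming a similarity transform (uniform scale plus rotation) about the image center. The sampling ranges are not given in this text, so `scale_range` and `angle_range` are left as caller-supplied parameters, and all function names are hypothetical. The sketch only builds the 2x3 affine matrix and applies it to points; warping an actual image would use whatever image library is at hand.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]
Point = Tuple[float, float]


def similarity_matrix(scale: float, theta: float, center: Point) -> Matrix:
    """2x3 affine matrix that scales and rotates about `center`."""
    cx, cy = center
    a = scale * math.cos(theta)
    b = scale * math.sin(theta)
    # p' = s * R(theta) * (p - c) + c, expanded into matrix form.
    return [
        [a, -b, cx - a * cx + b * cy],
        [b, a, cy - b * cx - a * cy],
    ]


def random_similarity(
    scale_range: Tuple[float, float],
    angle_range: Tuple[float, float],
    center: Point,
    rng: random.Random,
) -> Matrix:
    """Sample a random scale ratio and rotation angle (radians)."""
    scale = rng.uniform(*scale_range)
    theta = rng.uniform(*angle_range)
    return similarity_matrix(scale, theta, center)


def apply_affine(matrix: Matrix, point: Point) -> Point:
    """Map a single (x, y) point through the 2x3 affine matrix."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )
```

For example, `random_similarity((0.8, 1.2), (-0.5, 0.5), (128.0, 128.0), random.Random(0))` draws one transform; these particular ranges are placeholders chosen for illustration only.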
The texture translation network (TTN) aims to learn cross-domain correspondences between the calibrated domains in an unsupervised way. Although the first module can produce roughly aligned pairs from sampled noise, it fails to preserve content details due to its nature of global mapping, and it cannot handle arbitrary real faces without incurring an additional inversion error. Since the two reconstructed domains contain sufficient texture information but the texture mapping between them is inaccurate, we introduce a mapping network with the U-net architecture (Ronneberger et al., 2015) to convert the global domain mapping into a local texture transformation, thus learning a fine-grained texture translation at the pixel level.
Because sufficient source photos are available, the reconstructed source distribution closely approximates the original source distribution. Thus, we directly use real source photos (followed by geometry expansion) and calibrated target samples (after content and geometry calibration) for symmetric translation. In this process, the symmetric features are converted from the image level to the domain level. It is also worth noting that the proposed TTN is trained in an unsupervised way with unpaired images. Even when using real images as inputs, no inversion method is required to produce corresponding stylized samples. The style image is randomly sampled and is used only to provide the style representation rather than the ground truth, in order to avoid local optima. It should be pointed out that we simply use the same sample for all modules in Figure 3 to keep the illustration of our method concise and intuitive.
Inspired by the representation decomposition in (Wang and Yu, 2020), we extract the style representation from generated images and target exemplars via texture and surface decompositions, and use a discriminator to guide the translation network to synthesize results in a style similar to that of the exemplars. The style loss is computed by penalizing the distance between the style representation distributions of real stylized images and generated images.
The pre-trained VGG16 network (Simonyan and Zisserman, 2014) is used to extract the content representations from source images and generated images to ensure content consistency. The content loss is formulated as the L1 distance between the source image and the generated image in the VGG feature space.
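Written out, this description corresponds to a loss of the following form. The symbols are assumptions introduced here for illustration: $x_s$ denotes a source image, $G$ the texture translation network, and $\phi$ the pre-trained VGG16 feature extractor. Any normalization constant the authors may use is omitted.

```latex
\mathcal{L}_{con} = \left\lVert \phi(x_s) - \phi\big(G(x_s)\big) \right\rVert_{1}
```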
To further encourage the network to produce stylized portraits with exaggerated structure deformations (such as a simplified mouth and big delicate eyes), an auxiliary expression regressor is introduced to guide the synthesis process. In other words, we implicitly encourage local structure deformations by constraining the facial expression of synthetic images via the regressor, which pays more attention to the regions of facial components (e.g., mouth and eyes). Specifically, the regressor consists of regression heads on top of a feature extractor, with one head per expression parameter. Both the expression regressor and the style discriminator follow the PatchGAN architecture (Isola et al., 2017). To achieve a faster training procedure, we directly apply the learned regressor to estimate the expression scores of generated images. The facial perception loss penalizes the difference between these estimated scores and the expression parameters of the source image.
Here, the reference expression parameters are extracted from the source image. We use three expression parameters, defined as the opening degrees of the left eye, the right eye, and the mouth, respectively. With the facial points extracted from the source image, these parameters can be easily obtained by calculating the height-to-width ratio of the bounding box of the specific facial components.
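The height-to-width computation can be sketched as follows, assuming each facial component is given as a list of (x, y) landmark points. The function names and the landmark grouping are hypothetical; the landmark detector itself is outside the scope of the sketch.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def opening_degree(points: Iterable[Point]) -> float:
    """Height-to-width ratio of the bounding box around one facial component."""
    pts = list(points)
    if not pts:
        raise ValueError("at least one landmark point is required")
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    if width <= 0:
        raise ValueError("bounding box has zero width")
    return height / width


def expression_parameters(
    left_eye: Iterable[Point],
    right_eye: Iterable[Point],
    mouth: Iterable[Point],
) -> Tuple[float, float, float]:
    """Opening degrees of the left eye, the right eye, and the mouth."""
    return (
        opening_degree(left_eye),
        opening_degree(right_eye),
        opening_degree(mouth),
    )
```

A fully closed eye whose landmarks lie on a horizontal line gives an opening degree of 0, and the value grows as the component opens vertically.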
With samples drawn from the calibrated source and target domains, the texture translation model is trained with the full loss function consisting of a style term, a content term, a facial perception term, and a total-variation term. Each term is multiplied by its own weight. The total-variation loss is used to smooth the generated image and is computed from the image gradients along the horizontal and vertical directions.
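As a rough illustration of how the terms combine, the sketch below computes a total-variation penalty on a single-channel image stored as nested lists and forms a weighted sum of named loss terms. It uses the common anisotropic form of total variation (mean absolute difference between neighboring pixels), which may differ from the authors' exact definition; the term names and weights are placeholders, not values from the paper.

```python
from typing import Dict, List

Image = List[List[float]]  # single-channel image, H rows by W columns


def total_variation(image: Image) -> float:
    """Anisotropic total variation, normalized by the number of pixels."""
    height = len(image)
    if height == 0 or len(image[0]) == 0:
        raise ValueError("image must be non-empty")
    width = len(image[0])
    # Absolute differences between horizontally adjacent pixels.
    horizontal = sum(
        abs(image[y][x + 1] - image[y][x])
        for y in range(height)
        for x in range(width - 1)
    )
    # Absolute differences between vertically adjacent pixels.
    vertical = sum(
        abs(image[y + 1][x] - image[y][x])
        for y in range(height - 1)
        for x in range(width)
    )
    return (horizontal + vertical) / (height * width)


def full_loss(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of loss terms; every term needs a matching weight."""
    return sum(weights[name] * value for name, value in terms.items())
```

In a training loop, the four terms (style, content, facial perception, total variation) would be computed per batch and passed to `full_loss` together with their weights.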
Different from previous works (Kim et al., 2020; Song et al., 2021) that are limited to aligned face stylization, our model enables full-image rendering for arbitrary portrait images containing multiple faces at different rotations. A common practice to achieve this goal is to process faces and background independently. As described in Figure 6, such pipelines first extract aligned faces from the input image and stylize all the faces one by one. Then, the background image is rendered with specialized algorithms and merged with the stylized faces to obtain the final result. Instead of using such a complex pipeline, we found that our texture translation network can directly render stylized results from full images in a one-pass evaluation. With domain-calibrated images, the network sees the entire texture contents during training, so it implicitly encodes the contextual information of the background as well as the facial appearance. Combined with the geometry expansion module, it is robust to scale and rotation when processing raw, unaligned faces. Since a range of scale ratios is adopted in GEM, input images are all resized to scales that can be satisfactorily handled. We experimentally found that images below a certain resolution can be handled well, with no blur appearing in our synthesized images.
Regarding the training process of our overall network, we first train CCN with the loss function described in the supplemental material, and use its inference phase to calibrate contents. Then, GEM is directly adopted for geometric calibration without training. Finally, with calibrated domains, TTN is trained with the loss function introduced in Section 3.4. Specifically, for CCN, the weights of the generator and the discriminator are initially retrieved from the StyleGAN2 config-f FFHQ model and fine-tuned following (Karras et al., 2020). Content-calibrated samples are shuffled with the raw style exemplars and processed by the geometry transformation to obtain the final calibrated samples. The number of generated style samples is 10,000. For the training process of TTN, we use 10,000 images from FFHQ processed by the geometry transformation as calibrated source photos, and mixed data consisting of real images (100) and generated samples (10,000) as target exemplars. The content representation is extracted from a layer of the pre-trained VGG16 model. The expression regressor is trained with labeled attributes, which are computed by combining existing face landmark detectors (Zhang et al., 2016; nagadomi). With the learned regressor, we adopt the Adam optimizer (Kingma and Ba, 2014) to train the TTN model for around 10k iterations. The training flow and the hyper-parameters involved (including the number of blended layers, the optimizer settings, the learning rate, and the loss weights) are the same for all styles.
For source photos, we use 10,000 images from the FFHQ dataset (Karras et al., 2019) as the training data. For target exemplars, we collect several art portrait assets (e.g., 3D cartoon, hand-drawn, barbie, comic, etc.) from the Internet, and each asset contains approximately 100 images in a similar style. Only the anime style asset is created by artists; the other assets are randomly downloaded from websites. For the evaluation, we use the first 5,000 images of the CelebA dataset (Lee et al., 2020) for testing.
The ability to synthesize high-preserving contents. Besides test cases in CelebA, we also validate the capability of our model by stylizing wild portrait images collected from the Internet. As shown in Figure 7, not only the global structures between the input and the output are consistent, but also the local details such as accessories, background, and identity are highly preserved.
The generality to handle complicated scenes. To verify the strong generality of our model in complex real-world scenes, we test it on hard cases that contain heavy occlusions (Figure 7 (c, d, e)) and rare poses (Figure 7 (b)). Our method shows high robustness in these cases. We also provide more results of our method with diverse inputs (e.g., different skin tones) in Figure 13.
The scalability to transfer various styles. With limited exemplars of a new style, this unified framework can be directly used to train a new style model. We show stylized results (e.g., 3D cartoon, hand-drawn, anime) produced by different style models in Figure 7 (a) and more results are provided in supplemental materials (Supp).
We compare the synthetic results of our method with six state-of-the-art head cartoonization methods, which can be categorized into two types: a) image-to-image translation based methods: CycleGAN (Zhu et al., 2017) and U-GAT-IT (Kim et al., 2020); b) StyleGAN-adaption based methods: Toonify (Pinkney and Adler, 2020), PSP (Richardson et al., 2021), AgileGAN (Song et al., 2021), and Few-shot-Ada (Ojha et al., 2021). For the first four methods, the results are produced by directly using source codes or trained models released by the authors. For AgileGAN, since its code and trained model are not publicly available, we directly evaluate our method using the examples officially provided by its authors. Few-shot-Ada is an unconditional generative model and can only synthesize <photo, cartoon> pairs, with the same random noise fed into its source and adapted generators respectively. We therefore use the inversion algorithm in (Karras et al., 2020) to project real faces to the latent space and use the adapted generator to produce stylized results for arbitrary images (Figure 9 (c)). Considering the inversion error, we also test Few-shot-Ada in the noise manner and use its synthesized images as arbitrary inputs for our method (Figure 9 (e, f, g)). As we can see, our method still outperforms it with more content details. Compared with other approaches, our method produces more realistic results in both content similarity and style faithfulness. The facial identity is better preserved, and even detailed accessories or extra body parts are successfully synthesized. More comparison results can be found in Supp.
We evaluate the quality of our results using the Frechet Inception Distance (FID) metric (Heusel et al., 2017), which is a common metric to measure the visual similarity and distribution discrepancy between two sets of images. We generate stylized images from the CelebA dataset for each method, and compute their FID value with respect to the training cartoon dataset. To further evaluate the identity similarity (ID) between generated and source images, we extract identity vectors using a pre-trained face recognition model (Wang et al., 2018a) and adopt the normalized cosine distance to measure the similarity. As shown in Table 1, our method generates not only more realistic details with the lowest FID value, but also more similar identity with the highest ID value.
| Method | FID | ID | Pref. A (%) | Pref. B (%) |
|---|---|---|---|---|
| CycleGAN (Zhu et al., 2017) | 57.08 | 0.55 | 7.1 | 1.4 |
| U-GAT-IT (Kim et al., 2020) | 68.40 | 0.58 | 5.0 | 1.5 |
| Toonify (Pinkney and Adler, 2020) | 55.27 | 0.62 | 3.7 | 4.2 |
| PSP (Richardson et al., 2021) | 69.38 | 0.60 | 1.6 | 2.5 |
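The identity-similarity (ID) score can be sketched as the mean cosine similarity between identity vectors of source and generated images. The sketch assumes the vectors have already been extracted by a face recognition model and reads "normalized" as normalizing each vector to unit length, which is one plausible interpretation; the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two identity vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("identity vectors must be non-zero")
    return dot / (norm_a * norm_b)


def mean_identity_score(
    source_vectors: Sequence[Sequence[float]],
    generated_vectors: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over paired source/generated images."""
    if not source_vectors or len(source_vectors) != len(generated_vectors):
        raise ValueError("need equally sized, non-empty lists of vectors")
    scores = [
        cosine_similarity(s, g)
        for s, g in zip(source_vectors, generated_vectors)
    ]
    return sum(scores) / len(scores)
```

Higher values indicate that the stylized output stays closer to the identity of the input photo.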
As portrait stylization is often regarded as a subjective task, we resort to user studies to better evaluate the performance of the proposed method. We conducted two user studies on the results in terms of the stylization effects and faithfulness to content characteristics. In the first study, participants were asked to select the best stylized images with the fewest distortion artifacts (Pref. A). In the second study, participants were asked to point out which stylized images best preserve the corresponding contents (Pref. B). Each participant was shown 25 questions randomly selected from a question pool containing 100 examples for each study. In each question, we show an input source followed by four stylized results of competing methods and ours, with the images arranged in a random order. We received 1,000 answers from 40 subjects in total for each study. As shown in Table 1, over 80% of our results are selected as the best on both metrics, which indicates a significant quality boost in stylization effects and faithful transfer obtained by our approach.
To verify the effectiveness of the proposed CCN, GEM, and TTN, we evaluate the performance of several variants of our method by removing each module independently. The qualitative and quantitative results are shown in Figure 10 and Table 1, respectively. Our method w/o CCN can easily suffer from texture artifacts because of overfitting. CCN brings better generalization ability for this transfer task since it improves the diversity of target samples and calibrates the target distribution closer to the original source distribution. GEM makes our model more robust to face alignment errors and enables full-image translation under unconstrained spatial conditions. It is also necessary for the application of full-body image stylization in Section 4.6. For our method w/o TTN, we use the inversion method in (Karras et al., 2020) to project real faces to latent codes and use the adapted generator of CCN to produce stylized results (Figure 10 (d)). Results of our method w/o TTN suffer from the content missing problem, especially for arbitrary out-of-domain real faces. This stems not only from the GAN inversion error but also from the function change in the domain adaption process. To show this, we present samples produced by CCN with random noise in Figure 10 (f), where the issue is alleviated but still exists without the inversion process. This is also an inherent problem of all StyleGAN-adaption based methods (Song et al., 2021; Ojha et al., 2021; Richardson et al., 2021). We tackle this problem with TTN, and the results are shown in Figure 10 (e, f). TTN significantly improves the network’s ability to preserve content and makes it capable of stylizing arbitrary real photos with identities closer to the original ones.
Due to the translation network’s strong ability to preserve content, it is difficult to achieve extremely exaggerated deformations, such as simplified noses and mouths in the anime style (see Figure 11 (c)). In fact, there is a trade-off between content similarity and style faithfulness. Here, we introduce the facial perception loss to encourage large structure changes for local components (e.g., eyes, nose, and mouth) and unchanged structures for other components, thus achieving adaptive deformation for different parts (see Figure 11 (d)). It is worth noting that the facial perception loss is designed exclusively for extremely exaggerated styles; the proposed method can still produce satisfactory results for styles without such deformations when this loss is omitted.
Given training samples observed only in the head region, we find that our model can also achieve full-body image translation in one evaluation with a single network. We show full-body results with various styles and some random cases in Figure 12. As we can see, the proposed method works well for arbitrary images, producing harmonious tones and adaptive deformations (e.g., exaggerated eyes and a faithful body). More synthesis results and some failure cases of our method can be found in Supp.
We presented DCT-Net, a novel framework for stylized portrait generation, which not only boosts ability, generality, and scalability for the head stylization task, but also achieves effective full-body image translation in an elegant manner. Our key idea is to calibrate the biased target domain first, and learn a fine-grained translation later. Specifically, the content calibration network was introduced for diverse textures and the geometry expansion module was designed to release spatial constraints. With calibrated samples produced by these two modules, our texture translation network easily learns cross-domain correspondences with delicately designed losses. Experimental results demonstrated the superiority and effectiveness of our method. We also believe that our solution of domain-calibrated translation could inspire future investigations on image-to-image translation tasks with biased target distributions.
Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4432–4441.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305.
IEEE Transactions on Neural Networks and Learning Systems 30 (7), pp. 1967–1974.
Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition.