Everyone is a Cartoonist: Selfie Cartoonization with Attentive Adversarial Networks

by   Xinyu Li, et al.
JD.com, Inc.

Selfie and cartoon are two popular artistic forms that are widely presented in our daily life. Despite the great progress in image translation/stylization, few techniques focus specifically on selfie cartoonization, since cartoon images usually contain artistic abstraction (e.g., large smoothing areas) and exaggeration (e.g., large/delicate eyebrows). In this paper, we address this problem by proposing a selfie cartoonization Generative Adversarial Network (scGAN), which mainly uses an attentive adversarial network (AAN) to emphasize specific facial regions and ignore low-level details. More specifically, we first design a cycle-like architecture to enable training with unpaired data. Then we design three losses from different aspects. A total variation loss is used to highlight important edges and contents in cartoon portraits. An attentive cycle loss is added to lay more emphasis on delicate facial areas such as eyes. In addition, a perceptual loss is included to eliminate artifacts and improve robustness of our method. Experimental results show that our method is capable of generating different cartoon styles and outperforms a number of state-of-the-art methods.


1 Introduction

Selfie cartoonization as an artistic form is in great demand in our daily life. Its most common use is as a profile picture in social networks, where it catches attention in a humorous way while protecting individual privacy. In addition, cartoon portraits are widely used in online role-playing games, artistic poster designs and so on. However, as shown in Fig. 1, manually drawing a cartoon portrait is very laborious and requires substantial artistic skill, even with photo editing software. Thus, how to make selfie cartoonization efficient and high-quality is an important question.

Figure 1: This figure presents some cartoon portrait artworks in different styles drawn by painters. It usually takes 2 to 3 days for such a complex manual creative process.

Existing methods attempt to generate cartoon portraits in various painting styles. Traditional image processing methods based on sketch extraction [1], with little postprocessing of colors or shapes, have been widely applied in smartphone software. These approaches often require dedicated algorithms for specific styles, and the quality of their synthetic results is far from satisfactory at a fine-grained level. The recent emergence of deep convolutional neural networks [2] has provided some attractive solutions for domain transfer. Neural Style Transfer (NST) [3] is one of them: it transfers the artistic style of one image to a target image while keeping the content of the target image. Since NST is designed for general cases, it lacks the ability to focus on specific areas, which tasks such as cartoonization require. Another family of methods is based on Generative Adversarial Networks (GAN) [4], which perform domain transfer in an adversarial manner. Some image-to-image translation methods (e.g., pix2pix [5], BicycleGAN [6]) map an image from one domain to another. However, these methods require paired images, which are difficult to obtain for many tasks. Thanks to a series of unsupervised domain transfer frameworks (e.g., CycleGAN [7], UNIT [8]), we are able to train a model with unpaired data. Some existing methods (e.g., CartoonGAN [9], DA-GAN [10]) already perform cartoonization based on an unsupervised GAN framework, but they usually fail to capture delicate facial parts or to generate pleasing results.

From our observation and analysis, we identify three challenges in producing cartoon portraits of acceptable quality. First, there are no public paired datasets of human selfies and cartoon portraits. Second, it is unclear how to preserve a cartoon style, which consists of abstract and simplified textural features. Last, it is difficult to generate delicate facial features when performing style transfer for portraits.

In this paper, we propose scGAN, a dedicated GAN designed for the selfie cartoonization task. To address the above challenges, we first apply a cycle consistency loss to enable training with unpaired data. We then propose three simple yet effective loss functions, which help generate cartoon portraits from the overall outline down to local details while keeping the artistic form of the cartoon domain. The main contributions of this paper are as follows:

  • We present scGAN, a novel selfie cartoonization method based on GAN, that can learn the mapping from real-world selfies to cartoon portraits in different styles.

  • We propose to use an attention mechanism for the cartoon portrait generation task based on domain knowledge. It has two important aspects. On a pixel level, we introduce a total variation loss, which forces the networks to highlight only important edge and content information. On a regional level, we propose an attentive cycle loss acting on target regions to generate more detailed facial features. In addition, we apply a perceptual loss to improve object preservation and convergence during training.

  • We construct and release a new dataset for the cross-domain image generation task, which contains a large number of high-quality cartoon portraits in different styles. It could help painters with selfie cartoonization through deep learning methods.

2 Related Work

2.1 Traditional Image Processing Methods

Traditional methods are mostly based on oversimplified mathematical models, such as [11, 12]. These methods often apply face parsing to segment out each facial component, then use non-photorealistic rendering [13] or simple filtering to obtain cartoon images. Various image cartoonization apps for mobile phones, such as MomanCamera and Cartoon Camera Photo Editor, are built on these methods. Although these apps achieve real-time processing, they fail to generate detailed facial parts.

2.2 Deep Learning Methods

NST has received considerable attention in artistic creation. For our task, however, selfies usually contain multiple complex facial structures, and generating cartoon portraits with NST is problematic because it separates style information from content information. GAN is another popular method for domain transfer. Pix2pix [5] solves image-to-image translation problems but needs strictly paired data, and such a dataset is difficult to obtain for our selfie cartoonization task. To break this fundamental limitation, CycleGAN [7] enforces a one-to-one mapping with the help of a cycle consistency loss. Building on this term, UNIT [8] makes a shared-latent space assumption and proposes an unsupervised image translation framework. The more recent CartoonGAN [9] performs well in generating high-quality cartoon images from real-world photos. However, all the above methods fail to generate satisfactory cartoon portraits because of two difficulties: (1) how to synthesize facial features in as much detail as possible, and (2) how to keep the artistic characteristics (e.g., smooth areas and clear contents) of cartoon images. In our comparisons, our method outperforms these state-of-the-art methods.

3 Proposed Method

Formally, the selfie cartoonization problem is formulated as learning a mapping function from domain $X$, formed by real selfies, to domain $Y$, formed by cartoon portraits. The mapping function is learned using training data $\{x_i\}_{i=1}^{N}$ in domain $X$ and $\{y_j\}_{j=1}^{M}$ in domain $Y$, where $N$ and $M$ represent the numbers of images in the two domains. We denote the data distributions as $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$. As in the classic GAN, we design the generative function $G$ to produce vivid images that confuse the discriminative function $D$, while the optimization of $D$ aims to distinguish the real cartoon portraits in domain $Y$ from the fake ones generated by $G$. Let $L$ be the loss function and $(G^*, D^*)$ be the optimal weights of the networks; we then aim to solve the min-max problem:

$$(G^*, D^*) = \arg\min_{G} \max_{D} L(G, D) \tag{1}$$


3.1 Network Architecture

Figure 2: The framework of our scGAN. Besides the adversarial loss, we also propose three adaptations based on the attention mechanism, namely the total variation, attentive cycle and perceptual losses. For simplicity, we only present one direction of the transformation: selfie domain → cartoon domain. We omit the other direction in this figure: cartoon domain → selfie domain.

Our network architecture is shown in Fig. 2. Generator $G$ first translates a selfie $x$ into a hand-painted cartoon image $G(x)$, then generator $F$ brings $G(x)$ back to the original selfie. The other direction, from the cartoon domain to the selfie domain, is similar; for simplicity, we omit it in Fig. 2. The generator ($G$) has an encoder-decoder structure. Considering the texture features of cartoon portraits and the size of our datasets, we apply a U-Net [14] based architecture. As shown in pix2pix, U-Net, which is widely used in medical image segmentation, can retain prominent edges such as the eyelid. Moreover, to guarantee that the learned function maps an individual input $x$ to a desired output $y$, we reconstruct the input image with the same encoder-decoder structure ($F$). To judge whether a generated result is a cartoon-style image, we use a simple patch-level discriminator ($D_Y$) [15], and we use an identical discriminator ($D_X$) in the other direction.

3.2 Adversarial Loss

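We apply the adversarial loss [4] to both networks $G$ and $F$. For the mapping function $G: X \rightarrow Y$ and its discriminator $D_Y$, we define the loss function as:

$$L_{adv}(G, D_Y) = \mathbb{E}_{y \sim p_{data}(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim p_{data}(x)}\big[\log\big(1 - D_Y(G(x))\big)\big] \tag{2}$$

where the generative function $G$ tries to produce hand-painted cartoon portraits $G(x)$ that look the same as images from domain $Y$, while $D_Y$ aims to distinguish between synthetic samples $G(x)$ and real samples $y$. The mapping function $F: Y \rightarrow X$ and its discriminator $D_X$ are defined analogously.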
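As a minimal sketch of how this adversarial objective behaves, the snippet below evaluates the standard GAN value function of [4] on a small batch of discriminator outputs. The function name `adversarial_loss`, the list-based inputs and the `eps` stabilizer are illustrative assumptions; a real implementation would operate on PyTorch tensors.

```python
import math


def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Standard GAN value: mean log D(y) + mean log(1 - D(G(x))).

    d_real: discriminator probabilities on real cartoon portraits.
    d_fake: discriminator probabilities on generated portraits.
    eps is a small constant (an assumption here) that avoids log(0).
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well attains a high value.
confident = adversarial_loss([0.9, 0.8], [0.1, 0.2])
# A fully fooled discriminator outputs 0.5 everywhere: value is 2 * log(0.5).
fooled = adversarial_loss([0.5, 0.5], [0.5, 0.5])
```

The discriminator is trained to increase this value while the generator is trained to decrease it, which is why the fooled case yields the lower number.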

3.3 Attentive Adversarial Network

Besides the general adversarial loss, our scGAN applies an attention mechanism to the cartoonization task, as described in the following parts.

Region Level: Attentive Cycle Loss. As discussed in CycleGAN, a cycle consistency loss removes the reliance on paired data in domain transfer. On this basis, we propose an attentive cycle loss that guides the generator to place more emphasis on detailed facial features, such as slender eyelashes and pupils. In other words, the loss concentrates on the regions that attract the most attention in a portrait.

Specifically, given a real selfie $x$, we first apply a face parsing algorithm, which can be a simple detector, to locate each facial component, and then obtain a set of attentive region locations:

$$\{r_k\}_{k=1}^{K} = P(x) \tag{3}$$

where $P$ is a trained detector for the facial components, $K$ denotes the number of attentive regions, and $r_k$ stands for the bounding box of the $k$-th region. In general, the more attentive parts we choose, the better the results we get. We use $x_{r_k}$ to denote the region of image $x$ that lies inside $r_k$. The attentive cycle loss is then defined as:

$$L_{acyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\Big[\sum_{k=1}^{K} \lambda_k \, \big\| F(G(x))_{r_k} - x_{r_k} \big\|_1\Big] \tag{4}$$


$\lambda_k$ controls the relative importance of each facial component. A larger $\lambda_k$ drives the reconstruction to preserve more information in that region and therefore yields more detailed hand-painted cartoon portraits. In our experiments, the entries of $\lambda$ are the weights for the [whole image, eyes, nose, mouth] regions, respectively. For the transformation from domain $Y$ to $X$, we adopt only a standard cycle consistency loss:

$$L_{cyc}(F, G) = \mathbb{E}_{y \sim p_{data}(y)}\big[\big\| G(F(y)) - y \big\|_1\big] \tag{5}$$
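The region-weighted cycle loss described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: images are small grayscale nested lists, bounding boxes are `(top, left, bottom, right)` tuples with exclusive ends, the difference is a mean absolute (L1) error, and the weights `1.0` and `2.0` are hypothetical values chosen only for the example.

```python
def region_l1(a, b, box):
    """Mean absolute difference between images a and b inside a bounding box."""
    top, left, bottom, right = box
    total = 0.0
    count = 0
    for i in range(top, bottom):
        for j in range(left, right):
            total += abs(a[i][j] - b[i][j])
            count += 1
    return total / count


def attentive_cycle_loss(x, x_rec, boxes, weights):
    """Weighted sum of per-region L1 errors between x and its reconstruction."""
    return sum(w * region_l1(x, x_rec, box) for box, w in zip(boxes, weights))


# 4x4 selfie and its cycle reconstruction, with one error inside the "eye" box.
x = [[0.0] * 4 for _ in range(4)]
x_rec = [[0.0] * 4 for _ in range(4)]
x_rec[1][1] = 0.8

whole = (0, 0, 4, 4)   # whole image as one region
eye = (1, 1, 2, 2)     # hypothetical detector output

# Standard cycle loss: only the whole image, so the error is averaged away.
plain = attentive_cycle_loss(x, x_rec, [whole], [1.0])
# Attentive cycle loss: the same error is penalized again inside the eye box.
attentive = attentive_cycle_loss(x, x_rec, [whole, eye], [1.0, 2.0])
```

With only the whole-image region the single-pixel error contributes 0.05, whereas adding the eye region raises the loss to 1.65, which illustrates why small but important facial parts receive a stronger training signal.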


Pixel Level: Total Variation Loss. Cartoon images usually have unique characteristics: a high level of simplification and a uniform color distribution. Based on this observation, we conduct a small experiment on the image gradient distribution of hand-painted cartoon portraits; Fig. 3 provides a demonstration. Visually, the gradient distribution maps of images drawn by painters follow a consistent pattern: gradient changes are not obvious in most regions, except along the edges of facial structures. In contrast, real selfies contain a complex set of unavoidable factors (e.g., illumination, wrinkles) that can severely interfere with the transformation process.

Figure 3: The gradient distribution maps. (a) is drawn by painter, (b) is generated without tv loss.

In this case, an important goal is to keep the edges and remove blurry factors. We therefore apply total variation as a loss function, which highlights only the important edge and content information in cartoon portraits. As discussed in [16], total variation plays an important role as a regularizer in traditional image processing. To obtain clean results, we define total variation as a constraint function:

$$L_{tv}(G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\big\| \nabla_h G(x) \big\|_1 + \big\| \nabla_v G(x) \big\|_1\big] \tag{6}$$

where $\nabla_h$ and $\nabla_v$ denote the horizontal and vertical image gradients. By minimizing the gradient of the synthetic hand-painted portraits in this way, we find that the results remove undesired details, such as the black regions on the cheek, while preserving important edges.
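To make the effect of a total variation penalty concrete, here is a minimal sketch assuming the common anisotropic form (sum of absolute horizontal and vertical neighbor differences) on a grayscale image stored as nested lists. The function name and the two toy images are assumptions for illustration only.

```python
def total_variation(img):
    """Sum of absolute differences between horizontal and vertical neighbors."""
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:
                tv += abs(img[i][j + 1] - img[i][j])  # horizontal gradient
            if i + 1 < h:
                tv += abs(img[i + 1][j] - img[i][j])  # vertical gradient
    return tv


# A cartoon-like patch: flat areas separated by one sharp edge.
flat_with_edge = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# The same patch with small fluctuations, as caused by lighting or wrinkles.
noisy = [
    [0.0, 0.2, 1.0, 0.9],
    [0.1, 0.0, 1.0, 1.0],
    [0.0, 0.1, 0.8, 1.0],
]

tv_clean = total_variation(flat_with_edge)
tv_noisy = total_variation(noisy)
```

The clean patch pays only for its single edge, while the noisy patch pays for the edge plus every small fluctuation, so minimizing this quantity favors flat regions with a few strong edges.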

Perceptual Loss. In many classical cases, a network trained with a per-pixel content loss (e.g., Mean Square Error) produces blurry artifacts in the generated results. In our method, rather than directly measuring differences between the input and output images using Euclidean distance, we apply a perceptual loss that measures differences between high-level feature maps extracted by a 19-layer VGG network pre-trained on the ImageNet dataset. As discussed in [17], perceptual loss is better able to maintain image content and overall spatial structure when used as a feature reconstruction loss. Compared with per-pixel differences, we find that it preserves objects better and accelerates convergence. Accordingly, we define this loss as:

$$L_{per}(G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\big\| \phi_l(G(x)) - \phi_l(x) \big\|_1\big] \tag{7}$$

where $\phi_l$ represents the feature maps of a specific VGG layer $l$. In our training, we choose the layer 'conv4_4' to compute this loss.
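The difference between a per-pixel loss and a feature-space loss can be illustrated with a toy sketch. The feature extractor below is a hypothetical stand-in (2x2 average pooling), not a VGG network, and the mean absolute difference is an assumed distance; in practice the features would come from a pre-trained VGG-19 layer.

```python
def toy_features(img):
    """Hypothetical stand-in for a deep feature map: 2x2 average pooling."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized 2D arrays."""
    diffs = [abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def perceptual_loss(x, y, features=toy_features):
    """Compare two images in feature space instead of pixel space."""
    return mean_abs_diff(features(x), features(y))


# Two fine textures that differ at every pixel but share coarse structure.
checker = [[(i + j) % 2 for j in range(4)] for i in range(4)]
inverse = [[1 - v for v in row] for row in checker]

pixel_loss = mean_abs_diff(checker, inverse)
feature_loss = perceptual_loss(checker, inverse)
```

Here the per-pixel loss is maximal while the feature-space loss is zero, which shows how a feature-based loss can tolerate pixel-level texture changes as long as the coarser structure is preserved.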

3.4 Full Objective Function

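Overall, our full loss function is:

$$L(G, F, D_X, D_Y) = L_{adv}(G, D_Y) + L_{adv}(F, D_X) + \omega_1 \big(L_{acyc}(G, F) + L_{cyc}(F, G)\big) + \omega_2 L_{tv}(G) + \omega_3 L_{per}(G) \tag{8}$$

where $L_{adv}$ is the adversarial loss, $L_{acyc}$ and $L_{cyc}$ are the attentive and standard cycle losses, $L_{tv}$ is the total variation loss, $L_{per}$ is the perceptual loss, and $\omega_1$, $\omega_2$ and $\omega_3$ are the weights that trade off the different losses.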

4 Experiments

We implemented our scGAN in PyTorch. All experiments were performed on an NVIDIA P40 GPU. For our method, we set the three trade-off weights in Equation 8 to 10, 2 and 0.5.

4.1 Data collection

The training data consists of real selfies and cartoon portraits, and the test data contains only real selfies. Real selfies are downloaded from Google Image Search using keywords (e.g., Women Portrait). After cropping the images to the region above the shoulders, we obtain 3,524 real selfies in total, of which 3,300 are used for training and the remaining 224 for testing. The cartoon portraits cover three different styles, as shown in Table 1. They are difficult to obtain because, due to the cost, there are no public datasets of human and cartoon portraits, so we download them from online painting stores. However, different painters have different artistic styles, and most cartoon artworks carry digital watermarks for copyright reasons. Taking these factors into consideration, we build the dataset with preprocessing steps such as filtering and cropping.

Style          Number of images
Hand-painted   850
Watercolor     730
Anime          3000
Table 1: The cartoon portraits dataset.

4.2 Comparison with state-of-the-arts

We present the comparative experiments between our method and representative stylization methods as follows:

  • Image Binarization is a traditional image processing method that produces black-and-white cartoon-style portraits.

  • NST [3] is a CNN-based stylization method. It transfers the artistic style of one image to a target image while keeping the content of the target image.

  • UNIT [8] is a recent unsupervised image-to-image translation method based on the shared-latent space assumption.

  • CartoonGAN [9] is a recent method for transforming photos of real-world scenes into cartoon-style images.

  • CycleGAN [7] is a well-known approach for learning to translate an image from a source domain to a target domain in the absence of paired examples.

Fig. 4 shows the visual results. Since paired ground truth is unavailable, we run a perceptual study on this transformation. Specifically, we randomly shuffle the methods and hide their names in Fig. 4, then gather statistics from 125 participants. Following established grading rules (e.g., presentation style, overall quality and detail description), participants grade the generated cartoon images on a 5-point scale, where a higher score represents greater overall satisfaction. Table 2 shows the resulting scores.

Figure 4: Results of comparing with different methods. (a) Input images, (b) Image binarization, (c) NST, (d) CartoonGAN, (e) UNIT, (f) CycleGAN, (g) Our results.

We can see that our proposed method succeeds in selfie cartoonization, from overall structures to detailed facial features. More specifically, image binarization, a basic method for image segmentation, shows only the rough facial shape. For NST, we select one style image at random because the hand-painted cartoon portraits differ little in style. It fails to learn the style well, let alone keep the delicate details of local regions. Although CartoonGAN is a GAN dedicated to cartoonization, its results are less than satisfactory in this task; we use the default parameters from the literature for its experiments. UNIT and CycleGAN perform slightly better and begin to capture the major facial lineaments at first glance, but in fine areas they still show a large gap compared with our proposed method. In comparison, our method generates correct colors, obtains sharper and more detailed textures, and outputs more satisfactory hand-painted cartoon portraits overall.

Method         1       2       3       4       5       Average
Binarization   20%     32.8%   32%     11.2%   4%      2.46
NST            56%     20%     16.8%   4.8%    2.4%    1.78
CartoonGAN     48%     34.4%   13.6%   1.6%    2.4%    1.76
UNIT           3.2%    19.2%   35.2%   38.4%   4%      3.21
CycleGAN       12%     24.8%   34.4%   18.4%   10.4%   2.90
Our method     3.2%    12%     21.6%   33.6%   29.6%   3.74
Table 2: Results of the 5-point perceptual study: the percentage of participants giving each score, and the average score per method. On average, our method is rated the most satisfactory.
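The average scores in Table 2 follow directly from the score distributions, as the short check below shows. The dictionary layout and the helper name `mean_score` are illustrative choices; the percentages are copied from the table.

```python
# Percentage of participants giving scores 1..5, copied from Table 2.
percentages = {
    "Binarization": [20.0, 32.8, 32.0, 11.2, 4.0],
    "NST": [56.0, 20.0, 16.8, 4.8, 2.4],
    "CartoonGAN": [48.0, 34.4, 13.6, 1.6, 2.4],
    "UNIT": [3.2, 19.2, 35.2, 38.4, 4.0],
    "CycleGAN": [12.0, 24.8, 34.4, 18.4, 10.4],
    "Our method": [3.2, 12.0, 21.6, 33.6, 29.6],
}


def mean_score(pcts):
    """Expected score on the 1..5 scale given the percentage for each score."""
    return sum(score * p for score, p in zip(range(1, 6), pcts)) / 100.0


means = {name: round(mean_score(p), 2) for name, p in percentages.items()}
best = max(means, key=means.get)
```

Each row sums to 100%, and the weighted means reproduce the last column of the table, with our method ranked first and UNIT second.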

4.3 Ablation Study

To verify the role of each part in our model, here we make ablation experiments to isolate the influence of each term.

Total Variation Loss. Fig. 5 illustrates how the tv loss influences the result. For input images (a) and (d), the results without tv loss are shown in (b) and (e), which contain noisy and inconsistent regions. With tv loss, we obtain results such as (c) and (f), which are much smoother and cleaner. The intuition is that the tv loss, as a regularizer, forces the network to highlight important edge information (e.g., delicate eyeliner) and remove noisy artifacts (e.g., yielding a clean cheek). To investigate further, we also analyze the change in the images' average gradient, as shown in Table 3, and observe that in these examples a lower average gradient corresponds to visually better results. In real-life applications, illumination and complex backgrounds are often unavoidable, so constraints such as the tv loss are necessary to obtain more satisfactory results.

Figure 5: Results of the ablation study for tv loss. Input images: (a) and (d). Without tv loss: (b) and (e). With tv loss: (c) and (f).

Index   w/o tv loss   w/ tv loss
(1)     36.5          30.9
(2)     42.8          36.8
Table 3: The average gradient of the generated results in Fig. 5.
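As a rough sketch of the measurement behind Table 3, the snippet below computes an average gradient magnitude with forward differences and then the relative reduction implied by the table values. The exact definition of "average gradient" is not given in the text, so the formula in `average_gradient` is an assumption; the numbers in `table3` are copied from the table.

```python
import math


def average_gradient(img):
    """Mean forward-difference gradient magnitude (assumed definition)."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += math.sqrt(dx * dx + dy * dy)
    return total / ((h - 1) * (w - 1))


# Sanity checks on tiny images: a flat image has zero average gradient.
flat = average_gradient([[0, 0], [0, 0]])
edge = average_gradient([[0, 3], [4, 0]])  # dx = 3, dy = 4, so magnitude 5

# Average gradients (without, with) tv loss, copied from Table 3.
table3 = {"(1)": (36.5, 30.9), "(2)": (42.8, 36.8)}
reductions = {k: round(100.0 * (a - b) / a, 1) for k, (a, b) in table3.items()}
```

For the two examples, the tv loss lowers the average gradient by roughly 15.3% and 14.0%, respectively.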

Attentive Cycle and Perceptual Loss. We also conduct experiments to study the impact of the attentive cycle loss and the perceptual loss. The experimental setup is described in Table 4, and example outputs are shown in Fig. 6. Using only the CycleGAN loss (Experiment A) does not give satisfactory results, as shown in the second column of Fig. 6. The model trained with our attentive cycle loss (Experiment B) produces higher-quality facial parts such as the eyes, nose and mouth. In Experiment C, adding the perceptual loss further improves the results by better preserving low-level features (e.g., the overall eyebrows and the color of the mouth).

Table 4: Ablation study setups.
Figure 6: Results of the ablation study. The first column shows the input selfies. In each row, the remaining columns show the cartoonization results of the three experiments: A, B and C.

4.4 Anime and Watercolor Style Generation

We also translate real selfies into other styles; the results are shown in Fig. 7 and Fig. 8.

Figure 7: Results of selfie cartoonization for anime style.
Figure 8: Results of selfie cartoonization for watercolor style.

5 Conclusion

In this paper, we propose scGAN, a novel method for selfie cartoonization with attentive adversarial networks. Based on domain knowledge, we propose to use an attention mechanism for cartoon domain generation. In addition, we construct and release a new dataset for the domain transformation task between selfies and cartoon portraits. In future work, we would like to investigate how to generate full-body cartoon images with high quality and efficiency. Handling videos in real time to support interesting applications is another promising research direction.


This work was partially funded by the National Natural Science Foundation of China (No. 61602463) and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR).


  • [1] J. Canny, “A computational approach to edge detection,” in Readings in Computer Vision, pp. 184–203. Elsevier, 1987.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [3] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
  • [4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [5] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv preprint, 2017.
  • [6] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman, “Toward multimodal image-to-image translation,” in Advances in Neural Information Processing Systems, 2017, pp. 465–476.
  • [7] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint, 2017.
  • [8] M. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems, 2017, pp. 700–708.
  • [9] Y. Chen, Y. Lai, and Y. Liu, “Cartoongan: Generative adversarial networks for photo cartoonization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9465–9474.
  • [10] S. Ma, J. Fu, C. Chen, and T. Mei, “Da-gan: Instance-level image translation by deep attention generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5657–5666.
  • [11] H. Li, G. Liu, and K. N. Ngan, “Guided face cartoon synthesis,” IEEE Transactions on Multimedia, vol. 13, no. 6, pp. 1230–1239, 2011.
  • [12] M. Yang, S. Lin, P. Luo, L. Lin, and H. Chao, “Semantics-driven portrait cartoon stylization,” in Image Processing (ICIP), 2010 17th IEEE International Conference on. IEEE, 2010, pp. 1805–1808.
  • [13] J. E. Kyprianidis, J. Collomosse, T. Wang, and T. Isenberg, “State of the ‘art’: A taxonomy of artistic stylization techniques for images and video,” IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 5, pp. 866–885, 2013.
  • [14] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
  • [15] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian generative adversarial networks,” in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
  • [16] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 5188–5196.
  • [17] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision. Springer, 2016, pp. 694–711.