Cartoon faces play a pivotal part in understanding and interacting with the virtual world [46, 5, 22]. Precisely recognizing cartoon characters is a critical prerequisite for many commercial applications, such as automatic cartoon face verification, computer-aided cartoon face generation, and cartoon movie recommendation. With the advent of large-scale face recognition (FR) datasets such as MS-Celeb-1M, VGGFace2, MegaFace, and WebFace260M, deep-learning-based models have achieved promising recognition and verification accuracy on human faces [43, 35, 17, 24, 41, 10]. While human face recognition technology is becoming increasingly sophisticated and mature, cartoon character recognition is still in its infancy. The performance gap stems mainly from the availability of very large manually labeled datasets for human faces, which are severely lacking for cartoon media.
To close the gap between FR accuracy on human faces and on cartoon faces, researchers have collected large-scale cartoon face datasets such as iCartoonFace and Danbooru. However, cartoon face recognition remains challenging because of two intrinsic characteristics [42, 6, 32]: (1) cartoon faces usually have smooth and sparse color regions, so texture information is quite limited; (2) the details of cartoon faces are mainly outlined by their structure/shape, which consists of sharp and clear edges. Thus the key to distinguishing different cartoon characters is to precisely spot the sparse and critical shape patterns. In biological vision, shape is arguably the most important cue for recognition. However, standard convolutional neural networks (CNNs) exhibit a “texture bias” [13, 19]: they lack the shape representations and processing capabilities that form the primary basis of human object classification.
Inspired by works [7, 11] that spot fine-grained details in the input images by training with jigsaw patches, we tackle cartoon face recognition by introducing an effective jigsaw-solving approach that learns the critical shape patterns that are discriminative for cartoon faces. As texture information is quite limited in cartoon faces, the classification network has to pay more attention to local critical shape patterns to solve the jigsaw. Specifically, to exploit the implicit relations between jigsaw fragments, we build a graph model to describe the topological layout of the permuted jigsaw, where each fragment is treated as a graph node. To further reason about and recover the correct layout of the jigsaw patches, we introduce a graph convolutional network (GCN) [21, 39] that propagates neighbor information in an iterative manner. By perceiving and aggregating contextual information, the recovery of the disordered jigsaw becomes more tractable in a self-supervised mode. Thus, the graph jigsaw model learns to understand local-to-holistic shape information and to capture and encode the discriminative shape patterns, so that it can identify each patch accurately and infer the affinities and relative placements of the patches.
It is worth noting that directly training the classification model with images that consist of shuffled patches would inevitably introduce noisy visual patterns that are harmful to the final classification. To tackle this issue, we introduce the jigsaw tasks at the intermediate stages of the classification network. A classification network such as ResNet-50 typically has several stages, each consisting of several groups of convolutional blocks. We thus construct one jigsaw puzzle by shuffling the input of each stage in the spatial dimension and then build one graph in the space of the shuffled features. To solve the jigsaw puzzle, we exploit the GCN to encode the adjacent correlations among the graph vertices by aggregating their neighbors iteratively. During this encoding process, GraphJigsaw captures the layout of the current disordered state and perceives the characteristics of the input graph. We then recover the correct layout of the permuted vertices from the encoded vertices in an adaptive decoding process. To force the model to solve the jigsaw by spotting shape information rather than low-level statistics (e.g., pixel intensity, color distribution, chromatic aberration), we self-supervise GraphJigsaw with the output of each stage in the classification network. We call the proposed method GraphJigsaw because we recover the correct layout of the jigsaw fragments in an adaptive graph encoding and decoding process. Our experiments show that this formulation of a jigsaw solver provides a much stronger model for cartoon face recognition.
Generally, CNNs tend to capture low-level features such as edges, corners, and color conjunctions in the early layers [45, 25]. As the shape patterns are quite sparse in cartoon faces, they are difficult for a standard CNN to perceive. As illustrated in the bottom row of Fig. 1, the standard CNN without GraphJigsaw fails to perceive the shape characteristics in the early stages (sub-figures (d), (e)) and shows only partially correct attention in the last stage (sub-figure (f)). To fully exploit the rich hierarchical information in the CNN and propagate the learned shape patterns gradually, we incorporate GraphJigsaw at various stages of the classification network in a top-down manner. As illustrated in the top row of Fig. 1, the model with GraphJigsaw at different stages is capable of focusing on the most discriminative regions in a progressive manner (sub-figures (a-c)). With the progressive GraphJigsaw, the valuable shape characteristics of the input cartoon faces are preserved in the early stages and gradually strengthened in the later stages. More visual examples can be found in Fig. 4 in Section IV-C.
In summary, the contributions of this work are as follows:
We propose GraphJigsaw, which incorporates the jigsaw puzzle into the classification network and solves the jigsaw with a graph convolutional network. GraphJigsaw learns to understand the local-to-holistic shape information of the input cartoon faces during the jigsaw solving process. GraphJigsaw is used only in the training phase and adds no computational cost at deployment.
GraphJigsaw does not rely on extra manual annotation and can be incorporated at various stages of the classification network, which facilitates gradual propagation of the learned shape patterns of the input cartoon faces. To the best of our knowledge, GraphJigsaw is the first work that constructs and solves the jigsaw puzzle on intermediate convolutional feature maps.
Extensive experiments on the two largest publicly available cartoon face datasets verify the superiority of the proposed GraphJigsaw. Our method can serve as a strong backbone to facilitate future research on cartoon face recognition.
II Related Work
We review the previous work considering two aspects that are related to ours, i.e., the similar tasks (cartoon face recognition) and related techniques (jigsaw puzzles solving).
Cartoon Face Recognition. In recent years, many studies have aimed to automatically generate a cartoon-style face from an input RGB image [6, 42] or sketch [40, 9], while cartoon face recognition has been less explored and remains a challenging problem. Automatic cartoon face recognition performs poorly in daily-life applications due to the lack of large-scale real-world datasets. Existing datasets (WebCaricature, IIIT-CFW, Manga109) contain limited examples and fail to reflect the true distribution of cartoon faces. Among the existing methods, Takayama et al. extracted features that reflect skin color, hair color, and hair quantity for cartoon face recognition. Saito et al. proposed a CNN-based model for cartoon image retrieval. Zhou et al. constructed ToonNet, which contains thousands of cartoon-styled images, and introduced several techniques for building a deep neural network for cartoon face recognition. Zheng et al. proposed a meta-continual learning method that is capable of jointly learning from heterogeneous modalities such as sketch, cartoon, and caricature images. These pioneering methods bring inspiration to research on cartoon face recognition. In this work, we study automatic cartoon face recognition by adopting the iCartoonFace and Danbooru datasets, which are the largest and most comprehensive datasets for cartoon face recognition. We argue that the shape characteristics can be better exploited for accurate cartoon face recognition.
Jigsaw Puzzle Solving. Many studies solve jigsaw puzzles computationally [8, 31, 23, 37, 3] by relying on low-level texture statistics across fragments. However, solving a jigsaw puzzle based on these cues does not require a full understanding of the global shape of the object. Jigsaw solving has also been studied for pretraining, fine-grained image recognition [7, 11], and mixed-modal image retrieval. Noroozi et al. posed jigsaw solving as a permutation recognition task and introduced the context-free network (CFN). CFN takes image patches as input to learn both a feature mapping of object parts and their correct spatial arrangement. Pang et al. formulated the puzzle in a mixed-modality fashion and learned a pre-training model for fine-grained sketch-based image retrieval. For fine-grained recognition, Chen et al. proposed a novel “Destruction and Construction Learning” (DCL) method that increases the difficulty of fine-grained image recognition by directly training with the destructed images and estimating the locations of the shuffled patches. Du et al. proposed a random jigsaw patch generator that encourages the network to learn features at specific granularities with a progressive training strategy (PMG). Compared with DCL and PMG, our proposed GraphJigsaw does not use destructed images for training and thus avoids introducing noisy visual patterns. We treat each fragment in the jigsaw puzzle as a graph node and exploit a GCN to reason about the proximity and relative placements of the fragments. By solving the puzzle with the GCN in a self-supervised mode, GraphJigsaw learns to identify the discriminative local-to-global shape patterns of the input cartoon faces, which contain little texture information.
III Graph-Based Jigsaw Puzzle Solver
This paper proposes a graph convolutional network based jigsaw solver for cartoon face recognition. The proposed GraphJigsaw does not require any external supervision or additional manual labels. By adding GraphJigsaw to the classification model, i.e., shuffling the features of a preceding block of the network and then solving the jigsaw implicitly by mimicking the spatial layout of a deeper block, the network learns to strengthen its representations and fully perceive the structural characteristics of the input cartoon faces.
Fig. 2 illustrates the framework of our proposed GraphJigsaw, which consists of three parts: (1) Graph construction. (2) Face jigsaw encoding. (3) Face jigsaw decoding. In graph construction, GraphJigsaw shuffles the input feature map in the spatial dimension and constructs a connected graph with the shuffled fragments. In jigsaw encoding, GraphJigsaw encodes and perceives the adjacent correlations among the input graph vertices. In jigsaw decoding, GraphJigsaw decodes the input graph and recovers the correct layout of the shuffled patches. Below, we present the details of the three parts in GraphJigsaw.
III-A Shuffled Graph
Our proposed GraphJigsaw can be incorporated into those conventional convolution networks such as ResNet . A ResNet typically consists of several stages (or block groups), among them are commonly used. The input feature map at the -th intermediate stage is denoted as , where , , are the number of channels, height, width of the feature map. For convenience, we reset the initial value of as 1, i.e., w.r.t . Similarly, we denote the output feature map at the -th stage with . Here, our objective is to incorporate the proposed GraphJigsaw at different intermediate stages, so that those subtle stripes of cartoon faces can be captured in a self-supervised mode.
To incorporate the jigsaw task, we randomly shuffle in the spatial dimension to obtain one jigsaw puzzle, formally,
where is the shuffled representation, and is a permutation operation in the 2D plane. To reduce the computational burden, we may downscale the feature map before shuffling, which in our experience does not significantly degrade performance. Taking the classic pooling strategy as an example, we can obtain the small-scale feature map,
After spatial shuffling, the new representation tends to lose the internal spatial topology information. (For simplicity, we omit the state symbol in the superscript in the following sections.) To mine and recover the spatial relations, we define one graph in the space of the shuffled feature, named the shuffled graph. Concretely, we treat each spatial position as one node, and the feature vector at that position as the node attribute. Formally, the shuffled graph is denoted as, where is the node set with the size , is the adjacency matrix of nodes, and is the attribute matrix. The attribute matrix can be obtained by flattening the feature map in the spatial dimension, i.e., , where the subscripts denote the operated dimensionality. Thus, the feature vector w.r.t the -th node may be denoted with , which corresponds to the -th column of . For the adjacency matrix , we directly employ the spatial position relations of nodes in the shuffled space. If vertices and are spatial neighbors, then , otherwise . One may object that the shuffled feature has discarded the original spatial order after random permutation. In fact, the permutation relations are used only to support self-supervised learning, which attempts to recover the original order. The effect of adding edges is to describe/memorize the current topological structure, so that the jigsaw puzzle can be solved well.
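The construction of the shuffled graph can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`shuffle_grid`, `grid_adjacency`, `shuffled_attributes`) are hypothetical, the toy attributes are invented, and the 4-connected neighborhood is an assumption, since the text above only says that spatially neighboring vertices are connected.

```python
import random

def shuffle_grid(height, width, seed=None):
    """Return a random permutation of the grid cells.

    perm[p] is the original (flattened) index of the fragment that now
    sits at shuffled position p.
    """
    rng = random.Random(seed)
    perm = list(range(height * width))
    rng.shuffle(perm)
    return perm

def grid_adjacency(height, width):
    """0/1 adjacency matrix over the positions of a height x width grid.

    Assumption: 4-connected neighbors (up, down, left, right).
    """
    n = height * width
    adj = [[0] * n for _ in range(n)]
    for r in range(height):
        for c in range(width):
            i = r * width + c
            # Link each cell to its lower and right neighbor; symmetry
            # covers the upper and left neighbors.
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < height and cc < width:
                    j = rr * width + cc
                    adj[i][j] = adj[j][i] = 1
    return adj

def shuffled_attributes(features, perm):
    """Reorder the per-position feature vectors according to perm."""
    return [features[src] for src in perm]

# A 3 x 3 puzzle with toy 2-dimensional node attributes.
features = [[float(k), 0.5 * k] for k in range(9)]
perm = shuffle_grid(3, 3, seed=0)
nodes = shuffled_attributes(features, perm)
adj = grid_adjacency(3, 3)
```

Note that the adjacency matrix is defined on positions in the shuffled space, so it is the same for every permutation; only the node attributes move. The permutation itself is kept aside as the target that self-supervision tries to undo.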
III-B Face Jigsaw Encoding
Given one shuffled graph , GraphJigsaw needs to encode and perceive the adjacent correlations among the input graph vertices for the sake of better decoding at the next stage. To this end, we use the signal diffusion on graphs. Concretely, the encoding process follows a neighborhood aggregation strategy, where we iteratively update the representation of a vertex by aggregating its neighbors. After iterations of aggregation, a vertex’s representation can capture the structural information of its -hop neighbor region. For -th node , the feature aggregation can be formulated as:
where is one mapping function, Agg means an aggregation operation such as max-pooling or weighted summation, means the number of iterations, is the set of nodes adjacent to , and is the total iteration number. Similar to convolution on graphs, the aggregation process may be designed as one network layer, formally,
is a learnable projection matrix and bias vector, is the degree of node , and denotes one nonlinear activation function such as ReLU. Each iteration can be viewed as one convolution layer, so increasing iterations can enlarge the perception scope. Jigsaw encoding aggregates messages of the surrounding vertices, and thus captures the layout of the current disordered state. In other words, the characteristics of the input graph can be well perceived and encoded in such a jigsaw encoding process.
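One round of the neighborhood aggregation described above can be sketched as follows. This is a minimal plain-Python illustration, not the authors' implementation: the function name `aggregate_step` is hypothetical, mean aggregation is one possible choice of Agg, and including the vertex itself in the average is an assumption made here so that every vertex retains its own signal.

```python
def aggregate_step(nodes, adj, weight, bias):
    """One round of mean-neighbor aggregation followed by linear + ReLU.

    nodes:  list of N feature vectors, each of length D_in
    adj:    N x N 0/1 adjacency matrix without self-loops
    weight: D_out x D_in projection matrix (list of rows)
    bias:   list of length D_out
    """
    n = len(nodes)
    d_in = len(nodes[0])
    updated = []
    for i in range(n):
        # Assumption: average over the neighbors plus the vertex itself.
        group = [j for j in range(n) if adj[i][j]] + [i]
        mean = [sum(nodes[j][k] for j in group) / len(group)
                for k in range(d_in)]
        # Learnable projection and bias, then ReLU.
        out = [sum(w * m for w, m in zip(row, mean)) + b
               for row, b in zip(weight, bias)]
        updated.append([max(0.0, v) for v in out])
    return updated

# Path graph 0 - 1 - 2 with scalar features and an identity projection.
path_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h0 = [[1.0], [2.0], [6.0]]
h1 = aggregate_step(h0, path_adj, weight=[[1.0]], bias=[0.0])
h2 = aggregate_step(h1, path_adj, weight=[[1.0]], bias=[0.0])
```

After one round, vertex 0 has only seen vertex 1; after the second round its value also depends on vertex 2, which illustrates how more iterations enlarge the perception scope.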
III-C Face Jigsaw Decoding
In order to solve the jigsaw puzzle, we need to recover the correct layout of the permuted vertices. To this end, we design a graph decoding module to re-organize the jigsaw. The output of Jigsaw encoding, is used as the input of the graph decoding process. To mine the correlation of jigsaw patches in the decoding part, we introduce the self-attention mechanism  by computing the attention coefficients between each pair of vertices. The advantage is two-fold: i) better support the recovery of spatial positions, and ii) reduce the difficulty of self-supervision learning. Formally, we can define the attention weight between nodes and as
where means a single-layer feedforward neural network, denotes the activation function, and is the learned parameter. To make the learned coefficients comparable across different vertices, we normalize the coefficients for each vertex using the softmax function:
To recover Jigsaw orders, we take the deconvolutional strategy similar to the encoding process as used in Eqn. (3) and Eqn. (4). The attention mechanism is integrated into this process. Actually, when the attention weights are directly used as the adjacency relations in the matrix , the adjacency matrix adaptively encodes the feature-wise correlation between the vertices. Thus we may integrate the shuffled states into the attention matrix in order to properly enhance the learning ability of self-supervision, i.e., , where 1 is the indicator function. Through the attention adjacency matrix, we can properly diffuse the messages of vertices and decode the jigsaw puzzle after stacking deconvolutional layers like Eqn. (4).
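The pairwise attention and its softmax normalization can be illustrated with a short sketch. The scoring form used here (a learned vector applied to the concatenation of two vertex features, followed by LeakyReLU) is an assumption in the style of graph attention networks; the text above only specifies a single-layer feedforward network with an activation function. The name `attention_matrix` and the toy values are hypothetical.

```python
import math

def attention_matrix(nodes, a, negative_slope=0.2):
    """Row-normalized attention between every pair of vertices.

    Assumed score: e_ij = LeakyReLU(a . [h_i || h_j]);
    normalization: alpha_ij = softmax over j of e_ij.
    """
    n = len(nodes)
    alpha = []
    for i in range(n):
        scores = []
        for j in range(n):
            pair = nodes[i] + nodes[j]  # concatenate the two feature vectors
            s = sum(w * x for w, x in zip(a, pair))
            scores.append(s if s > 0 else negative_slope * s)
        # Subtract the row maximum for a numerically stable softmax.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha.append([e / z for e in exps])
    return alpha

toy_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = attention_matrix(toy_nodes, a=[0.5, -0.5, 1.0, 0.0])
uniform = attention_matrix(toy_nodes, a=[0.0, 0.0, 0.0, 0.0])
```

Each row of `alpha` sums to one, so the rows are comparable across vertices and the matrix can play the role of a soft adjacency matrix when diffusing messages in the decoder.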
To constrain the decoding process, a natural choice is to supervise GraphJigsaw with the original un-shuffled representation . However, the model will exploit low-level statistics, such as the pixel intensity/color distribution, and chromatic aberration to solve the jigsaw puzzle . To avoid this side effect and force the classification model to absorb the desired structure information, we adopt the output feature map for -th stage as the self-supervised signal. To match the feature map size, we downscale the spatial size of with the similar pooling operation . Therefore, at the -th stage, the discrepancy between the input shuffled information and the corrected layout can be defined as,
where and denote the graph encoding and graph decoding part, respectively. In this Jigsaw encoding and decoding process, GraphJigsaw learns to understand the critical shape information of the input cartoon faces, capture and encode the discriminative shape patterns to identify each patch accurately to infer their semantic correlations. Thus the representation learned by the classification model is reinforced and strengthened. We will verify the effectiveness of GraphJigsaw in section IV.
III-D Integration with the Classifier
Apart from training the cartoon face recognition network with the conventional softmax loss, we incorporate the stage-wise GraphJigsaw in a top-down manner into each stage of the classification network. Let denote the total number of stages; the full objective function can be formulated as,
where is the softmax loss used for cartoon face recognition, and is a trade-off factor that controls the importance of the GraphJigsaw constraint.
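The structure of this objective, a classification loss plus a weighted sum of per-stage jigsaw discrepancies, can be sketched as below. The function names are hypothetical, and the mean squared error is an assumed stand-in for the discrepancy measure, whose exact form is not recoverable from the text. The default trade-off of 0.1 follows the value reported in the implementation details.

```python
def mse(pred, target):
    """Mean squared error between two flattened feature vectors.

    Assumption: used here only as a stand-in discrepancy measure.
    """
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(cls_loss, jigsaw_losses, trade_off=0.1):
    """Classification loss plus the weighted sum of per-stage jigsaw losses."""
    return cls_loss + trade_off * sum(jigsaw_losses)

# Toy values: one discrepancy per stage of a four-stage network.
stage_losses = [0.5, 0.5, 1.0, 2.0]
loss = total_loss(cls_loss=2.0, jigsaw_losses=stage_losses)
```

With these toy numbers the jigsaw terms contribute 0.4 on top of the classification loss of 2.0, which shows how a small trade-off keeps the self-supervised constraint from dominating the recognition objective.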
In this section, we validate the effectiveness of the proposed GraphJigsaw on two recently released cartoon face datasets: iCartoonFace , Danbooru . After the implementation details in Section IV-A, we compare GraphJigsaw with the state-of-the-art face recognition methods and the representative jigsaw-based methods. Then, we analyze different configurations of GraphJigsaw with comprehensive ablation studies. Finally, we visualize the learned attention maps and the image retrieval results to qualitatively verify the effectiveness of the proposed GraphJigsaw.
IV-A Implementation Details
Training: We evaluate GraphJigsaw on two ResNet-based classification networks, i.e., ResNet-50 and ResNet-101, and a DenseNet-based classification network, i.e., DenseNet-169. The input images are resized to a fixed size of and randomly cropped into . Random horizontal flipping is used for data augmentation. A ResNet or DenseNet typically consists of 4 stages (or block groups); each stage in a ResNet consists of multiple bottleneck blocks with residual connections. Similarly, each stage in a DenseNet contains a densely connected block. For each stage in the classification network, GraphJigsaw takes the input of the stage to create one jigsaw puzzle and the output of the stage as the self-supervised signal for solving the puzzle. During training, the category label of the cartoon face images is the only manual annotation used. We implemented all the experiments using PyTorch on a Titan-X GPU with 12GB memory. During the training stage, we set the actual batch size to 96, 60 and 64 for the ResNet-50, ResNet-101 and DenseNet-169 models, respectively. The loss weight in Eqn. (8) was set to 0.1 by grid search. We trained our proposed model for 60 epochs until convergence. Unless otherwise stated, the default size of the jigsaw puzzle is set to in this paper. The influence of the choice of is discussed in Section IV-C.
Directly training the whole network with the GraphJigsaw modules at all four stages simultaneously makes the training hard to converge. Instead, we incorporate a GraphJigsaw in a top-down manner at each training iteration, which allows the model to mine discriminative shape patterns of the input cartoon faces from local details to global structures as the features are gradually passed to higher stages. We verify the effectiveness of this progressive training strategy in Section IV-C.
Inference: At inference time, we simply feed the original cartoon face images into the trained model, and GraphJigsaw is not needed. The proposed method therefore introduces no computational overhead at inference time.
Dataset: We adopt the iCartoonFace and Danbooru datasets for training and evaluation. iCartoonFace is a recently released challenging benchmark dataset for cartoon face recognition. It consists of 389,678 images of 5,013 cartoon characters annotated with identity, bounding box, pose, and other auxiliary attributes. iCartoonFace is currently the largest, high-quality, richly annotated dataset in the field of cartoon face recognition. The test set of iCartoonFace consists of a gallery set and a probe set. The gallery set consists of 2,500 images. The probe set consists of 20,000 images from 2,000 cartoon characters, where the number of images for each character ranges from 5 to 17. Danbooru is an anime character recognition dataset that includes about 0.6M images (0.54M images for training and 10k images for testing). We excluded the categories that have fewer than 10 images in the training set and obtained approximately images from 5127 classes for training. The number of images per category ranges from 10 to 10289. The test set consists of 10000 images from 3670 categories. For the Danbooru test set, we excluded the categories that consist of only one image and obtained 7622 images from 1292 classes for evaluation. The number of images for each character in the test set ranges from 2 to 199.
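The category filtering applied to Danbooru (dropping training categories with fewer than 10 images and test categories with a single image) amounts to a minimum-count filter over the labels. A small sketch follows; the function name `filter_by_min_count` and the toy labels are hypothetical and do not come from the dataset.

```python
from collections import Counter

def filter_by_min_count(labels, min_count):
    """Keep only samples whose category has at least min_count samples.

    Returns the indices of the retained samples and the set of retained
    categories.
    """
    counts = Counter(labels)
    keep = {c for c, n in counts.items() if n >= min_count}
    indices = [i for i, c in enumerate(labels) if c in keep]
    return indices, keep

# Toy labels: category "b" has a single image and is dropped at min_count=2.
toy_labels = ["a", "a", "b", "a", "c", "c"]
kept_indices, kept_classes = filter_by_min_count(toy_labels, min_count=2)
```

Under this sketch, the training filter described above would correspond to `min_count=10` and the test filter to `min_count=2`.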
Evaluation Metric: Following , we report the experimental results of cartoon face recognition with the identification rate of Rank@K. In the identification test setting, the probe set includes cartoon persons and each cartoon person has images. We test each of the images per cartoon person by adding it to the gallery of distractors and using each of the other images as a probe. Given a probe image, we rank all images in the gallery set by their similarity to the probe image. Rank@K is then obtained by computing the percentage of the -th shot in the sorted similarity list.
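A common way to compute Rank@K is to count the fraction of probes whose identity appears among the K most similar gallery images. The sketch below follows that reading and simplifies the protocol described above (it does not model the distractor gallery); cosine similarity, the function names, and the toy features are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_at_k(probes, probe_ids, gallery, gallery_ids, k):
    """Fraction of probes whose identity is among the top-k gallery matches."""
    hits = 0
    for feat, pid in zip(probes, probe_ids):
        # Sort gallery indices from most to least similar to the probe.
        order = sorted(range(len(gallery)),
                       key=lambda j: cosine(feat, gallery[j]),
                       reverse=True)
        top_ids = [gallery_ids[j] for j in order[:k]]
        hits += pid in top_ids
    return hits / len(probes)

# Toy gallery with three identities and two probes.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gallery_ids = ["a", "b", "c"]
probes = [[0.9, 0.1], [0.6, 0.8]]
probe_ids = ["a", "b"]
r1 = rank_at_k(probes, probe_ids, gallery, gallery_ids, k=1)
r2 = rank_at_k(probes, probe_ids, gallery, gallery_ids, k=2)
```

In this toy case the first probe is matched at rank 1 while the second is matched only at rank 2, so Rank@1 is 0.5 and Rank@2 is 1.0. Evaluating the same function over a range of K gives the points of a CMC curve.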
IV-B Comparison with the State of the Art
We compare GraphJigsaw with state-of-the-art face recognition (FR) methods and representative jigsaw-based approaches. For the FR methods, we use the same classification backbone network with three popular FR losses: softmax, CosFace, and ArcFace. Among them, CosFace reformulates the softmax loss as a cosine loss and introduces a cosine margin term to maximize the decision margin in the angular space. ArcFace directly optimizes the geodesic distance margin by virtue of the exact correspondence between the angle and the arc on the normalized hypersphere to obtain highly discriminative face recognition features. We additionally compare GraphJigsaw with the method in on the iCartoonFace dataset. In , Zheng et al. jointly utilize the human and cartoon training images with various discriminative regularizations in a multi-task domain adaptation manner. We term the method in as MTD, as illustrated in Tab. IV.
Comparison with the FR methods: Tab. I and II report the experimental results on the iCartoonFace dataset. The results in Tab. I show that GraphJigsaw outperforms the three compared FR methods (Softmax, CosFace, ArcFace) by 7.26%, 3.38%, 2.67% with the ResNet-50 backbone in Rank@1, Rank@5 and Rank@10, respectively. Similar improvements can be observed with the ResNet-101 backbone, as illustrated in Tab. I. Tab. II shows the experimental results with a DenseNet-169 backbone on the iCartoonFace dataset. Compared with the method in , which jointly trains the DenseNet-169 classification model with both human and cartoon training images in a multi-task domain adaptation manner, our proposed GraphJigsaw obtains consistent improvements on various rank rates. It is worth noting that the method in adopts the CASIA-WebFace dataset as auxiliary training images, while GraphJigsaw exploits only the cartoon face images in the iCartoonFace dataset. Tab. II also shows that GraphJigsaw outperforms the compared FR methods by 3.77%, 3.32%, 2.96% on the three rank rates with the DenseNet-169 network. The experimental results in Tab. I and II show that the improvements of GraphJigsaw are consistent across different network structures.
Tab. III and IV report the experimental results on the Danbooru dataset. GraphJigsaw consistently outperforms the compared FR methods with both the ResNet-based and DenseNet-based backbones. On the two cartoon face datasets, the softmax-based FR method lags behind GraphJigsaw because the standard CNN is biased towards texture and therefore has difficulty perceiving the sparse and critical shape patterns of the cartoon faces. CosFace and ArcFace obtain slight improvements over the softmax-based FR method. However, there is still a significant performance gap when compared with GraphJigsaw.
Comparison with the jigsaw-based methods: When comparing GraphJigsaw with DCL and PMG in Tab. I, II, III and IV, we observe that GraphJigsaw always performs better on the various rank rates. On the iCartoonFace dataset, GraphJigsaw outperforms the two jigsaw-based approaches with 20.33%, 3.26% improvements with the ResNet-50 backbone. GraphJigsaw also obtains 18.45%, 2.92% improvements with the ResNet-101 backbone and 16.48%, 4.06% with the DenseNet-169 backbone. Similar improvements can also be observed on the Danbooru dataset, as illustrated in Tab. III and IV. The experimental results show that the cartoon face representations learned by GraphJigsaw are more discriminative than those of DCL and PMG. This is because directly training with the destructed images introduces noisy visual patterns, and the features learned from these noisy visual patterns are harmful to the classification task. Our proposed GraphJigsaw bypasses this dilemma by solving the jigsaw at the intermediate stages of the backbone network. In addition, the jigsaw puzzles constructed at the top-down stages of the classification model are solved progressively in a self-supervised manner, which helps the model spot the critical shape patterns gradually.
We visualize the Cumulative Match Characteristic (CMC) curves of the proposed GraphJigsaw and the other compared methods based on the ResNet-50 and the DenseNet-169 backbones, as illustrated in Fig. 3. Sub-figures (a), (b) show the CMC curves on the iCartoonFace dataset, and sub-figures (c), (d) show the curves on the Danbooru dataset. The results in Fig. 3 illustrate that GraphJigsaw obtains leading performance among the state-of-the-art FR and jigsaw-based methods at both the low rank rate (Rank@1) and the high rank rate (Rank@10). The considerable improvements of GraphJigsaw over the other compared methods can be observed in Fig. 3 (a), (b), (d). This illustrates that the improvements of the proposed GraphJigsaw are consistent on various rank rates for cartoon face identification. We will show that GraphJigsaw is capable of spotting the critical shape characteristics of the cartoon faces with qualitative visualization results in Section IV-D.
IV-C Ablation Study
We conducted comprehensive experiments to evaluate the influences of various hyper-parameters in GraphJigsaw in order to better understand our method. We also explored the influence of the progressive training strategy.
|Setting|Rank@1|Rank@5|Rank@10|
|Models trained on the iCartoonFace dataset|
|M = 3|83.65|93.42|95.42|
|Models trained on the Danbooru dataset|
|M = 3|52.74|67.40|72.15|
Retrieved images using the GraphJigsaw-learned features (top rows in (a-h)) and the features extracted from the baseline model (bottom rows in (a-h)). Images with red dotted borders are incorrect retrievals. The results indicate that the proposed GraphJigsaw is capable of enhancing the intra-class compactness and inter-class discrepancy of the learned representations. Best viewed in color and zoomed in.
Evaluation of the stage-wise GraphJigsaw: We analyze the performance variations of our method by fully removing the jigsaw puzzle (w/o GraphJigsaw) or adding only one jigsaw puzzle during the training process. As illustrated in Tab. V, without the jigsaw puzzles, the classification model shows a sharp deterioration in cartoon face verification performance. Adding the jigsaw puzzle at any single stage of the classification model yields improvements over the baseline model. With the jigsaw puzzle at each stage in a top-down manner, our proposed method obtains the largest improvements. This suggests that the stage-wise GraphJigsaw is reasonable and feasible.
Evaluation of different numbers of graph vertices: We analyze the influence of the number of graph vertices used in GraphJigsaw; the experimental results are reported in Tab. VI. GraphJigsaw clearly obtains the best performance with , which means that the jigsaw puzzle has fragments. This suggests that the jigsaw puzzle with is too easy: it can be solved without spotting the critical shape patterns of the input cartoon faces. With or , the jigsaw puzzle might be too confusing: because cartoon faces usually have smooth and sparse color blocks, the jigsaw fragments may contain little or even no shape information that can be exploited to solve the puzzle.
We additionally explore the best number of iterations in the face jigsaw encoding (Section III-B) and decoding (Section III-C) parts on the iCartoonFace dataset. The experimental results in Tab. VII show that GraphJigsaw obtains the leading performance with one iteration each in the encoding and decoding parts. This is in line with the conclusion in [33, 49] that stacking more layers in a GCN causes over-smoothing, eventually leading the features of the graph vertices to converge to the same value.
We show the cartoon face identification results without the jigsaw (w/o jigsaw) or without the progressive training strategy (w/o progressive) in Tab. VII. The jigsaw task clearly plays a vital role in spotting the critical shape characteristics of the input cartoon faces. Without solving the jigsaw task, the classification model shows a significant performance decrease, because the standard CNN-based classification model is biased towards recognizing textures rather than shape patterns. With the proposed progressive training strategy, the classification model is capable of solving the stage-wise jigsaw puzzles in a progressive manner and absorbing the discriminative shape characteristics in a top-down manner.
Visualization of the attention maps: To investigate how well the shape characteristics are spotted progressively, we visualize the stage-wise attention maps of our method with Grad-CAM in Fig. 4. We illustrate only the activation maps of the last three stages, as the map of the first stage is not visually informative. In each sub-figure in Fig. 4, the top row shows the attention maps of our proposed GraphJigsaw, and the bottom row shows the attention maps of the baseline model that only exploits the softmax loss for training.
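For reference, the standard Grad-CAM formulation is recalled below. The symbols follow the Grad-CAM paper rather than notation defined in this work: $A^k$ is the $k$-th feature map of the chosen layer, $y^c$ the score of class $c$, $Z$ the number of spatial positions in $A^k$, and $\alpha_k^c$ the resulting channel weight.

```latex
\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{ij}},
\qquad
L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^c A^k \right)
```

Applied stage by stage, the feature maps $A^k$ would be taken from the output of each network stage, which is presumably how the stage-wise maps in Fig. 4 are obtained.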
As illustrated in Fig. 4, the top row of each sub-figure (a-h) shows that our model concentrates on increasingly meaningful regions of the input cartoon faces as the stages progress. In detail, at the second stage our model focuses on various meaningful local regions such as the hair, eyes, and mouth. At the third stage, it starts to pay attention to the global shape of the input cartoon face. Finally, at the fourth stage, the model spots the vital shape information that encodes the identity characteristics. In Fig. 4 (a-f), the input cartoon characters are distinguished by their shape characteristics, e.g., the triangular strips around the neck in (a), the ornaments on the hair in (b), the beard in (c), and the hairstyle in (d), (e), and (f). Our proposed method spots these critical characteristics gradually from column 1 to column 3 in each sub-figure. In comparison, the bottom rows of the sub-figures in Fig. 4 show that the baseline model hardly spots the critical areas with sparse shape characteristics and fails to perceive the shape patterns in the intermediate stages. The baseline model shows only partially correct attention at the last stage.
We additionally illustrate two cartoon sketches in Fig. 4 (g) and (h), where the input cartoon faces are represented as binary-colored sketches and the identity information relies heavily on shape rather than texture. Our model successfully captures the shape patterns in such extreme conditions. The results also indicate that the baseline model is texture-biased and is not capable of spotting critical shape information without extra constraints.
Visualization of the image retrieval results: To further investigate the distinctiveness of the GraphJigsaw-learned features, we visualize the retrieved results of several query images using the GraphJigsaw-learned features and the baseline-learned features in Fig. 5. The former and the latter are shown in the top and bottom rows of each sub-figure (a-h) in Fig. 5, respectively. Images with red-dotted borders are incorrectly retrieved. When using the GraphJigsaw-learned features, the retrieved images belong to the same cartoon face categories as the query images. For example, our model works well for the query images in sub-figures (a), (b), (c), (e), (f), (g) in Fig. 5, which are quite confusing as they contain little texture information. The baseline model shows dramatically reduced performance and fails to capture the subtle shape differences between the query image and the mis-retrieved images. Besides, sub-figures (g) and (h) in Fig. 5 show the retrieved results for images of elves. The proposed GraphJigsaw shows consistent retrieval performance, while the baseline model fails in most cases. This indicates that GraphJigsaw is capable of enhancing the intra-class compactness and inter-class discrepancy of the learned representations. The qualitative image retrieval results are in line with the quantitative experimental results in Section IV-B. They also highlight the importance of spotting the shape characteristics for cartoon face recognition.
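A retrieval step of this kind can be sketched as a nearest-neighbor search over feature vectors. The sketch below assumes cosine similarity as the ranking score, which the text does not specify. The function names, labels, and feature values are hypothetical toy data, not features produced by GraphJigsaw.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query, gallery, k):
    """Return the labels of the k gallery items most similar to the query."""
    ranked = sorted(gallery, key=lambda item: cosine(query, item[1]), reverse=True)
    return [label for label, _ in ranked[:k]]

# Toy gallery of (identity label, feature vector) pairs for two characters.
gallery = [
    ("A", [0.9, 0.2, 0.0]),
    ("A", [1.0, 0.0, 0.1]),
    ("B", [0.0, 1.0, 0.0]),
    ("B", [0.1, 0.9, 0.2]),
]
query = [1.0, 0.1, 0.0]  # feature of a query image of character "A"
top2 = retrieve(query, gallery, k=2)
```

When features of the same identity are compact and features of different identities are well separated, the top-ranked labels match the query identity, which is the behavior described for the top rows of Fig. 5.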
In this work, we have presented GraphJigsaw to spot the critical shape patterns that are discriminative for cartoon face recognition. Our method is inspired by the fact that cartoon faces usually contain smooth color blocks, so the key to recognizing cartoon faces is to precisely spot their critical shape information. We achieve this goal by constructing jigsaw tasks at various stages in the classification network and solving the tasks with the graph convolutional network in a progressive manner. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art face recognition methods and jigsaw-based learning methods. Our method is conceptually elegant, and we hope it will shed light on understanding and improving the performance of cartoon face recognition.
- (2019) Danbooru2018: a large-scale crowdsourced and tagged anime illustration dataset. https://www.gwern.net/Danbooru2018. Accessed: DATE.
- (2018) Deep convolutional networks do not classify based on global object shape. PLoS computational biology 14 (12), pp. e1006613.
- (2020) Solving jigsaw puzzles with eroded boundaries. In , pp. 3526–3535.
- (2018) Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pp. 67–74.
- (2002) PicToon: a personalized image-based cartoon system. In Proceedings of the tenth ACM international conference on Multimedia, pp. 171–178.
- (2018) Cartoongan: generative adversarial networks for photo cartoonization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9465–9474.
- (2019) Destruction and construction learning for fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5157–5166.
- (2010) A probabilistic image jigsaw puzzle solver. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 183–190.
- (2018) User-guided deep anime line art colorization with conditional adversarial networks. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1536–1544.
- (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699.
- (2020) Fine-grained visual classification via progressive multi-granularity training of jigsaw patches. In European Conference on Computer Vision, pp. 153–168.
- (2016) Manga109 dataset and creation of metadata. In Proceedings of the 1st international workshop on comics analysis, processing and understanding, pp. 1–5.
- (2019) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In ICLR.
- (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European conference on computer vision, pp. 87–102.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
- (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708.
- (2020) Curricularface: adaptive curriculum learning loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5901–5910.
- (2017) Webcaricature: a benchmark for caricature recognition. arXiv preprint arXiv:1703.03230.
- (2021) Shape or texture: understanding discriminative features in cnns. In ICLR.
- (2016) The megaface benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4873–4882.
- (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
- (2011) Guided face cartoon synthesis. IEEE Transactions on Multimedia 13 (6), pp. 1230–1239.
- (2011) Automated assembly of shredded pieces from multiple photos. IEEE transactions on multimedia 13 (5), pp. 1154–1162.
- (2017) Sphereface: deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 212–220.
- (2017) Richer convolutional features for edge detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3000–3009.
- (2011) Personalization in multimedia retrieval: a survey. Multimedia Tools and Applications 51 (1), pp. 247–277.
- (2016) IIIT-cfw: a benchmark database of cartoon faces in the wild. In European Conference on Computer Vision, pp. 35–47.
- (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pp. 69–84.
- (2020) Solving mixed-modal jigsaw puzzle for fine-grained sketch-based image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10347–10355.
- (2017) Automatic differentiation in pytorch.
- (2011) A fully automated greedy square jigsaw puzzle solver. In CVPR 2011, pp. 9–16.
- (2021) DAF: re: a challenging, crowd-sourced, large-scale, long-tailed dataset for anime character recognition. arXiv preprint arXiv:2101.08674.
- (2020) Dropedge: towards deep graph convolutional networks on node classification. In ICLR.
- (2015) Illustration2vec: a semantic vector representation of illustrations. In SIGGRAPH Asia 2015 Technical Briefs, pp. 1–4.
- (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
- (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626.
- (2014) Solving square jigsaw puzzles with loop constraints. In European Conference on Computer Vision, pp. 32–46.
- (2012) Face detection and face recognition of cartoon characters using feature extraction. In Image, Electronics and Visual Computing Workshop, pp. 48.
- (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
- (2011) Sketch2Cartoon: composing cartoon images by sketching. In Proceedings of the 19th ACM international conference on Multimedia, pp. 789–790.
- (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5265–5274.
- (2020) Learning to cartoonize using white-box cartoon representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8090–8099.
- (2018) A light cnn for deep face representation with noisy labels. IEEE Transactions on Information Forensics and Security 13 (11), pp. 2884–2896.
- (2014) Learning face representation from scratch. arXiv preprint arXiv:1411.7923.
- (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833.
- (2014) Data-driven face cartoon stylization. In SIGGRAPH Asia 2014 Technical Briefs, pp. 1–4.
- (2020) Learning from the past: meta-continual learning with knowledge embedding for jointly sketch, cartoon, and caricature face recognition. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 736–743.
- (2020) Cartoon face recognition: a benchmark dataset. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 2264–2272.
- (2020) Graph neural networks: a review of methods and applications. AI Open 1, pp. 57–81.
- (2016) 3D cartoon face generation by local deformation mapping. The Visual Computer 32 (6), pp. 717–727.
- (2018) ToonNet: a cartoon image dataset and a dnn-based semantic classification system. In Proceedings of the 16th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry, pp. 1–8.
- (2021) WebFace260M: a benchmark unveiling the power of million-scale deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10492–10502.