Facial Landmark Detection for Manga Images

11/08/2018 ∙ by Marco Stricker, et al.

The topic of facial landmark detection has been widely covered for pictures of human faces, but it is still a challenge for drawings. Indeed, the proportions and symmetry of standard human faces are not always respected in comics or manga. The personal style of the author, the limited use of colors, etc. make landmark detection on drawn faces a difficult task. Detecting the landmarks on manga images will be useful to provide new services for easily editing the character faces, estimating the character emotions, or automatically generating animations such as lip or eye movements. This paper contains two main contributions: 1) a new landmark annotation model for manga faces, and 2) a deep learning approach to detect these landmarks. We use the "Deep Alignment Network", a multi-stage architecture where the first stage makes an initial estimation which gets refined in further stages. The first results show that the proposed method succeeds in accurately finding the landmarks in more than 80




1 Introduction

Drawings are used to represent real or imaginary situations, persons, objects, ideas, etc. In painting, some styles such as realism or hyperrealism aim at a realistic depiction, but many styles such as expressionism, cubism or surrealism manipulate the shapes, colors and textures to convey particular emotions or expressions. Compared to real-world pictures, drawings contain a much wider variety of content and are more complex to analyze inoue2018cross .

In a similar way, manga (Japanese comics) have a large variety of styles, content, and genres such as comedy, historical drama, science fiction, fantasy, etc. Detecting the characters in manga images is not a trivial problem because the protagonists of the stories are very diverse: humans, animals, monsters, etc. Even when the characters are humans, their faces can have large deformations chu2017manga .

The manga market in Japan is very large. In 2016, the All Japan Magazine and Book Publisher's and Editor's Association (AJPEA) reported (http://www.ajpea.or.jp/information/20170224/index.html) that the sales of manga in Japan reached approximately 3.9 billion USD. This report also shows that the digital market almost doubled between 2014 and 2016.

The analysis of comics and manga images recently sparked the interest of the computer vision and document analysis communities augereau2018survey . The digital version of manga can be used by researchers to propose new algorithms and services such as dynamic visualization of manga augereau2016comic , adding colors kataoka2017automatic , generating animations gupta2018imagine , creating new kinds of recommender systems daiku2017comic , etc.

In Section 2, we start by giving a brief overview of the current state of the art in landmark detection and of the work which has been done on manga faces. Section 3 introduces our new landmark annotation model for manga faces and the dataset we created. Section 4 gives a brief overview of the Deep Alignment Network kowalski2017deep which we have used for detecting facial landmarks on manga images. Section 5 details the experiments and how we evaluate them. The results are described in Section 6. We end the paper with a conclusion in Section 7.

2 State of the art

2.1 Landmark detection

Facial landmark detection, or face alignment, consists in localizing several parts of the face in an image such as the eyebrows, eyes, mouth, nose, chin, etc. This task has been heavily investigated for human faces, and is still a very active research domain, as can be seen in recent publications kowalski2017deep ; bulat2017far ; lv2017deep ; wu2017simultaneous . There are different approaches for detecting facial landmarks. Before the success of deep learning, researchers used shape indexed features xiong2013supervised ; ren2014face . This approach focused on extracting local image features. The latest publications focus on deep learning approaches, mostly using CNNs (Convolutional Neural Networks). The emergence of large databases, which are crucial for deep learning approaches, plays an important role in face alignment sagonas2016300 ; sagonas2013semi . Instead of using local patches, the Deep Alignment Network method proposed by Kowalski et al. kowalski2017deep is based on the entire image, and consists of multiple stages. Each stage takes the output of the previous stage as an input, and tries to refine it. Another approach, proposed by Lv et al. lv2017deep , improves the initial estimation of landmarks using a deep regression architecture.

A problem in face alignment is facial occlusion, which has been tackled by Wu et al. wu2017simultaneous . They combined landmark detection, pose estimation, and deformation estimation in one approach in order to make the landmark detection robust to occlusions. Aside from the many approaches for human faces, Rashid et al. rashid2017interspecies proposed a method for animal faces. In this method the animal face is warped to resemble a human face, so that standard human landmark detectors may work on it.

2.2 Landmark models

Several different models to label the facial landmarks have been proposed. One of the simplest models utilizes only five landmarks zhang2016joint : two in the center of the eyes, one for the nose and two for the corners of the mouth. One of the most complex models is based on 194 control points kasinski2008put ; le2012interactive corresponding to the face, nose, eyes, eyebrows, and mouth outlines.

However, the iBUG model sagonas2016300 based on 68 landmarks, which is used by the 300W sagonas2016300 and Menpo zafeiriou2017menpo challenges, seems to be the current state of the art. This model is also used by libraries such as OpenFace amos2016openface .

2.3 Manga image analysis

For manga and comics image analysis, some related research has been done on generic object recognition in drawings inoue2018cross and, more specifically, on character face detection chu2017manga ; jha2018towards .

Landmark detection for cartoon images has been introduced by Jha et al. jha2018towards . Their dataset consists of caricature images with different painting styles and is available online. The authors have made the landmark annotations of 750 face images available. They defined 15 landmarks for one face: six for the eyes, four for the eyebrows, one for the nose and four for the mouth. Surprisingly, the authors achieved better landmark detection performance by using only real face images in the training set. Using only caricature images, or mixing caricature images and real face images, degraded the performance of their system.

3 New landmark model

As drawn faces have more variations than real human faces, we propose to use 60 landmarks, including landmarks for the chin contour. In this section we will describe our dataset and our landmark model.

3.1 Data Sources

Manga images which can be used for research and shared publicly are scarce because of copyright issues. However, Matsui et al. matsui2017sketch released "Manga109", a dataset of 109 manga volumes for research purposes.

Chu et al. chu2017manga used this dataset to train a CNN for detecting the face bounding boxes on manga images. To achieve this, they manually labeled the faces in 66 volumes. They used 50 pages per title and published a subset of this dataset, which can be freely downloaded MangaFace . This subset contains face bounding boxes from 24 different volumes. We have used this face bounding box information in order to extract the faces from the manga pages.

3.2 Image Selection

From the dataset of Chu et al. we could extract 5505 face images. However, we decided to remove some of these images. First of all, a face may appear in different poses. A major difference exists between a frontal face, which shows all facial features, and a profile view of the face, which only shows half of the features. Therefore, in the current state of the art (iBUG model zafeiriou2017menpo ), two different annotation models exist to describe frontal faces and profile faces. We decided to focus on frontal faces and therefore removed manga faces with a profile view.

After that, we removed all images whose resolution was too small to correctly annotate the landmarks. In an image, the facial features should be clearly distinguishable by a human. If the width or height of an image was smaller than 80 pixels, the image was removed.

In contrast to human faces, which follow regularities, the creativity of manga authors knows no bounds, and therefore faces may have inhuman features and can look like something we would not consider human. A face image needed to have exactly two eyes, one nose, one mouth and one chin visible; otherwise it was discarded.

Lastly, we removed faces in which one or both eyes were hidden by glasses or hair. Our final dataset consists of 1446 face images.

3.3 Landmarks Model

We decided to employ a model similar to the iBUG model sagonas2013300 with 68 landmarks. Unfortunately a one-to-one transition is not possible, since not all features that can be found on a human face can be found on a manga face. We therefore adapted the iBUG model to manga face images by removing some landmarks and adding others. Figure 1 highlights some typical landmarks that usually appear in a picture of a human but not in a manga drawing. In manga, the nose of a character is usually simplified to a line, so it is difficult to define an outer part and an inner part. The same goes for the mouth, where the lips are usually not drawn but simplified to a line. The corners of the eyes and the mouth are not always clearly drawn in manga images.

Figure 1: Differences between the landmarks of a human face and a manga face. The corners of the eyes and mouth are not drawn. The bottom part of the nose is also missing. Manga face from “PrismHeart” by Asatsuki Mai

In the iBUG model, the landmarks for the eyes are spread equidistantly on both eyelids starting from the intersection of the eyelids. On manga faces, this intersection is rarely drawn. Since many characters have round eyes, we placed 10 equidistant landmarks on each eye. This results in an accurate representation of the contour of the eyes, as can be seen in Fig. 2. We also decided to add one landmark to define the pupil of each eye of the character.

We applied the same technique for the mouth and eyebrows of a character. Since they mostly consist of one line in manga images, we use 10 landmarks to define the line of the lip and 5 for each eyebrow.

Noses in manga faces are mostly simplified compared to human features. The nostrils are not drawn and the nose dorsum line is sometimes only implied. Therefore, we also decided to simplify our landmarks for the nose compared to the iBUG model and use only one landmark to define the tip of the nose, as can be seen in Fig. 2.

Lastly we followed the landmarks positioning proposed by the iBUG model for the chin contour. To sum up, our landmark model consists of the following 60 landmarks:

  • 5 landmarks for each eyebrow

  • 10 landmarks for each eye

  • 1 landmark for each pupil

  • 1 landmark for the nose

  • 10 landmarks for the mouth

  • 17 landmarks for the chin contour

Figure 2: An example showing what the final landmark model looks like. Manga face taken from "MeteoSanStrikeDesu" by Takuji
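The composition of the landmark model listed above can be written down as a small lookup table. The following is only an illustrative sketch: the part names and the order in which the parts are stored are assumptions, since the text does not specify how the 60 landmarks are indexed.

```python
from collections import OrderedDict

# Landmark counts per facial part, taken from the list above.
# The part names and their ordering are assumptions made for illustration.
LANDMARK_COUNTS = OrderedDict([
    ("left_eyebrow", 5),
    ("right_eyebrow", 5),
    ("left_eye", 10),
    ("right_eye", 10),
    ("left_pupil", 1),
    ("right_pupil", 1),
    ("nose", 1),
    ("mouth", 10),
    ("chin", 17),
])


def index_ranges(counts):
    """Map each part name to a (start, stop) range in a flat landmark array."""
    ranges, start = {}, 0
    for name, n in counts.items():
        ranges[name] = (start, start + n)
        start += n
    return ranges


RANGES = index_ranges(LANDMARK_COUNTS)
TOTAL = sum(LANDMARK_COUNTS.values())  # 60 landmarks in total
```

With such a table, the landmarks of one part can be sliced out of a flat array of 60 points, for example `points[slice(*RANGES["chin"])]`.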

3.4 Labeling Process

Each of the 1446 images of our dataset was labeled by at least one participant. To control the quality of the labeling, 656 images were labeled twice. If the distance between the two annotations of a landmark was greater than 2 pixels, this landmark was manually corrected and compared again. With this procedure, wrong landmarks could be found and corrected. The two labels were then merged by computing the spatial average for each landmark.
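The quality check and the merge described above can be sketched in a few lines. This is a minimal illustration, assuming that each annotation is stored as an array of (x, y) coordinates with one row per landmark; the function names are hypothetical.

```python
import numpy as np


def disagreeing_landmarks(labels_a, labels_b, max_dist=2.0):
    """Return the indices of landmarks whose two annotations differ by more than max_dist pixels."""
    a = np.asarray(labels_a, dtype=float)
    b = np.asarray(labels_b, dtype=float)
    dist = np.linalg.norm(a - b, axis=1)  # per-landmark Euclidean distance
    return np.flatnonzero(dist > max_dist)


def merge_labels(labels_a, labels_b):
    """Merge two annotations by taking the spatial average of each landmark."""
    a = np.asarray(labels_a, dtype=float)
    b = np.asarray(labels_b, dtype=float)
    return (a + b) / 2.0
```

The indices returned by `disagreeing_landmarks` are the ones that would be sent back for manual correction before calling `merge_labels`.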

The participants were asked not to label facial parts that were not visually present. If the chin contour, the mouth contour and the two eyes were labeled, the missing parts were added automatically: (1) the nose is calculated as the mass center of the two eyes and the mouth, (2) the pupil is computed as the mass center of two landmarks of the eye. (3) If one eyebrow is missing, a mapping matrix is computed between the eye landmarks, which describes how to map the landmarks from one eye to the other. The existing eyebrow is then transformed with this mapping matrix to calculate the missing eyebrow. (4) If both eyebrows are missing, the landmarks which define the upper eyelid are copied, scaled and translated above the eyes.
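Steps (1) to (3) can be illustrated with a short sketch. Several choices here are assumptions rather than details given in the text: the nose is taken as the mean of the three part centroids, the pupil as the centroid of the landmarks passed in, and the mapping matrix is modeled as a least-squares affine transform between corresponding eye landmarks.

```python
import numpy as np


def centroid(points):
    """Mass center of a set of 2D points."""
    return np.asarray(points, dtype=float).mean(axis=0)


def estimate_nose(left_eye, right_eye, mouth):
    """Nose as the mass center of the two eyes and the mouth (mean of part centroids, an assumption)."""
    return np.mean([centroid(left_eye), centroid(right_eye), centroid(mouth)], axis=0)


def fit_affine(src, dst):
    """Least-squares affine map (3x2 matrix) sending src points onto dst points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_h = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coordinates
    matrix, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return matrix


def apply_affine(matrix, points):
    """Apply a 3x2 affine matrix to 2D points."""
    pts = np.asarray(points, dtype=float)
    return np.hstack([pts, np.ones((len(pts), 1))]) @ matrix


def estimate_missing_eyebrow(known_eye, other_eye, known_eyebrow):
    """Map the existing eyebrow with the transform that sends its eye onto the other eye."""
    return apply_affine(fit_affine(known_eye, other_eye), known_eyebrow)
```

A missing pupil can be estimated with `centroid` applied to the chosen eye landmarks. The affine model needs at least three non-collinear eye landmarks, which the 10 landmarks per eye provide.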

4 Proposed method

For the landmark detection we use an adapted version of the Deep Alignment Network kowalski2017deep (DAN). The DAN was published in 2017 and achieves state of the art results for human face alignment in pictures. We adapted the DAN to change the number of landmarks. As described in Section 3.3, we use a landmark model with 60 landmarks instead of the 68-landmark iBUG model for which the DAN was initially built. We now briefly summarize the network.

The DAN is a multi-stage network and uses an approach similar to Cascade Shape Regression (CSR) yang2015facial . It initializes a first face shape, which gets improved in the next stages of the network. In this model, each stage performs feature extraction and regression. A difference between DAN and most CSR methods is that CSR methods perform feature extraction on patches around the landmarks, whereas DAN uses a landmark heat map as an input in order to extract features from the whole image.

The input of the first stage is the input image. This stage calculates a landmark heatmap $H_1$ and a feature image $F_1$. After that, $H_1$, $F_1$ and the input image are the input for the second stage. This stage also calculates a heatmap $H_2$ and a feature image $F_2$, which are then forwarded to the next stage. This procedure is repeated for the number of stages.

Each stage calculates the landmark locations and the input for the next stage. The landmark location is handled by a feed-forward neural network, while connection layers are responsible for the input for the next stage.

During the training process, DAN starts by training the first stage until the validation error stops improving. After that, the next stage is included in the training until the validation error stops improving again, and so forth.
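The stage-wise training schedule can be sketched as a simple loop. This is not the DAN implementation: `run_epoch` is a hypothetical callable that trains the first `stage` stages for one epoch and returns the validation error, and "stops improving" is interpreted here as no improvement for `patience` consecutive epochs, which is an assumption.

```python
def train_stagewise(num_stages, run_epoch, patience=3, max_epochs=150):
    """Add one stage at a time; move on when the validation error stops improving.

    run_epoch(stage) is a hypothetical callback: it trains stages 1..stage
    for one epoch and returns the current validation error.
    """
    history = []
    for stage in range(1, num_stages + 1):
        best, stale = float("inf"), 0
        for _ in range(max_epochs):
            val_err = run_epoch(stage)
            history.append((stage, val_err))
            if val_err < best:
                best, stale = val_err, 0
            else:
                stale += 1
                if stale >= patience:
                    break  # validation error stopped improving: include the next stage
    return history
```

The returned history records which stage was active in each epoch, which makes it easy to plot the validation error per stage.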

5 Experiment

With the DAN and the dataset described above we performed multiple experiments. Since the DAN can be trained with multiple stages, we trained it with one and with two stages. All models were trained for 150 epochs.

5.1 Split

The dataset is divided into a training, a validation, and a test set by a random split. With a random split, the faces are distributed randomly between the three sets. Therefore each set will likely contain faces from every title. We divided the images into 80% training set, 10% validation set and 10% test set.
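A random 80/10/10 split can be sketched as follows. The seed and the rounding of the set sizes are assumptions, so the exact set sizes of the experiments are not necessarily reproduced.

```python
import random


def random_split(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the items and cut them into training, validation and test sets."""
    items = list(items)
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    rng.shuffle(items)
    n_train = int(round(train_frac * len(items)))
    n_val = int(round(val_frac * len(items)))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remaining items form the test set
    return train, val, test
```

For example, `random_split(range(1446))` returns three disjoint lists of image indices that together cover the whole dataset.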

5.2 Augmentation

In order to avoid overfitting and to increase the accuracy of the model, we applied data augmentation and increased our training set size by a factor of five. Each image was randomly augmented by translation, rotation and scaling. The amount by which each operation was applied is a random number which follows a Gaussian probability density, with a different standard deviation for each operation, as described in equation 1, where $\mu$ is the mean and $\sigma$ is the standard deviation. We applied five random transformations for each image.

$$ f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{(x - \mu)^2}{2 \sigma^2}} \qquad (1) $$

For the different transformations we have used the following $\mu$'s and $\sigma$'s:

  • rotation angle in degrees: and

  • scaling factor: and

  • translation factors and : and

Furthermore, the two translation factors are multiplied by the width and height of the scaled mean shape, respectively. The mean shape is calculated by averaging the landmark positions over all faces in the training set. Scaling it into a sub-rectangle of the image defined by certain margins results in the scaled mean shape.
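The augmentation procedure can be sketched on the landmark coordinates alone. The means and standard deviations in `PARAMS` are placeholder values chosen for illustration, not the ones used in the experiments, and the corresponding warp of the image itself is omitted.

```python
import numpy as np

# Hypothetical (mean, standard deviation) pairs; NOT the values used in the experiments.
PARAMS = {
    "rotation_deg": (0.0, 10.0),
    "scale": (1.0, 0.05),
    "translation": (0.0, 0.05),
}


def random_transform(landmarks, shape_size, rng, params=PARAMS):
    """Apply one random rotation, scaling and translation to the landmarks.

    shape_size is the (width, height) of the scaled mean shape, by which the
    translation factors are multiplied.
    """
    pts = np.asarray(landmarks, dtype=float)
    angle = np.deg2rad(rng.normal(*params["rotation_deg"]))
    scale = rng.normal(*params["scale"])
    tx = rng.normal(*params["translation"]) * shape_size[0]
    ty = rng.normal(*params["translation"]) * shape_size[1]
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    center = pts.mean(axis=0)
    # Rotate and scale around the shape center, then translate.
    return (pts - center) @ rotation.T * scale + center + np.array([tx, ty])


def augment(landmarks, shape_size, n_copies=5, seed=0):
    """Create n_copies randomly transformed versions of one landmark set."""
    rng = np.random.default_rng(seed)
    return [random_transform(landmarks, shape_size, rng) for _ in range(n_copies)]
```

Because the translation is expressed as a fraction of the scaled mean shape, the same parameters can be used for images of any size.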

Therefore our translations are within rotations of and , the scaling is within 90% and 110% and the translations are independent from the image size. In summary, we ran a total of 4 experiments. The combinations are shown in Table 1.

Data Augmentation    Stages
Yes                  1
Yes                  2
No                   1
No                   2
Table 1: All combinations of experiments we have run

5.3 Evaluation

The evaluation is done by calculating the distance between a ground truth landmark and the corresponding predicted landmark, divided by a normalization factor. For human faces, the normalization factor is often chosen to be the distance between the two eye mass centers. However, we decided to introduce a new distance for manga faces.

We decided to use the chin normalized distance instead of the distance between the two eye mass centers. The distance between the two eye mass centers is commonly used because, for genetic reasons, the eyes of human faces mostly follow the same proportions. However, this is not true for manga faces. The proportions and sizes of manga eyes vary a lot, so that the reader can easily distinguish different characters. In order to avoid such variations, which might influence the measured performance, we decided to use the distance between the first and the last landmark of the chin contour, the "chin normalized distance", as a normalization factor. This distance is also used as a loss function for training. The chin normalized distance can be seen in Fig. 3.

For one image, if the average distance (in other words, the mean error) over all landmarks is greater than a threshold, then the predicted landmarks are considered a failure. Furthermore, we calculated the area under the cumulative error distribution curve up to a threshold, divided by that threshold (denoted AUC).

In detail, we calculate the average distance between ground truth landmarks and predictions on a single face with:

$$ e = \frac{1}{N} \sum_{i=1}^{N} \lVert p_i - g_i \rVert_2 $$

where $N$ defines the total number of landmarks, $p_i$ is the prediction coordinate, $g_i$ is the ground truth coordinate and $\lVert \cdot \rVert_2$ the Euclidean distance. The chin normalized distance $d_{chin}$ is the Euclidean distance between the first chin landmark and the last chin landmark, as can be seen in Fig. 3. Therefore the normalized average distance per face is:

$$ e_{norm} = \frac{e}{d_{chin}} $$

Figure 3: Illustration of the chin normalized Distance. Manga face taken from “MeteoSanStrikeDesu” by Takuji

Lastly we consider an image a failure if the is bigger than the threshold of . During the manual labeling phase, we have labeled some images twice. The largest distance between two sets of landmarks for one image that we have encountered was . Therefore humans still consider this distance as the same point. This means, that if our automatic prediction is within this range we can say that this was a success, or at least within human capabilities. The failure rate defines the percentage of images which are considered a failure.
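The three evaluation measures can be sketched as follows. The indices of the first and last chin landmarks and the thresholds are left as parameters, since they depend on how the landmarks are stored; the AUC is approximated numerically with the trapezoidal rule, which is an implementation choice.

```python
import numpy as np


def chin_normalized_error(pred, gt, chin_first, chin_last):
    """Mean landmark distance divided by the distance between the chin end points."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mean_dist = np.linalg.norm(pred - gt, axis=1).mean()
    chin_dist = np.linalg.norm(gt[chin_first] - gt[chin_last])
    return mean_dist / chin_dist


def failure_rate(errors, threshold):
    """Percentage of images whose normalized error exceeds the threshold."""
    errors = np.asarray(errors, dtype=float)
    return 100.0 * np.mean(errors > threshold)


def auc(errors, threshold, steps=1000):
    """Area under the cumulative error distribution up to threshold, divided by threshold."""
    errors = np.asarray(errors, dtype=float)
    xs = np.linspace(0.0, threshold, steps)
    ced = np.array([np.mean(errors <= x) for x in xs])  # fraction of images with error <= x
    area = np.sum((ced[1:] + ced[:-1]) / 2.0 * np.diff(xs))  # trapezoidal rule
    return area / threshold
```

With this normalization the AUC lies between 0 and 1, where 1 corresponds to a perfect prediction on every image.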

6 Results

First of all we investigate the training loss curves, shown in Figure 4 for a one-stage network and in Figure 5 for a two-stage network. As we can see, the overall loss of the two-stage DAN is lower than that of the one-stage DAN. However, in both cases the validation loss is significantly higher than the training loss. This means that the network does not generalize well and performs worse on unseen samples. The figures also show that this gap gets smaller if data augmentation is applied. Therefore we assume that our network could perform better if it had access to more data.

Table 2 shows the mean error, the AUC and the failure rate for all experiments we have run. As can be seen, the model trained for two stages with data augmentation and random split is the best performing model. Data augmentation and the second stage each improved the performance. Lastly, we also experimented with a three-stage model, but it did not improve the performance any further.

Figure 4: Loss curves for a one stage network
Figure 5: Loss curves for a two stage network
Stages    Augmentation    Mean Error    AUC        Failure Rate (%)
1         Yes             0.03933       0.10338    48.28
1         No              0.04355       0.08964    52.41
2         Yes             0.02935       0.24295    19.31
2         No              0.03467       0.16357    37.93
Table 2: Results of the experiments. The best failure rate is 19.31% (two stages with augmentation)

Figure 6 shows the best success and failure cases for the two-stage DAN using augmented data. As we can see, our method performs well for cases where the face is in a normal frontal pose. However, it is difficult to find the landmarks on drawings where the facial expressions are exaggerated, such as a large open mouth or closed eyes.

Figure 6: Visualization of the best and worst landmark predictions. The left column shows the best predictions while the right column shows the worst predictions. The images from the left, top to bottom are taken from: PrismHeart by Asatsuki Mai, 2nd and 3rd are Ningyoushi by Omi Ayuko, PlatinumJungle by Shinohara Masami. The images from the right, top to bottom are taken from: RinToSiteShippuNoNaka by Ide Chikae, OL_Lunch by Sanri Youko, RisingGirl by Hikochi Sakuya, ParaisoRoad by Kanno Hiroshi

7 Conclusions

This paper presents three contributions to facial landmark detection for manga images: (1) we created a landmark annotation model for manga faces inspired by iBUG, (2) we created a manga face landmark dataset and opened it to the public to encourage further research on this topic, and (3) we implemented one of the first landmark detection models for manga faces and showed encouraging results.

Our results can be used to further improve landmark detection systems for manga faces. Furthermore, such landmark information can be used for many different applications, for example to improve the accuracy of character recognition systems, as has been shown in jha2018towards , or as additional information for other applications such as emotion recognition. The landmarks can also be used for animating the face of a character or for transferring the emotion or style of one character to another.


  • (1) Amos, B., Ludwiczuk, B., Satyanarayanan, M.: Openface: A general-purpose face recognition library with mobile applications (2016)
  • (2) Augereau, O., Iwata, M., Kise, K.: A survey of comics research in computer science. Journal of Imaging 4(7) (2018)
  • (3) Augereau, O., Matsubara, M., Kise, K.: Comic visualization on smartphones based on eye tracking. In: Proceedings of the 1st International Workshop on coMics ANalysis, Processing and Understanding, p. 4. ACM (2016)
  • (4) Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: International Conference on Computer Vision, vol. 1, p. 8 (2017)
  • (5) Chu, W.T., Li, W.W.: Manga facenet: Face detection in manga based on deep neural network. In: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 412–415. ACM (2017)
  • (6) Chu, W.T., Li, W.W.: Manga facenet: Face detection in manga based on deep neural network (2017). URL https://www.cs.ccu.edu.tw/~wtchu/projects/MangaFace/
  • (7) Daiku, Y., Augereau, O., Iwata, M., Kise, K.: Comic story analysis based on genre classification. In: Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 3, pp. 60–65. IEEE (2017)
  • (8) Gupta, T., Schwenk, D., Farhadi, A., Hoiem, D., Kembhavi, A.: Imagine this! scripts to compositions to videos. arXiv preprint arXiv:1804.03608 (2018)
  • (9) Inoue, N., Furuta, R., Yamasaki, T., Aizawa, K.: Cross-domain weakly-supervised object detection through progressive domain adaptation. arXiv preprint arXiv:1803.11365 (2018)
  • (10) Jha, S., Agarwal, N., Agarwal, S.: Towards improved cartoon face detection and recognition systems. arXiv preprint arXiv:1804.01753 (2018)
  • (11) Kasinski, A., Florek, A., Schmidt, A.: The put face database. Image Processing and Communications 13(3-4), 59–64 (2008)
  • (12) Kataoka, Y., Matsubara, T., Uehara, K.: Automatic manga colorization with color style by generative adversarial nets. In: Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2017 18th IEEE/ACIS International Conference on, pp. 495–499. IEEE (2017)
  • (13) Kowalski, M., Naruniec, J., Trzcinski, T.: Deep alignment network: A convolutional neural network for robust face alignment. In: Proceedings of the International Conference on Computer Vision & Pattern Recognition (CVPRW), Faces-in-the-wild Workshop/Challenge, vol. 3, p. 6 (2017)

  • (14) Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: Interactive facial feature localization. In: European Conference on Computer Vision, pp. 679–692. Springer (2012)
  • (15) Lv, J., Shao, X., Xing, J., Cheng, C., Zhou, X.: A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • (16) Matsui, Y., Ito, K., Aramaki, Y., Fujimoto, A., Ogawa, T., Yamasaki, T., Aizawa, K.: Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications 76(20), 21811–21838 (2017)
  • (17) Rashid, M., Gu, X., Lee, Y.J.: Interspecies knowledge transfer for facial keypoint detection. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2 (2017)
  • (18) Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at 3000 fps via regressing local binary features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1685–1692 (2014)
  • (19) Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: Database and results. Image and Vision Computing 47, 3–18 (2016)
  • (20) Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, pp. 397–403. IEEE (2013)
  • (21) Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: A semi-automatic methodology for facial landmark annotation. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pp. 896–903. IEEE (2013)
  • (22) Wu, Y., Gou, C., Ji, Q.: Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion. arXiv preprint arXiv:1709.08130 (2017)
  • (23) Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 532–539. IEEE (2013)
  • (24) Yang, J., Deng, J., Zhang, K., Liu, Q.: Facial shape tracking via spatio-temporal cascade shape regression. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 41–49 (2015)
  • (25) Zafeiriou, S., Trigeorgis, G., Chrysos, G., Deng, J., Shen, J.: The menpo facial landmark localisation challenge: A step towards the solution. In: Proc. IEEE Conf. Comput. Vision Pattern Recognit. Workshops, pp. 2116–2125 (2017)
  • (26) Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)