DRAW: Deep networks for Recognizing styles of Artists Who illustrate children's books

04/10/2017 ∙ by Samet Hicsonmez, et al. ∙ University of Bonn ∙ Hacettepe University ∙ Middle East Technical University

This paper is motivated by a young boy's ability to recognize an illustrator's style in a totally different context. In the book "We are All Born Free" [1], composed of selected rights from the Universal Declaration of Human Rights interpreted by different illustrators, the boy was surprised to see a picture similar to the ones in the "Winnie the Witch" series drawn by Korky Paul (Figure 1). The style was noticeable in other characters by the same illustrator in different books as well. The child could just as easily spot the style of other illustrators such as Axel Scheffler and Debi Gliori. The boy's enthusiasm led us to explore the capability of machines to recognize the style of illustrators. We collected pages from children's books to construct a new illustrations dataset consisting of about 6500 pages from 24 artists. We exploited deep networks for categorizing illustrators, and with around 94% accuracy our method outperformed the traditional methods by more than 10%. Going beyond categorization, we explored transferring style. The classification performance on the transferred images shows the ability of our system to capture the style. Furthermore, we discovered representative illustrations and discriminative stylistic elements.




1. Introduction

Illustrations help convey a message clearly and are widely used in printed and visual media. Yet the role of illustrations in children’s books goes beyond that of a simple picture accompanying the text. For children who cannot yet read, it is the illustrations that make the story understandable. These images help them identify the characters, scenes and events in the books, and prepare them for the fascinating world of words when they start to read by themselves (see http://www.maaillustrations.com/blog/article/the-role-of-illustration-in-childrens-book/).

This fact inspires many artists to draw illustrations for children’s books. On the other hand, understanding, predicting, and analyzing people’s reading tastes is a challenging problem, since taste can depend on an individual’s philosophical, psychological and political background. When it comes to children’s books, especially from a child’s perspective, the choice mostly depends on the illustrations. Discovering this taste requires understanding the style characteristics of the illustrators. Motivated by this observation, in this study we aim to understand the style of artists who illustrate children’s books.

Automatic understanding of artistic images could assist in organizing large collections and could be useful for art recommendation systems. However, it is a difficult task, mostly due to the varying stylistic behavior of different artists. With the rise of deep architectures in particular, there has been growing interest in this relatively unexplored area.

There have been recent efforts to understand the aesthetic perception of art works, such as investigating the potential of a computer to make aesthetic judgments (Spratt and Elgammal, 2014), quantifying creativity (Elgammal and Saleh, 2015), aesthetic analysis of images by feature discovery (Campbell et al., 2015), and analyzing artistic influence by comparing art works to others (Saleh et al., 2014). Even though classifying art is qualitative (DiMaggio, 1987), classification of art works has also emerged as another line of work. Bar et al. (Bar et al., 2014) worked on the classification of artistic styles, showing that deep neural network features are perceptive in identifying artistic styles in paintings. Li et al. (Li and Chen, 2009) worked on automatically classifying paintings as aesthetic or not. Lyu et al. (Lyu et al., 2004) focused on painter authentication. Identification of painters has also been studied based on wavelet analysis of brush strokes in paintings (Johnson et al., 2008; Li and Wang, 2004). The works in (Tan et al., 2016; Saleh and Elgammal, 2015) aimed to classify fine-art paintings using CNNs on the “Wikiart paintings” data set (Karayev et al., 2013). In (Tan et al., 2016), experiments were conducted with a proposed CNN that is very similar to AlexNet (Krizhevsky et al., 2012). The best result was achieved when the network was first trained on the ImageNet dataset (Russakovsky et al., 2015) and transfer learning was then applied.

Inspired by the capability of humans to recognize objects regardless of whether they appear in art or photography, Cai et al. worked on automatically identifying objects across domains (Cai et al., 2015). In (Crowley and Zisserman, 2014b, a), the authors focus on recognizing objects in paintings using models learned from natural images.

Collecting and labeling a dataset of artistic images is also a challenging task. Mensink et al. (Mensink and Van Gemert, 2014) introduced a diverse dataset of over 1 million artworks, 700,000 of which are prints, to support and evaluate art classification. Carneiro et al. (Carneiro et al., 2012) presented a database of monochromatic artistic images. Crowley et al. (Crowley and Zisserman, 2014b, a) annotated a subset of the publicly available ’Your Paintings’ data set (you, 2012) with ten category labels from the PASCAL VOC data set (Everingham et al., 2011). Khan et al. (Khan et al., 2014) presented a dataset which contains 4266 paintings from 91 different painters. Karayev et al. (Karayev et al., 2013) presented two novel data sets: the first contains 80K Flickr photographs annotated with 20 style labels such as vintage, romantic and HDR, and the second consists of 85K art paintings from 25 art styles such as Baroque, Rococo and Cubism.

Some works concentrated on transferring artistic styles from style images such as paintings to content images such as selfies (Gatys et al., 2016; Johnson et al., 2016; Dumoulin et al., 2017). In (Gatys et al., 2015), the artistic style transfer pipeline minimizes the feature reconstruction loss and the style reconstruction loss at the same time, using features from a pre-trained CNN model with forward and backward passes. Since the backward passes increase computation time, (Johnson et al., 2016) proposed a similar approach that uses forward passes to minimize both the feature and style reconstruction losses. Kyprianidis et al. (Kyprianidis et al., 2013) presented a survey of state-of-the-art techniques for transforming images and videos into artistically stylized renderings.

The studies that try to identify the style or genre for art images could be considered similar to ours (Saleh and Elgammal, 2015; Karayev et al., 2013; Matsuo and Yanai, 2016; Chu and Wu, 2016). However, they define style as a more generic term shared by several artists. The work in (Thomas and Kovashka, 2015) that identifies the authorship of photographs, that is the photographer, is the most similar one to ours. Deep networks are also utilized in that study for qualitative evaluations.

In the illustrator identification domain, to our knowledge the only work is (Sener et al., 2012), which tried to identify only four illustrators on a very small data set. They utilized several low-level descriptors such as HOG, GIST and SIFT, and used a bag-of-words model to classify illustrations. In this work, we collected a larger data set and used their results as our baselines.

In some recent studies, illustrations are considered in the form of clip art. In (Garces et al., 2014), a style similarity metric is designed by combining color, shading, texture and stroke features with relative comparisons collected via AMT, and this work was leveraged in (Garces et al., 2016) to obtain aesthetically coherent clusters for visualizations of clip art datasets. In (Furuya et al., 2015), an unsupervised approach is proposed for stylistic comparison of illustrations, again in the form of clip art. The illustrations that we consider are specific to the artistic drawings in children’s books, and they are more challenging than clip art illustrations.

Our contributions: We make several contributions, described in detail in the following sections: (1) We attack the problem of classifying the styles of illustrators, which is a more challenging task than classifying content. (2) We have constructed a new dataset of illustrations. To our knowledge, this is the first comprehensive dataset specific to artistic illustrations from books. (3) We focus on illustrations in children’s books, which have distinct characteristics in the sense that imagination can lead to extreme characters and settings that are rarely found in photographs and paintings. (4) We explored different deep networks and compared them with low-level features. (5) We tested three different strategies for categorisation: novel instance recognition from seen books, novel instance recognition from unseen books, and book recognition. (6) We exploited the style transfer method and showed qualitative results for transferring styles from illustrators to cartoon images and natural photographs, as well as to the illustrations of other illustrators. (7) Further, we provided quantitative results for illustrator-to-illustrator transfer utilizing the style categorization. (8) We compared different methods and features for choosing representative illustrations and discriminative patches.

Id  Illustrator          Books  Images
01  Axel Scheffler          13     532
02  Ayse Inal                4     120
03  Beatrix Potter           9     269
04  Behic Ak                12     385
05  Bill Peet               11     513
06  David Mckee             12     199
07  Debi Gliori             12     275
08  Dr. Seuss               15     455
09  Eric Carle              14     304
10  Eric Hill                9     148
11  Feridun Oral             5     140
12  Kevin Henkes             3      86
13  Korky Paul              15     427
14  Leo Lionni              10     314
15  Marc Brown              17     360
16  Maurice Sendak           7     263
17  Mo Willems               6     268
18  Mustafa Delioglu         9     160
19  Patricia Polacco         9     284
20  Ralf Butschkow           6     152
21  Rosa Curto               8     288
22  Serap Deliorman          5     158
23  Stephen Cartwright       5     179
24  Tony Ross                7     189
Total number of illustrators: 24, books: 223, illustrations: 6468

Table 1. Summary statistics of our Illustration dataset.
Figure 2. Example illustrations from our data set. Three consecutive illustrations in a column correspond to a single illustrator. The order of illustrators is the same as in Table 1. Note that the styles are distinctive for each illustrator. However, due to the variety within an individual’s style, some instances are difficult to categorise correctly.

2. Dataset

We constructed a new data set consisting of 6468 distinct illustrations from 24 different illustrators. Focusing on popular children’s books, we mostly selected illustrators who created more than a single basic character. The pages were collected either by scanning printed books directly or from publicly available e-books and read-aloud videos on YouTube. Table 1 summarizes our dataset and Figure 2 shows some example illustrations.

In building the dataset, we were inspired by (Sener et al., 2012), in which a dataset consisting of 248 illustrations by Axel Scheffler, 243 illustrations by Debi Gliori, 249 illustrations by Korky Paul and 234 illustrations by Dr. Seuss was generated. We almost doubled the examples for three of these illustrators, and included 20 other illustrators. In its current form the dataset is unique: although large scale datasets exist for paintings (Karayev et al., 2013; Crowley and Zisserman, 2014b; Khan et al., 2014), to our knowledge this is the first comprehensive dataset for illustrations.

Note that in the painting datasets a variety of artists follow the same artistic style, and thus those datasets are deep in the sense that the number of examples per style is large. In contrast, each illustrator has only a limited set of books, and therefore the number of examples per category cannot reach the numbers in painting datasets. Similarly, the number of categories can only be extended within some limits when we require each illustrator to have more than a single specific character or book series. We continue to extend the dataset and will make it publicly available within the copyright limitations.

3. Discovering style of illustrators

In the following, we first describe the details of our method for categorizing the style of illustrators using deep networks. Then, we discuss approaches to transfer style and to discover representative elements.

3.1. Deep Learning For Style Recognition

Instead of creating a model from scratch, we used three well-known CNN models in training: AlexNet (Krizhevsky et al., 2012), VGG-19 (Simonyan and Zisserman, 2014) and GoogLeNet (Szegedy et al., 2015). We used the Caffe framework (Jia et al., 2014) to train deep networks on a Tesla K40 12GB GPU. We employed both end-to-end training and transfer learning. To train an end-to-end model, we enlarged our comparably small data set by applying data augmentation.

For small data sets like ours, it is neither practical nor meaningful to fully train very deep networks. Thus, we fully trained only AlexNet, as it is relatively shallow. We first subtracted the per-pixel mean RGB values computed over our illustrations dataset to obtain centered raw RGB values. We augmented our training and validation data using only horizontal reflections to reduce overfitting. The batch sizes are 128 and 40 for training and validation, respectively. The base learning rate is set to 0.01 with a momentum of 0.9, and the learning rate is decreased by a factor of 10 after every 40K iterations.
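The learning-rate schedule described above can be sketched independently of the training framework. The snippet below is only an illustration: the function name and its parameters are our own (hypothetical) choices, and it assumes that the decay is a plain step policy, i.e. the rate is multiplied by 0.1 once every 40K iterations.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.1, step_size=40_000):
    """Learning rate at a given SGD iteration under a step-decay policy.

    Assumption: the rate is multiplied by `gamma` once every `step_size`
    iterations, matching "decreased by a factor of 10 after every 40K
    iterations". The name and signature are illustrative only.
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    return base_lr * gamma ** (iteration // step_size)


if __name__ == "__main__":
    # Print the rate just before and just after each decay boundary.
    for it in (0, 39_999, 40_000, 79_999, 80_000):
        print(it, step_learning_rate(it))
```

With these defaults the rate stays at the base value for the first 40K iterations and drops by one order of magnitude at each subsequent boundary.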

Since our dataset is comparably small, we alternatively applied transfer learning. For this purpose, we used VGG-19, AlexNet and GoogLeNet models pre-trained on the large scale ImageNet dataset. Our hyperparameters for fine tuning AlexNet and VGG-19 are nearly the same, except for the learning rate and batch sizes. Due to memory constraints, we could train VGG-19 only with a training batch size of 32. We selected the learning rate accordingly and set it to 0.0004. The base learning rate for AlexNet is 0.0001, and all other SGD parameters are the same as in end-to-end training. To train GoogLeNet we used the quick solver (qui, 2014) properties with an initial learning rate of 0.01.

3.2. Style transfer

Inspired by the recent work on transferring the artistic style of paintings (Gatys et al., 2016), we transfer the style from one illustration to another. Besides showing the ability of style generation, this task is also important for understanding the capability of deep models to capture style separately from content.

The style transfer model (Gatys et al., 2016) combines the appearance of a style image, e.g. an artwork, with the content of another image, e.g. an arbitrary photograph, by minimizing a combined content and style loss. In our case, the style is learned from an illustration of a particular illustrator and transferred to another image. The target image could be a cartoon, a natural photograph, or an illustration from another artist. We expect the resulting image to contain the content of the target image drawn in the style of the source illustration.
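To make the combined loss concrete, the following is a minimal sketch of the content and style terms as formulated by Gatys et al.: the content term is a squared difference between feature maps, and the style term is a squared difference between Gram matrices of feature maps. It is a simplification under stated assumptions: it uses a single layer, represents feature maps as plain Python lists (one flat list per channel), and the weights `alpha` and `beta` are placeholder values, not the settings used in our experiments.

```python
def gram_matrix(features):
    """Gram matrix of a feature map given as C channels of N activations each."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def content_loss(generated, content):
    """Half the squared difference between two feature maps of equal shape."""
    return 0.5 * sum((g - c) ** 2
                     for g_ch, c_ch in zip(generated, content)
                     for g, c in zip(g_ch, c_ch))


def style_loss(generated, style):
    """Squared Gram-matrix difference, normalised by 4 * C^2 * N^2."""
    channels, positions = len(generated), len(generated[0])
    g_gen, g_sty = gram_matrix(generated), gram_matrix(style)
    squared = sum((a - b) ** 2
                  for row_a, row_b in zip(g_gen, g_sty)
                  for a, b in zip(row_a, row_b))
    return squared / (4.0 * channels ** 2 * positions ** 2)


def total_loss(generated, content, style, alpha=1.0, beta=1000.0):
    """Weighted sum of the two terms; alpha and beta are placeholder weights."""
    return alpha * content_loss(generated, content) + beta * style_loss(generated, style)


if __name__ == "__main__":
    gen = [[1.0, 2.0], [3.0, 4.0]]
    print(gram_matrix(gen))
    print(total_loss(gen, content=gen, style=gen))
```

In the full method the losses are summed over several network layers and the generated image is updated by gradient descent; the sketch only evaluates the objective for given feature maps.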

However, judging the quality of the resulting images is difficult and subjective. In this study, focusing on style transfer from one illustration to another, we propose to compare the style of the resulting illustration with the original style from the categorisation perspective. Our intuition is that if we use the resulting images as test instances for our deep networks and they are classified as the illustrator of the style image, then we can infer that the deep models capture style.
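The evaluation idea above reduces to a simple count. The sketch below assumes a hypothetical `classify` callable standing in for the trained network, and a list of pairs holding each stylized image together with the illustrator of its style image; both names are ours and serve only as an illustration.

```python
def style_transfer_accuracy(stylized_pairs, classify):
    """Fraction of stylized images recognised as the illustrator of the style image.

    stylized_pairs: list of (stylized_image, style_illustrator) tuples.
    classify: hypothetical callable mapping an image to a predicted illustrator.
    """
    if not stylized_pairs:
        raise ValueError("at least one stylized image is required")
    hits = sum(1 for image, style_illustrator in stylized_pairs
               if classify(image) == style_illustrator)
    return hits / len(stylized_pairs)


if __name__ == "__main__":
    # Toy stand-in for a classifier: a fixed lookup table of predictions.
    fake_predictions = {"img_a": "Korky Paul", "img_b": "Dr. Seuss", "img_c": "Tony Ross"}
    pairs = [("img_a", "Korky Paul"), ("img_b", "Dr. Seuss"), ("img_c", "Debi Gliori")]
    print(style_transfer_accuracy(pairs, fake_predictions.get))
```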

3.3. Discovering representatives

Here we try to understand the style of illustrators in terms of discriminative and representative examples. We utilised two methods for this purpose. The first method (Doersch et al., 2012) was initially proposed for discovering architectural elements of different cities. It takes a positive set of images from which we want to extract discriminative patches, and a global negative set. It uses HOG features (Dalal and Triggs, 2005) to represent the images. We used this method both to find representative illustrations for different artists and to discover the discriminative parts in the illustrations. However, since this algorithm takes days to complete on a powerful laptop, we were able to run it for only a few illustrators.

The second method that we utilised (Golge and Duygulu-Sahin, 2015) focuses on eliminating the outliers from a candidate set of positive examples to capture the representative elements in an iterative fashion. The method was proposed to recognise faces from noisy, weakly labeled images collected from the web. Since the method is flexible in the choice of features, we exploited it with HOG (Dalal and Triggs, 2005), color dense SIFT (Lowe, 2004), and VGG (Simonyan and Zisserman, 2014) features.

4. Experiments

In this section, we first present detailed experimental evaluations to recognize style of illustrators using deep networks. We also provide experimental results of conventional classification methods as a baseline to compare with deep architectures. Then, we present our results on style transfer and representative element discovery.

4.1. Style recognition with deep networks

We used two different settings for categorisation. In the first setting, we treated each page as an independent instance and constructed training, validation and test sets by randomly selecting instances from the entire collection. In the second setting, we tested a more challenging case, and removed some of the books entirely from the training set. Results of both settings will be discussed in the following.

To analyze and understand the results further, we exploited the method of (Yosinski et al., 2015). Figure 4 shows per-unit visualizations from different layers of the VGG-19 network. In every image, the first column corresponds to synthetic images, obtained by regularized optimization, that cause high activation, and the second column shows the crops from our training dataset that cause the highest activation for that unit. As shown, our network is able to find parts and objects such as eyes, fish, car/wheel, house, plant, people and clothes, and even to discriminate poses such as side views of humans and animals, as well as hair, fur or ears.

Instance categorisation: In this setting, our goal is to classify illustrations on a random split of the data. Here, we disregard the books: we pool all the illustrations from all the books of an illustrator, and then construct training, validation and test sets by randomly selecting a fixed percentage of the instances.

For this group of experiments we utilized several deep networks, including end-to-end training of a network and fine tuning. Table 2 summarizes the results in terms of the network architecture used, the training type (full training or fine tuning), and whether data augmentation is used. For all experiments on deep networks, we used 70% of the data as the training set and 10% as the validation set. The remaining 20% is used for testing.
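The 70/10/20 protocol can be sketched as follows. This is an illustration under assumptions: we take the split to be drawn separately for each illustrator (the text above pools pages per illustrator before selecting a fixed percentage), and the function name, data layout and seed are hypothetical.

```python
import random


def split_per_illustrator(pages_by_illustrator, train_frac=0.7, val_frac=0.1, seed=0):
    """Randomly split each illustrator's pooled pages into train/val/test.

    pages_by_illustrator: dict mapping an illustrator name to a list of page ids.
    Returns a dict with "train", "val" and "test" lists of (page, illustrator).
    Whatever remains after the train and validation shares goes to the test set.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for illustrator in sorted(pages_by_illustrator):
        pages = list(pages_by_illustrator[illustrator])
        rng.shuffle(pages)
        n_train = round(train_frac * len(pages))
        n_val = round(val_frac * len(pages))
        splits["train"] += [(p, illustrator) for p in pages[:n_train]]
        splits["val"] += [(p, illustrator) for p in pages[n_train:n_train + n_val]]
        splits["test"] += [(p, illustrator) for p in pages[n_train + n_val:]]
    return splits


if __name__ == "__main__":
    toy = {"A": [f"a{i}" for i in range(10)], "B": [f"b{i}" for i in range(20)]}
    result = split_per_illustrator(toy)
    print({name: len(items) for name, items in result.items()})
```

Fixing the seed keeps the split reproducible, which matters when several networks are compared on the same test pages.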

Method            Augmentation  Accuracy (%)
AlexNet           Yes           68.75
VGG-19            No            93.47
VGG-19            Yes           93.24
GoogLeNet         No            94.07
Dense SIFT        -             82.71
Color Dense SIFT  -             84.35
Table 2. Illustration classification experiments. Fine tuning is applied for the VGG-19 and GoogLeNet networks, while full training is performed for AlexNet.

As expected, fully training a deep network gives lower accuracy than fine tuning. Thus, in the next group of experiments we focused only on fine tuning. Also note that using augmented data for fine tuning a model does not improve the accuracy (93.24% with augmentation versus 93.47% without for VGG-19). Thus, we preferred not to use augmented data while fine tuning a model. GoogLeNet has far fewer parameters and a lower error rate than VGG-19 on the ImageNet data set. Our results are in line with this observation: GoogLeNet beats VGG-19 by a very small margin. Since GoogLeNet has the best performance, in the following experiments we report only the GoogLeNet results. Figure 3 and Table 3 show the confusion matrix and the class-based F1 and accuracy results, respectively.

Id  F1 Score  Accuracy
01  0.96      0.94
02  0.96      0.91
03  0.93      0.91
04  0.94      0.93
05  0.98      0.98
06  0.94      0.92
07  0.74      0.69
08  0.98      0.99
09  0.92      0.96
10  0.95      0.90
11  0.84      0.85
12  1.0       1.0
13  0.95      1.0
14  0.99      0.98
15  0.97      0.96
16  0.99      0.98
17  0.98      1.0
18  0.88      0.78
19  0.87      0.91
20  0.93      0.90
21  0.92      0.91
22  0.90      0.94
23  0.95      0.97
24  0.97      1.0
Table 3. Classification results using GoogLeNet fine tuning.
Figure 3. Confusion Matrix for GoogleNet Finetune Test.

Figure 4. Visualizations from different layers. The network is able to capture parts and objects shared in drawings of different illustrators: eyes, wheels, buildings, pointy tree like structures, hairy or furry heads, big ears, human upper body, etc.
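Class-based scores such as those in Table 3 can be derived from a confusion matrix like the one in Figure 3. The sketch below is illustrative only: it assumes rows are true classes and columns are predicted classes, and it reports per-class recall alongside precision and F1. Whether the "Accuracy" column of Table 3 corresponds to per-class recall is our assumption; the function name is hypothetical.

```python
def per_class_scores(confusion):
    """Per-class precision, recall and F1 from a square confusion matrix.

    confusion[i][j] is the number of test pages whose true class is i and
    whose predicted class is j (an assumed layout).
    """
    n = len(confusion)
    scores = []
    for k in range(n):
        true_pos = confusion[k][k]
        actual = sum(confusion[k])                          # pages truly in class k
        predicted = sum(confusion[i][k] for i in range(n))  # pages predicted as k
        recall = true_pos / actual if actual else 0.0
        precision = true_pos / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append({"precision": precision, "recall": recall, "f1": f1})
    return scores


if __name__ == "__main__":
    toy = [[8, 2],
           [1, 9]]
    for class_id, s in enumerate(per_class_scores(toy)):
        print(class_id, s)
```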

Book based instance categorisation: Since illustrators are likely to have varying styles in different books, in this setup we attack the more challenging problem of recognizing the style on novel books. Instead of randomly splitting the illustrations of each illustrator, we split our data by book into training/validation and test sets. Thus, training and test sets do not share illustrations from the same book. Some illustrators have fewer books than others, but to measure the accuracy we make sure that every illustrator has at least one book in the test set. Note that this setting is similar to recognizing unseen categories, especially in the case of the domain transfer problem. Leaving out some books means having unseen characters and content. Therefore, our recognition performance in this setting demonstrates the capability of our method to recognize the style rather than the specific characters. As expected, the results are lower than those of instance recognition (see Table 4).
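A book-level split with the constraint described above can be sketched as follows. The fraction of held-out books (`test_frac`) is a hypothetical parameter, since the exact proportion is not stated here; the only properties the sketch enforces are that no book is shared between the two sets and that every illustrator keeps at least one book on each side.

```python
import random


def split_by_book(books_by_illustrator, test_frac=0.25, seed=0):
    """Split whole books into train and test sets, per illustrator.

    books_by_illustrator: dict mapping an illustrator to a list of book ids.
    Returns (train, test) as sets of (illustrator, book) pairs.
    test_frac is an assumed value, not taken from the experiments.
    """
    rng = random.Random(seed)
    train, test = set(), set()
    for illustrator in sorted(books_by_illustrator):
        books = sorted(books_by_illustrator[illustrator])
        if len(books) < 2:
            raise ValueError(f"{illustrator} needs at least two books")
        rng.shuffle(books)
        # Hold out at least one book, but always leave one for training.
        n_test = max(1, min(len(books) - 1, round(test_frac * len(books))))
        test.update((illustrator, b) for b in books[:n_test])
        train.update((illustrator, b) for b in books[n_test:])
    return train, test


if __name__ == "__main__":
    toy = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}
    train_books, test_books = split_by_book(toy)
    print(sorted(train_books), sorted(test_books))
```

All pages of a held-out book then go to the test set, so the characters and scenes in that book are never seen during training.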

Book categorisation: We further used this network to predict the illustrator of each book. Note that in the previous settings our goal was to predict the illustrator of a single page. To predict the illustrator of a book, we used majority voting and selected the illustrator with the largest number of pages assigned. We evaluated the performance of book categorisation on 60 different illustration books using the results of the VGG-19 model, and obtained 90% accuracy in predicting the illustrator of a given book. Table 4 presents the performance on book recognition.

Network           Book based instance categorisation (%)  Book categorisation (%)
VGG-19            78.96                                   90.00
GoogLeNet         79.27                                   88.33
Dense SIFT        69.34                                   -
Color Dense SIFT  70.00                                   -
Table 4. Experiments on unseen books.
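The majority voting used for book categorisation can be sketched in a few lines. The function names are illustrative, and the tie-breaking rule (the label encountered first wins) is our assumption, since it is not specified above.

```python
from collections import Counter


def predict_book_illustrator(page_predictions):
    """Majority vote over the per-page predictions of one book.

    Ties are broken in favour of the label that appears first among the
    tied ones (an assumption; the tie rule is not specified in the text).
    """
    if not page_predictions:
        raise ValueError("a book needs at least one page prediction")
    return Counter(page_predictions).most_common(1)[0][0]


def book_accuracy(books):
    """books: list of (true_illustrator, page_predictions) tuples."""
    correct = sum(1 for truth, pages in books
                  if predict_book_illustrator(pages) == truth)
    return correct / len(books)


if __name__ == "__main__":
    pages = ["Korky Paul", "Tony Ross", "Korky Paul", "Korky Paul", "Dr. Seuss"]
    print(predict_book_illustrator(pages))
```

Voting explains why book-level accuracy in Table 4 is higher than page-level accuracy: a book is labelled correctly as long as the correct illustrator receives the most votes, even if many individual pages are misclassified.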

4.2. Style recognition with conventional methods

As a baseline, we utilized the conventional feature extraction methods that were shown to have the highest accuracies in (Sener et al., 2012). We extracted Dense SIFT (Lowe, 2004) and Color Dense SIFT (Lowe, 2004) features from every illustration and then generated a codebook for the bag-of-words representation (Sivic et al., 2005) using k-means clustering. We used Support Vector Machines to train our model; in particular, the LIBSVM library (Chang and Lin, 2011) is used for SVM classification. We use a one-versus-all approach for training: to prepare the training set for a class, we take the negative samples from all other classes. A test example is fed into all of the classifiers and is assigned to the class with the highest confidence value. Half of the data set is used to train the SVMs, and the rest is used for testing the models. We observe that Hellinger’s kernel boosts the performance by almost 20% over other kernels. As seen in Table 2 and Table 4, the results are much lower than those of the deep network architectures.
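The building blocks of this baseline can be sketched without any SVM library. The snippet below illustrates three pieces under stated assumptions: bag-of-words quantisation against a given codebook, the Hellinger kernel computed as the inner product of square-rooted, L1-normalised histograms, and the one-versus-all decision rule. It does not call LIBSVM; all function names are hypothetical, and the per-class confidences are assumed to come from already trained classifiers.

```python
import math


def bow_histogram(descriptors, codebook):
    """Count, for each codeword, the descriptors whose nearest codeword it is."""
    hist = [0] * len(codebook)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, word)) for word in codebook]
        hist[dists.index(min(dists))] += 1
    return hist


def hellinger_map(hist):
    """L1-normalise a histogram and take element-wise square roots."""
    total = sum(hist)
    if total == 0:
        return [0.0] * len(hist)
    return [math.sqrt(h / total) for h in hist]


def hellinger_kernel(hist_a, hist_b):
    """Hellinger kernel: sum over bins of sqrt(a_i * b_i) on normalised histograms."""
    return sum(a * b for a, b in zip(hellinger_map(hist_a), hellinger_map(hist_b)))


def one_vs_all_predict(confidences):
    """Pick the class whose binary classifier reports the highest confidence."""
    return max(confidences, key=confidences.get)


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    hist = bow_histogram([[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]], codebook)
    print(hist, hellinger_kernel(hist, hist))
    print(one_vs_all_predict({"Korky Paul": 0.7, "Dr. Seuss": -0.2}))
```

Because the kernel is a plain inner product after the square-root mapping, the mapped histograms can also be fed to a linear classifier, which is a common way to apply this kernel in practice.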

4.3. Style Transfer on Illustration Dataset

In the style transfer experiments, we first selected a simple content image (a cartoon image or a natural photograph) gathered from the web and unrelated to our data set. Then, we randomly chose a group of illustrations from different illustrators as style images. In our second experiment, we made the problem harder and selected an illustration from our data set as the content image. In this setting, the style image is an illustration from our data set, and the content image is again an illustration but belongs to a different illustrator. We performed style transfer for each pair of style and content images, and measured the recognition performance of our deep model on the resulting images. We use the fine-tuned GoogLeNet in all style transfer experiments.

Figure 5 illustrates the style transfer results for the given style and content images. As can be seen, our model mostly succeeds in capturing the styles, except for ’Debi Gliori’ on both content images; she also has the worst classification performance in the previous experiments, due to large variations in her style.

Figure 5. Example results for style transfer. First column shows selected style images. 2nd, 4th and 6th columns present content images: a simple cartoon, an illustration from a different illustrator and a natural image respectively. 3rd, 5th and 7th columns belong to resulting images. The style images are from Marc Brown, Maurice Sendak, Korky Paul, Dr. Seuss, Debi Gliori and content images are from Ayse Inal, Ralf Butschkow, Rosa Curto, Leo Lionni, Behic Ak in the given order. Red boxes show the failure cases.
Figure 6. First 20 representative instances obtained by (Doersch et al., 2012) (top-left), and by the method of (Golge and Duygulu-Sahin, 2015) using HOG (top-right), color dense SIFT (bottom-left), and VGG19 fine tuned (bottom-right) features.
Figure 7. Representative instances for some other illustrators using the method in (Golge and Duygulu-Sahin, 2015) by VGG19 fine tuned features.
Figure 8. Discriminative patches from Korky Paul obtained by (Doersch et al., 2012). The first four rows correspond to discriminative stylistic parts seen in his illustrations. Note that the writing style is also captured as discriminative. However, as a failure case, the repeated images on the back pages of all books were also selected as discriminative, as shown in the last row.

4.4. Representative and discriminative elements

First, we aimed to find representative illustrations of each illustrator. As depicted in Figure 6, we compared the method in (Doersch et al., 2012) with the method in (Golge and Duygulu-Sahin, 2015), first using HOG features in both methods. Then, we utilised color dense SIFT and fine-tuned VGG19 features with (Golge and Duygulu-Sahin, 2015) as well. Note that, since (Doersch et al., 2012) produces patches while (Golge and Duygulu-Sahin, 2015) gives images, the only way to compare the results of the two algorithms was to find the images which contain most of the extracted patches. While (Doersch et al., 2012) is likely to choose pages with text, treating the font style as discriminative, (Golge and Duygulu-Sahin, 2015) is more likely to capture the style emphasized by the chosen feature. VGG19 was able to capture the dark colors and the strokes better than the others. Since judging the visual examples is subjective, we used the categorisation performance to quantitatively compare the different methods for selecting representatives. For the first 50 images, (Doersch et al., 2012) resulted in 1 incorrect classification and the others reported 100% accuracy. For a better analysis, though, we should look at the full list and find better comparative measures. Figure 7 shows the representatives for some other illustrators using VGG19 features with (Golge and Duygulu-Sahin, 2015). As a final experiment, we explored the patches extracted by (Doersch et al., 2012) for the Korky Paul images, shown in Figure 8. As seen, we are able to select stylistic elements like the head of the witch, leafless trees, or furniture, and even the typeface as discriminative elements.

5. Conclusion

We attacked the problem of recognizing style of illustrators as a pioneering work in this area. On the new dataset constructed we reported qualitative and quantitative results for three different applications: illustrator recognition, style transfer and representative instance selection. In our future work, we plan to expand the dataset with more illustrators. Moreover, better metrics are required to evaluate the quality of style transfer and selection of representatives.