Comic books are an under-analyzed cultural resource that combine visual elements with textual components. Despite the common assumption that the target audience of comic books are children, their topics and story line are often elaborate, addressing a mature audience. The label “graphic novel” has been created to better market comic books targeting adult readers. Comics are visual narratives with a sequential structure, and their comprehension requires a specific visual literacy from the reader. The structure of visual narratives is complex, and can be divided into modality, grammar, and meaning, each of which can be described at the level of the individual unit as well as of a sequence . Most of the work from the document analysis perspective to date has focused on the graphic and morphological structure, and hence on the individual unit level in that scheme.
As comic book pages are complex documents, the analysis, detection and segmentation of their constituent parts is an interesting problem for automated document analysis. Typically a comic book consists of pages divided into panels according to a variety of layout schemes. Panels in turn contain pictorial content as well as several types of text containers such as captions and speech balloons, which usually differ in function. Panels may also contain onomatopoeia and diegetic text. The individual elements just mentioned can also be distributed across several panels or bleed into the void between panels called the “gutter”. Recent overviews of comics research can be found in Cohn , Augereau et al. , and Dunst et al. 
. Here we focus on one particular element of graphic structure, namely, speech balloons. Most of the text in comics is located in speech balloons and captions, thus locating these elements is a prerequisite for OCR. However, it also pays to classify these as different classes, because they serve different functions: In contrast to captions, which are are normally used for narrative purposes, speech balloons typically contain direct speech or thoughts of characters in the comic. Speech balloons usually consist of a carrier, “a symbolic device used to hold the text” (Cohn, 2013b), and a tail connecting the carrier to its root character from which the text emerges. Both tails and carriers come in a variety of shapes, outlines, and degrees of wiggliness. Here we describe a method for the automated detection and segmentation of speech balloons of various forms and shapes, including their tails.
Our method is based on a fully convolutional network , specifically on a modified U-Net architecture  in combination with a pre-trained VGG-16 encoder . We are thus using an encoder that was trained for classification of natural images, and an architecture which was originally developed for medical image segmentation. By showing that this combination achieves state-of-the art performance with illustrations, we also suggest that transfer might generalize to more or less related tasks. Automated detection of speech balloons and other typical elements of comic books is not only desirable for research purposes such as creating and analyzing larger datasets of annotated comic books, or as a first step in an optical character recognition (OCR) pipeline, but also has potential for applications such as preparing digitized comics for computerized presentation on handheld devices, or developing assistive technologies for people with low vision.
I-a Related work
Recently, deep convolutional neural networks (DCNNs) using learned feature hierarchies have been shown to perform better in many visual recognition and localization tasks than other algorithms that use carefully crafted engineered features. However, partly due to the lack of big corpora there are only very few publications tackling the task of analyzing comic books using deep neural networks. We will describe some of this recent work on DCNN-based object detection in comics below, but first summarize some of the older work based on engineered features, focusing on speech balloon detection.
Arai and Tolle 
were the first to attempt speech balloon detection. They limited themselves to simple layouts without overlap or bleeding, i.e., all balloons are fully contained within the panels. They first extracted panels using a connected-components approach. Some morphological operations are then applied to panels, followed by another connected component run, from which candidates are selected using some heuristics such as white pixel ratio or size and height relative to the panel dimensions. Some of these heuristics assume vertical text, so the applicability is limited to Manga.
Ho et al.  use a similar approach, but start by identifying candidate regions through their (white) color. Next they apply some size and shape criteria, then use connected components to locate text within the balloon, followed by a morphological dilatation operation, and another size threshold. The result is a bounding box of the text. Again, this approach only works with simple balloons.
Rigaud et al. , realizing the great variety and heterogeneity of balloons in the wild, which are often defined by irregular shapes or even illusory contours, developed a decidedly different approach based on active contours . Following edge detection by a Sobel operator, and text detection (and removal) as an important initialization step, Rigaud et al.  introduce new energy terms for the active contour model based on domain knowledge, such as strong edges, smooth contours, and the relative location of text. This is the first proposal that can delineate the full shape of a speech balloon and handle illusory or subjective contours.
Whereas this method requires the location of text, Rigaud et al. 
propose a method in which speech balloons are located first, and text is only used to give a confidence rating on whether a candidate region is indeed a speech balloon. This method first uses an adaptive binarization threshold, following by a connected components approach to identify (a) speech balloon candidates and (b) text within speech balloons. Some criteria on typical alignment and centering of text within a balloon are used to arrive at a confidence score. One drawback of this method in comparison toRigaud et al.  is that it only appears to work with closed balloons.
Let us now turn to DCNN approaches. Since we know of only one attempt at speech balloon segementation proper, we extend the scope a little. The general aim is to show that features which are typically learned on photographs of natural scenes can nevertheless be used to describe illustrations such as comics. More specific applications in object detection and segmentation are included.
Chu and Li 
used deep neural networks for face detection in Manga. Rigaud and colleagues, who had previously used several methods with hand crafted features to detect elements of comic book pages (see above), have switched to convolutional neural network features in more recent work[27, 28]. Nguyen et al.  describe a multi-task learning model for comic book image analysis derived from on Mask R-CNN  that defines the current state-of-the-art in detection and segmentation of several object categories such as panels, characters, and speech balloons. Ogawa et al.  used convolutional neural network features for object detection on the annotated Manga109 corpus.
Iyyer et al.  attempted to model a striking sequential aspect of comics, namely the inference required from the reader to bridge the gutter, i.e., to make sense of panel transitions and coherently integrate them into the mental model of the text. The multilevel hierarchical LSTM architecture they used to model sequential dependencies in text and image semantics is beyond the scope of the present paper, but in passing they provided an annotated dataset of comics in the public domain from the “Golden Age of Comics”. Specifically, they (a) annotated 500 randomly selected pages for training a Faster R-CNN 
to segment pages into about 1.2 million panels. Furthermore, they (b) annotated 1500 panels for textboxes for training another Faster R-CNN combined with ImageNet-pretrained VGG feature encoding to segment 2.5 million textboxes, and (c) extract text from these textboxes using Google’s Cloud Vision OCR. This is certainly a data set that deserves further exploration. For the current purposes, the main problem is that text elements are annotated using rectangular bounding boxes, and without differentiating between captions and speech balloons. Thus segmentations based on these annotations lack detail and specificity. Generally, however, their results definitely show that pretrained features obtained from training on ImageNet can very well be re-used for comics document image analysis tasks.
Laubrock and Dubray  have similarly shown that deep convolutional neural networks are well-suited to capture graphical aspects of comics books and can successfully be used in categorizing illustrator style, pacing the way for a visual stylometry. Laubrock et al.  show that eye-tracking data obtained from comics readers as a measure of visual attention and CNN predictions of empirical saliency based on Deep Gaze II  tend to focus on similar regions of comics pages. Regions containing text such as speech balloons or captions are focused most often. For the image part, empirical attention allocation appears to be object-based, with regions containing faces receiving a high proportion of fixations.
I-B The present approach
Building on these earlier successes, here we use a deep neural network to learn speech balloon localization and segmentation of annotated pages from the Graphic Narrative Corpus . We regarded speech balloon detection as a semantic segmentation problem, i.e., a pixel-wise classification task. By partitioning a scanned image into coherent parts we aimed to classify for each pixel whether it belongs to a speech balloon or not. We chose a fully convolutional encoder-decoder architecture based on U-Net , i.e., there are no densely connected layers in the model. The network can be divided into an encoding and a decoding branch, devoted to describing the image using a pre-trained network and semantically-guided re-mapping back to input space, respectively. The encoding branch uses a standard VGG-16 model  pre-trained on ImageNet 
that employs several hierarchical convolutional and max-pooling layers to extract features useful in describing the stimulus. It therefore encodes the image into more and more abstract representations guided by semantics. The higher the encoding level, the less fine-grained the spatial resolution and the more condensed the semantic information. We have previously shown that re-using weights obtained from pre-training for an object recognition task transfers well to graphic novels. We chose VGG for encoding because of the conceptual simplicity of its architecture that allows for easy upscaling, but note that other architectures that outperform VGG in image classification (such as Xception, ) can in principle be plugged in.
The decoding branch of the network performs upsampling and is trained to do the segmentation. The decoder thus semantically projects discriminative features learnt by the encoder onto the pixel space of the input image in order to get a dense classification. Intermediate representations from the encoding branch are copied to and combined with the decoding activations to achieve better localization and learning of features in the upsampling branch.
Our training material contains roughly 750 annotated pages of the Graphic Narrative Corpus , representing English-language graphic novels with a variety of styles. Parts of the GNC are annotated by human annotators with respect to several constituent parts such as panels, captions, onomatopoeia, or speech balloons. The material includes some non-standard examples of balloons (e.g., speech balloons outside of the panel or balloons looking like captions) but mainly consists of typical balloons from a large variety of 90 comic books.
Binary mask images were generated from ground truth corpus annotations of speech balloons, which are represented by lists of polygon vertices in an XML format. For the balloon segmentation task, we downscaled the corpus images to a fixed size of pixels in RGB.
Ii-B Model architecture
We used a fully convolutional approach predicting a pixel-based image segmentation. The network is based on the U-Net architecture and can be divided into an encoding and a decoding branch (see Figure 1).
The encoding branch uses the convolutional part of the VGG-16 model. Each image is downsampled subsequently to five encoding representations, each halving the spatial height and width of the previous representation. This is simply achieved by using the output of the five convolutional blocks of the VGG-16 model (Figure 1, left side).
The decoding branch decodes up to the original resolution. Each of the five upsampling steps (Figure 1
, right side) doubles the width and height of an encoding representation. Upsampling is performed by transposed convolutions. At each upsampling step, skip connections are used, meaning that copies of the encoding representation corresponding to the resulting width and height are additionally included by concatenation. A convolutional operation was performed on the concatenated representation. A sigmoid activation function as the last step of decoding results in a
representation of floating values in (0,1), which can be considered to give the probability of each pixel to belong to a speech balloon. The decoder thus outputs a pixel-based prediction about the location of the speech balloons, which was compared to binary mask images of our annotated speech balloons.
Ii-C Loss function
We used the sum of the pixel-wise binary cross entropy loss as a sum over each pixel in the set of pixels
and the Dice loss
of the model prediction and the binary image mask of the balloon annotations as loss function for training.
Ii-D Training and augmentation
We split our dataset into a training and validation set with the ratios 0.85 and 0.15 on the total set, respectively. This split was done randomly with a stratifying condition ensuring to split each of the included comics books with this ratio (some comic books were pooled due to the low amount of annotated images). First, all image values were linearly normalized to the interval [0,1]. Then, on the training set we applied image augmentation randomly using rotation of the hue channel of the HSV image (range 0 to 0.3), height and widths shifts (range: 0.2 of each dimension size) and flipping the image horizontally and vertically. We trained the model using an Adam optimizer (learning rate of , ,
) for a total of 500 epochs, with each epoch going through the full training set. In order to fight overfitting, weight decay (L2 regularization) with a parameter ofwas added to the final convolutional layer of each decoding block, which had an effect similar to dropout.
First, we report the results of five different splits of training and validation data. Due to the regularization term in addition to random fluctuations, results can vary from epoch to epoch; in order to obtain a more representative value we smoothed the values by computing the median of the metrics on the validation set for the last five epochs. The values for binary cross entropy loss, the Sørensen-Dice coefficient, precision and recall are reported in TableI.
To put our results into perspective, we compare the values with previous work by adapting a table from Nguyen et al. (2019). These results were obtained from predicting on the eBDtheque data set , which in its version 2 includes pixel-based ground truth. In order to facilitate comparison, we also computed predictions of our model (trained on the GNC) for the eBDtheque corpus. Table II lists the values for recall, precision, and score (we used the average over five different runs on different test sets). Note that we did not use the eBDtheque material for training, whereas the other models did. Therefore, in order to better describe our model performance, we additionally present results obtained for our own test set of pages from the GNC that were excluded from training (Table II, last row).
Our model generally transfers quite well to the new data set. One problem is that eBDtheque annotations do not distinguish between speech balloon and caption, whereas our model has learned to distinguish between them. We thus manually added this distinction to the ground truth. Another problem for our model is the inclusion of a few Manga images with Japanese script in the eBDtheque. We had previously observed that our model currently fails miserably to predict speech balloons in the Manga109 data set . Whereas we can currently only speculate that this may be due to the some sort of basic alphabetic character recognition and/or vertical orientation of the speech balloons, it gave us an a priori reason to exclude the 6 manga images from evaluation, which boosted the score quite a bit. Finally, we observed that out model did very well on the vast majority of images, but performed poorly on a few images with rather untypical speech balloons (like the very abstracted balloons in TRONDHEIM_LES_TROIS_CHEMINS__00x.jpg). The failure to capture these extreme examples is due to a relative lack of diversity of our training set. To get a more representative measure of the performance, we therefore additionally included the median per measure over images. In summary, our model outperforms previous approaches in recall and , and is at least competitive in precision.
Iii-a Qualitative results
Figures 2-5 show some qualitative results, with colours indicating confidence of speech balloon classification (values according to matplotlib’s  colormap hot, with black: 0, white: 1). Figures 2 and 3 are from the training set, whereas the remaining Figures are from the test set. Figure 2 illustrates that the model can (a) well approximate curved tails, and (b) has learned to differentiate between speech balloons and other regions containing text. Figure 3 demonstrates that the model can approximates illusory contours, i.e., boundaries of speech balloons that are not defined by physical lines, but just by imaginary continuoations of the gutter lines in this case. Classification confidence drops near those illusory contours. Figure 4 shows that good performance is also achieved on test set images. Figure 5 (upper part) illustrates that sometimes illusory contours are hard to detect for the model. Furthermore, a false alarm in the lower part of the figure is probably triggered by a shape that looks similar to a tail of a speech balloon.
We obtained state-of-the-art segmentation results for speech balloons in comics by training a fully-convolutional network model based on the U-Net architecture combined with VGG-16 based encoding. The model was able to fit and predict subtle aspects of speech balloons like the shape and direction of the tail, the shape of the outline, or even the approximate end of speech bubbles that were partly defined by illusory contours (Kanizsa, 1976).
Whereas the overall performance is impressive, inspection of misclassification cases is instructive. Most of these appear to be solveable by some heuristic post-processing using methods similar to those described in I-A. Additionally, a relatively high confidence threshold would help reduce the number of false alarms. For example, sometimes the model is weakly confident about having detected a balloon in a bright area that has a boundary akin to a circular arc. However, the confidence for these regions is usually lower than for real balloons, and the detected area is also significantly smaller. Thus, some thresholding on confidence as well as size might be helpful for production use, even though less desirable in terms of using a unified approach. Another class of misclassification is misses (omissions); inspection of these suggest that they were often ballons with very large fonts, and sometimes jagged boundaries. Since very large typeface is otherwise mostly used for onomatopeia, we think it is possible that the model has learned that large fonts are usually associated with something else than speech balloons.
One limitation of the current model is related to the training material. In testing transfer to new data sets we observed that whereas predictions for the eBDtheque data set were qualitatively good, the model in its current form failed quite miserably on the Manga109 data set [26, 29]. We speculate that the model currently encodes some culture-specific features such as letter identities of the latin alphabet, the horizontal orientation of text lines, or the horizontally elongated shape of most speech balloons in our training material. An updated version that includes more Manga training material is under development.
Our original aim for successful detection was to integrate the semi-automated speech balloon detection in our in-house annotation tool (Dunst et al., 2017). Even though prediction times are pretty fast, they will introduce a noticeable delay in page loading, thus a batch mode is planned. Regions that are predicted to be speech balloons will be represented by their polygonal outline, which can then be accepted or corrected by the annotator. Alternatively, means of optimizing the run time per image will be explored.
Another obvious for future research is to extend the model to allow segmentation of different classes, such as captions, onomatopoeia, and even characters. The general success of segmentation model in multiclass labeling as well as the specific results reported by Nguyen et al. (2019) in their multitask model suggest that this will be quite feasible. A more general use of this kind of algorithm (using different training data) would be any segmentation task using real-world features. For example, we expect that the present approach may generalize to other segmentation tasks such as segmenting historical manuscripts into text and image, scientific articles into text, figures and tables, segmenting newspaper articles into paragraphs and images, or segmenting medical scans according to presence or absence of a disease. Of course, human-assisted verification is needed, but given the fact there are now several computer vision domains where the performance of CNN-based models is at least close to human performance, we expect to be able to solve several tedious annotation tasks, freeing human resources for more interesting endeavours.
This work was funded by BMBF grant 01UG01507B to JL. The authors would like to thank Christophe Rigaud and the eBDtheque team for making available their database.
- DBL  2nd International Workshop on coMics Analysis, Processing, and Understanding, 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017, 2017. IEEE. URL http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=8269930.
- Arai and Tolle  Kohei Arai and Herman Tolle. Method for real time text extraction of digital manga comic. International Journal of Image Processing (IJIP), 4(6):669–676, 2011. URL http://www.cscjournals.org/manuscript/Journals/IJIP/Volume4/Issue6/IJIP-290.pdf.
- Augereau et al.  Olivier Augereau, Motoi Iwata, and Koichi Kise. An overview of comics research in computer science. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) DBL , pages 54–59. URL https://doi.org/10.1109/ICDAR.2017.292.
- Chollet et al.  François Chollet et al. Keras. https://keras.io, 2015. URL https://keras.io.
- Chollet  François Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800–1807, July 2017. URL http://arxiv.org/abs/1610.02357.
- Chu and Li  Wei-Ta Chu and Wei-Wei Li. Manga facenet: Face detection in manga based on deep neural network. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, ICMR ’17, pages 412–415, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4701-3. URL http://doi.acm.org/10.1145/3078971.3079031.
- Cohn  Neil Cohn. The visual language of comics: introduction to the structure and cognition of sequential images. Bloomsbury advances in semiotics. 2013. ISBN 9781441170545.
- Cohn and Magliano [in press] Neil Cohn and Joseph P Magliano. Why study visual narratives? A framework for studying visual narratives in the cognitive sciences. TopiCS, in press.
- Deng et al.  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. URL https://doi.org/10.1109/CVPR.2009.5206848.
- Dorkin et al.  E. Dorkin, J. Thompson, and J. Arthur. Beasts of Burden: Animal rites. Vol. 1, volume 1 of Beasts of Burden: Animal Rites. Dark Horse Books, Milwaukie, OR, 2010.
- Dunst et al.  Alexander Dunst, Rita Hartel, and Jochen Laubrock. The graphic narrative corpus (GNC): design, annotation, and analysis for the digital humanities. In 2nd International Workshop on coMics Analysis, Processing, and Understanding, 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017 DBL , pages 15–20. URL https://doi.org/10.1109/ICDAR.2017.286.
- Dunst et al.  Alexander Dunst, Jochen Laubrock, and Janina Wildfeuer, editors. Empirical Comics Research: Digital, Multimodal, and Cognitive Methods. Routledge, New York, 2018. ISBN 9781138737440. URL https://doi.org/10.4324/9781315185354.
- Guérin et al.  Clément Guérin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean-Christophe Burie, Georges Louis, Jean-Marc Ogier, and Arnaud Revel. eBDtheque: a representative database of comics. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR), pages 1145–1149, 2013. URL https://doi.org/10.1109/ICDAR.2013.232.
- He et al.  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017. URL http://arxiv.org/abs/1703.06870.
- Hernandez  Gilbert Hernandez. Heartbreak soup: The first volume of ‘Palomar’ stories from Love & Rockets. Fantagraphics, Seattle, Wash., 2007. ISBN 1-56097-783-3.
- Ho et al.  Anh Khoi Ngo Ho, Jean-Christophe Burie, and Jean-Marc Ogier. Panel and speech balloon extraction from comic books. In 2012 10th IAPR International Workshop on Document Analysis Systems, pages 424–428, March 2012. URL https://doi.org/10.1109/DAS.2012.66.
- Hunter  J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & Engineering, 9(3):90–95, 2007.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
- Iyyer et al.  Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daumé III, and Larry Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Kass et al.  Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, Jan 1988. ISSN 1573-1405. URL https://doi.org/10.1007/BF00133570.
- Kompatsiaris et al.  Ioannis Kompatsiaris, Benoit Huet, Vasileios Mezaris, Cathal Gurrin, Wen-Huang Cheng, and Stefanos Vrochidis, editors. MultiMedia Modeling - 25th International Conference, MMM 2019, Thessaloniki, Greece, January 8-11, 2019, Proceedings, Part II, volume 11296 of Lecture Notes in Computer Science, 2019. Springer. ISBN 978-3-030-05715-2. URL https://doi.org/10.1007/978-3-030-05716-9.
- Kümmerer et al.  Matthias Kümmerer, Thomas S. A. Wallis, Leon A. Gatys, and Matthias Bethge. Understanding low- and high-level contributions to fixation prediction. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017. URL https://doi.org/10.1109/ICCV.2017.513.
- Laubrock and Dubray  Jochen Laubrock and David Dubray. CNN-based classification of illustrator style in graphic novels: Which features contribute most? In Kompatsiaris et al. , pages 684–695. ISBN 978-3-030-05716-9. URL https://doi.org/10.1007/978-3-030-05716-9_61.
- Laubrock et al.  Jochen Laubrock, Sven Hohenstein, and Matthias Kümmerer. Attention to comics: Cognitive processing during the reading of graphic literature. In Dunst et al. , pages 239–263. ISBN 9781138737440. URL https://doi.org/10.4324/9781315185354.
- Lewis et al.  John Lewis, Andrew Aydin, Nate Powell, and Chris Ross. March. Book two. Top Shelf Productions, Marietta, GA, 2015. ISBN 9781603094009.
- Matsui et al.  Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, Oct 2017. ISSN 1573-7721. URL https://doi.org/10.1007/s11042-016-4020-z.
Nguyen et al. 
Nhu-Van Nguyen, Christophe Rigaud, and Jean-Christophe Burie.
Comic characters detection using deep learning.In 2nd International Workshop on coMics Analysis, Processing, and Understanding, 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017 DBL , pages 41–46. URL https://doi.org/10.1109/ICDAR.2017.290.
- Nguyen et al.  Nhu-Van Nguyen, Christophe Rigaud, and Jean-Christophe Burie. Multi-task model for comic book image analysis. In Kompatsiaris et al. , pages 637–649. ISBN 978-3-030-05716-9. URL https://doi.org/10.1007/978-3-030-05716-9.
- Ogawa et al.  Toru Ogawa, Atsushi Otsubo, Rei Narita, Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. Object detection for comics using Manga109 annotations. CoRR, abs/1803.08670, 2018. URL http://arxiv.org/abs/1803.08670.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. URL http://arxiv.org/abs/1506.01497.
- Rigaud et al.  Christophe Rigaud, Jean-Christophe Burie, and Jean-Marc Ogier. Text-independent speech balloon segmentation for comics and manga. In Bart Lamiroy and Rafael Dueire Lins, editors, Graphic Recognition. Current Trends and Challenges, pages 133–147, Cham, 2017. Springer International Publishing. ISBN 978-3-319-52159-6. URL https://doi.org/10.1007/978-3-319-52159-6_10.
- Rigaud et al.  Christophe Rigaud, Jean-Christophe Burie, Jean-Marc Ogier, Dimosthenis Karatzas, and Joost van de Weijer. An active contour model for speech balloon detection in comics. In 12th International Conference on Document Analysis and Recognition, ICDAR 2013, Washington, DC, USA, August 25-28, 2013, pages 1240–1244. IEEE Computer Society, 2013. ISBN 978-0-7695-4999-6. URL https://doi.org/10.1109/ICDAR.2013.251.
- Ronneberger et al.  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Nassir Navab, Joachim Hornegger, William M. Wells, and Alejandro F. Frangi, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241, Cham, 2015. Springer International Publishing. ISBN 978-3-319-24574-4. URL https://doi.org/10.1007/978-3-319-24574-4_28.
- Shelhamer et al.  Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015. URL https://doi.org/10.1109/CVPR.2015.7298965.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. URL http://arxiv.org/abs/1409.1556.
- Tomine  Adriane Tomine. Shortcomings. Drawn & Quarterly, Montreal, 2007.