Libraries are often interested in analyzing handwritten annotations in historic manuscripts or prints. Such annotations can give hints about the provenance of the documents or provide additional information about the readers’ thoughts.
For successful handwriting recognition in document images, a multi-stage processing pipeline is needed. Before an algorithm can analyze handwritten annotations in a document, it is important to localize the areas in this document that contain handwriting. Therefore, it is crucial to implement a robust algorithm for finding these areas as one of the first steps in the processing pipeline.
Due to the huge variety in the layout of these documents, it is very difficult to come up with a rule-based algorithm that can reliably find handwritten annotations in unseen documents. Many previous approaches aim to segment the documents into regions in a top-down or bottom-up manner, or by using texture-based methods.
and LSTMs (Long Short-Term Memory) neural networks [9, 10]. Although the results of both approaches are promising, approaches based on Convolutional Neural Networks (CNNs) are becoming popular because of their efficiency. These networks, known as Fully Convolutional Neural Networks (FCNNs), outperformed previous methods. Most recently, FCNNs have also been used to segment historical document images [11, 12, 13]. The methods we evaluate on our dataset are FCNN based.
Our dataset consists of images which are divided into representative training images and test images. The images are taken from multiple manuscripts and feature different layouts and various kinds of annotations (cf. Section III). For each image, there exists a pixel-wise ground truth (GT) with two classes: handwritten annotation and background (cf. Fig. 2). The task associated with this dataset is to classify each pixel of the input images and to maximize the mean IoU score.
The contribution of our work is twofold: First, we present a new challenging dataset for handwritten annotation detection. Secondly, we train and evaluate multiple methods for segmentation on this dataset.
The remaining sections of this paper are organized as follows: Section II briefly describes related work in the fields of both handwriting recognition and deep learning. In Section III, the new dataset and its unique challenges are described. The network architectures we use are presented in Section IV. The experimental setup and training details are described in Section V. Section VI reports the results of the experiments, followed by the discussion in Section VII. Finally, Section VIII concludes the paper and gives perspectives for future work.
II Related Work
Traditional approaches for the semantic segmentation of documents build on machine learning methods applied to hand-crafted features. Typically, the contribution lies in extracting good feature representations of the document and feeding them to a classifier trained on the training data [14, 15, 16].
Zagoris et al. used a Bag of Visual Words model to extract local image features, which are fed to an SVM-based classifier for segmentation of key points. This approach, however, requires binarization and a Hough transform as preprocessing operations, which can be dependent on the type of the documents.
Chen et al. extract color- and coordinate-based handcrafted feature representations from the colored images, and the obtained features are fed to an SVM classifier. The method works directly on colored images without any pre-processing operations like binarization. However, a threshold-based post-processing step is required, which can be dependent on the training data.
Deep learning has been the key success factor for many fields related to computer vision and pattern recognition, such as document analysis [17, 18]. Recently, many deep learning approaches have been proposed for document classification [19, 20, 21], binarization [22, 10, 23], historical document segmentation [16, 12, 24], image format detection, multiple script identification [26, 27], character recognition of difficult scripts [28, 29], etc. The following is a description of closely related semantic segmentation methods that put the proposed work into perspective.
Very recent approaches for the segmentation of historical documents use features extracted from deep-learning-based architectures like auto-encoders. Such architectures learn the features automatically from the given unlabeled data. The segmentation task here is modeled as a pixel-labeling task in which each pixel in the image is assigned a label. Chen et al. used features extracted from auto-encoders, which are classified using an SVM classifier.
CNN-based approaches have also been used in a similar fashion, where features are obtained automatically from the last or second-to-last fully connected layer of a CNN. Chen and Seuret generate small image patches using a superpixel algorithm; a CNN is then applied to the fixed-size patches to classify these patches, called superpixels, into their respective classes.
Most similar to our work is the application of FCNN-based methods, which have successfully been applied to various datasets [7, 8, 13]. In the case of documents, Yang et al. use a multimodal FCNN for extracting semantic structures from document images. In addition to the FCNN, their method combines an unsupervised reconstruction task with the pixel-labeling task and adds a text embedding map to take into account the content of the underlying text for better extraction of semantic structures like figure, table, heading, and paragraph. These methods, however, have been applied to more homogeneous data. In this paper, we study the effectiveness of FCNNs on diverse data of challenging historical documents.
III The Dataset
The success or failure of any model depends strongly on the dataset used for its training and testing. In the last decade, rich sets of historical manuscript databases have been collected, such as DIVA-HisDB, ENP and IMPACT. For our experiments, we introduce a challenging dataset for training and benchmarking our approaches for handwritten annotation segmentation.
Our new dataset comprises 40 images for training and validation and 10 images for testing. The images with their respective GT in the well-known PAGE format are kindly provided by the Universitätsbibliothek der Humboldt-Universität zu Berlin.
Figure 2 illustrates different patches of sample images from our dataset.
An interesting feature of the dataset is that it contains document pages from multiple sources which are digitized using different devices. This increased variance makes our dataset especially challenging for segmentation tasks. While in other datasets the background and foreground colors of the images are typically similar, they are very heterogeneous in our dataset. Not only do the font styles and sizes in our dataset vary significantly for both printed text and annotations, but also the line spacing and the overall layout of the pages are different. Lastly, both the sizes and the aspect ratios of the images differ significantly (cf. Table I), which exacerbates the challenge of different fonts and layouts even further.
The handwritten annotations in the images are of various nature. They are often comments on the sides and underlines of the printed text. However, sometimes the annotations are also written between printed text lines. Sometimes, the document may even consist of handwriting only (cf. Fig. 2, which is a note written on an empty page). Also, the indentations of the handwritten annotations change, sometimes even within the same document page. A challenge regarding the underlines is to distinguish handwritten lines from horizontally printed lines.
Since the GT is available in PAGE format, we first convert it to two-color images. To achieve this, we use the DIVA-HisDB-PixelLevelLayout tool (https://github.com/DIVA-DIA/DIVA-HisDB-PixelLevelLayout). It binarizes the images and overlays them with the polygons described in the GT PAGE file. The pixels which are black after the binarization and lie within the annotation polygons are considered annotation pixels; the pixels which are inside the polygons and white are marked as ambiguous and are not considered in the evaluation (cf. Fig. 3). All pixels outside of the GT polygons are labeled as background. Since the binarization of the images is not perfect, the generated GT images can contain some incorrectly labeled pixels. However, these are very few and can be neglected.
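The labeling rule described above can be sketched in a few lines of plain Python. This is only an illustration of the rule, not the DIVA-HisDB-PixelLevelLayout tool itself: the function name, the label constants, and the assumption that the binarized page and the rasterized polygon mask are already available as nested lists are hypothetical.

```python
# Hypothetical label values; the actual tool may encode the classes differently.
BACKGROUND, ANNOTATION, AMBIGUOUS = 0, 1, 2


def label_pixels(binary_page, polygon_mask):
    """Combine a binarized page with a rasterized annotation-polygon mask.

    binary_page[y][x]  is True for black (ink) pixels after binarization.
    polygon_mask[y][x] is True for pixels inside a GT annotation polygon.
    """
    labels = []
    for ink_row, mask_row in zip(binary_page, polygon_mask):
        row = []
        for is_ink, in_polygon in zip(ink_row, mask_row):
            if not in_polygon:
                row.append(BACKGROUND)   # outside every polygon
            elif is_ink:
                row.append(ANNOTATION)   # black pixel inside a polygon
            else:
                row.append(AMBIGUOUS)    # white pixel inside a polygon
        labels.append(row)
    return labels


page = [[True, False, True],
        [False, True, False]]
mask = [[True, True, False],
        [False, True, True]]
labels = label_pixels(page, mask)
```

Pixels labeled `AMBIGUOUS` would then be excluded when the predictions are scored.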
All training and testing images with their corresponding GT annotations in both PAGE and PNG format can be downloaded at http://tc11.cvc.uab.es/datasets/AnnotationDB_1.
IV Network Architectures
In this section, we describe the FCNN architectures we used to segment the images into the two classes: handwritten annotations and background.
The FCN-8s is a multi-stream FCNN architecture that was proposed by Long et al. in 2015 and scores a mean IoU of on the segmentation task of the Pascal Visual Object Classes (VOC) Challenge. The architecture is based on the famous VGG-16 architecture, which performed best on the localization task of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2014.
Just like VGG-16, FCN-8s employs five stacks of convolutional layers with appended max-pooling. Each of these pooling layers reduces the height and width of the feature maps by a factor of $2$. Thus, the feature maps after the fifth pooling layer are of size $\frac{h}{32} \times \frac{w}{32}$, with $h$ and $w$ denoting the height and width of the input image, respectively (cf. Fig. 4). While VGG-16 feeds the features of the last pooling layer to a stack of three fully-connected layers for classification, the feature maps in FCN-8s are processed by a convolutional layer which produces one feature map per class, i.e., class scores. These feature maps are then upsampled with bilinear interpolation to generate pixel-wise classification results for segmentation.
In addition to the stream described above, the network contains two more streams to incorporate lower-level features into the final classification. This has proven to boost the network performance [7, 35]. Specifically, the network uses the features from the third and the fourth stack of convolutions and fuses them with the features of the final stack of convolutions. Since these feature maps are of a different size than the feature maps in the main stream, the fusion and upsampling happen in multiple steps (cf. Fig. 4). Fusion, in this case, is a simple addition of the feature maps.
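The stepwise fusion can be illustrated with toy score maps. The sketch below assumes the commonly described FCN-8s order (upsample the coarsest scores by 2, add the scores of the fourth stack, upsample by 2 again, add the scores of the third stack, then upsample by 8); it also replaces bilinear interpolation with nearest-neighbour upsampling to stay short. All function names are hypothetical.

```python
def upsample(fmap, factor):
    """Nearest-neighbour upsampling (a simple stand-in for bilinear interpolation)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def add(a, b):
    """Element-wise sum of two equally sized score maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_fcn8s(score_stack3, score_stack4, score_final):
    """Fuse per-class score maps at 1/8, 1/16 and 1/32 of the input resolution."""
    fused16 = add(upsample(score_final, 2), score_stack4)   # 1/32 -> 1/16
    fused8 = add(upsample(fused16, 2), score_stack3)        # 1/16 -> 1/8
    return upsample(fused8, 8)                              # 1/8  -> full size


# Toy score maps for one class and a 32 x 32 input image.
s5 = [[1.0]]                            # 1 x 1 (1/32 resolution)
s4 = [[0.5, 0.0], [0.0, 0.5]]           # 2 x 2 (1/16 resolution)
s3 = [[0.1] * 4 for _ in range(4)]      # 4 x 4 (1/8 resolution)
out = fuse_fcn8s(s3, s4, s5)            # 32 x 32 score map
```

In the real network, each stream first passes through its own scoring layer so that all maps have one channel per class before they are added.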
IV-B DeepLab v2
DeepLab v2 is an FCNN which is based on the architecture of ResNet-101 and was proposed by Chen et al. in 2016. Using atrous convolution, multi-scale image representations and Conditional Random Fields, the method scores mean IoU on the segmentation task of the Pascal VOC Challenge.
Atrous convolutions, also known as dilated convolutions, have a practical advantage over regular convolution and max-pooling layers, as the spatial resolution of the feature map can be retained without any upsampling involved. Although using atrous convolutions throughout the network can lead to feature maps at the original image resolution, it is computationally more expensive to keep full-sized feature maps throughout the entire network. Thus, DeepLab v2 is a hybrid network architecture that includes both atrous convolutions and standard convolutions with max-pooling and bilinear interpolation.
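To make the idea of atrous convolution concrete, the following one-dimensional sketch shows how a dilation rate spaces out the kernel taps. It is a didactic example in plain Python, not code from DeepLab; the function name and the toy signal are made up for illustration.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D convolution (cross-correlation form) with dilation `rate`.

    With rate=1 this is a regular convolution. With rate=r the kernel taps
    are spaced r samples apart, so the receptive field grows while the
    number of weights stays the same.
    """
    span = (len(kernel) - 1) * rate + 1          # effective kernel extent
    return [
        sum(w * signal[start + k * rate] for k, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


x = [1, 2, 3, 4, 5, 6, 7]
k = [1, 0, -1]
dense = atrous_conv1d(x, k, rate=1)     # taps at offsets 0, 1, 2
dilated = atrous_conv1d(x, k, rate=2)   # taps at offsets 0, 2, 4
```

The same three weights cover three samples with `rate=1` and five samples with `rate=2`, which is how a network can enlarge its field of view without pooling.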
Chen et al. further propose multi-scale image processing with their network architecture. They let rescaled versions of the original image be processed by multiple branches of the network. All versions are then upsampled by bilinear interpolation to obtain the original image resolution and fused by selecting the maximum response across the different scales.
Lastly, a fully-connected CRF is applied as a post-processing step to further improve the results.
IV-C Pyramid Scene Parsing Network (PSPNet)
Like DeepLab v2, PSPNet is an FCNN architecture that is based on ResNet with atrous convolutions. It was proposed in different versions, building on top of ResNet-50, ResNet-101, ResNet-152 and ResNet-269. On the Pascal VOC Challenge, the method scores mean IoU. However, since the performance boost of the deeper versions of PSPNet over the ResNet-50 based version is less than (cf. ), we use the latter for our two-class segmentation task.
The main difference from DeepLab v2 is that PSPNet, as its name suggests, uses a pyramid pooling module to fuse features from different sub-regions of the image at different scales, while DeepLab v2 uses multi-scale image processing.
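The pooling step of such a pyramid module can be sketched as average pooling over grids of increasing resolution. The grid sizes 1, 2, 3 and 6 used as defaults below are assumed from the usual description of PSPNet and are not stated in this text; the function names are hypothetical, and the subsequent convolution, upsampling and concatenation steps of the module are omitted.

```python
from math import ceil, floor


def adaptive_avg_pool(fmap, bins):
    """Average a 2-D feature map over a bins x bins grid of sub-regions."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        r0, r1 = floor(i * h / bins), ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            c0, c1 = floor(j * w / bins), ceil((j + 1) * w / bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(fmap, levels=(1, 2, 3, 6)):
    """Return one pooled map per pyramid level, keyed by grid size."""
    return {b: adaptive_avg_pool(fmap, b) for b in levels}


# 6 x 6 toy feature map with values 0..35.
fmap = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
pyramid = pyramid_pool(fmap)
```

The coarsest level summarizes the whole map in a single value, while finer levels keep progressively more local context.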
We train and evaluate the FCNNs described in Section IV on semantic segmentation. In doing so, we explore transfer learning, binarized images, and data augmentation techniques. All experiments are performed on an NVIDIA Titan X GPU.
We report the mean IoU score over the entire test set. For each class, the IoU score is:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where $TP$, $FP$, and $FN$ denote the true positives, false positives, and false negatives, respectively.
The mean IoU is simply the average over all classes. We compute mean IoU as Long et al.:

$$\mathrm{mean\ IoU} = \frac{1}{n_{cl}} \sum_{i} \frac{n_{ii}}{t_i + \sum_{j} n_{ji} - n_{ii}}$$

where $n_{cl}$ denotes the number of classes, $n_{ij}$ denotes the number of pixels of class $i$ predicted to belong to class $j$, and $t_i = \sum_{j} n_{ij}$ denotes the total number of pixels of class $i$.
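As a small worked example, the metric can be computed from a confusion matrix as follows. This is a sketch under stated assumptions and not the evaluation tool used in the paper: the function name, the `ignore_label` parameter, and the choice to skip classes that never occur are illustrative.

```python
def mean_iou(gt, pred, num_classes, ignore_label=None):
    """Mean IoU over flat sequences of ground-truth and predicted labels.

    Pixels whose ground-truth label equals `ignore_label` (for example,
    ambiguous pixels) are skipped. Classes that occur in neither `gt` nor
    `pred` are left out of the average.
    """
    # n[i][j]: number of pixels of class i predicted to belong to class j
    n = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        if g != ignore_label:
            n[g][p] += 1

    ious = []
    for i in range(num_classes):
        tp = n[i][i]
        fn = sum(n[i]) - tp                                   # class i, predicted as another class
        fp = sum(n[j][i] for j in range(num_classes)) - tp    # other classes, predicted as i
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


gt   = [0, 0, 0, 1, 1, 2]   # 2 is a hypothetical "ambiguous" label
pred = [0, 0, 1, 1, 1, 1]
score = mean_iou(gt, pred, num_classes=2, ignore_label=2)
```

In this toy case both classes have two true positives and one error, so each class reaches an IoU of 2/3 and the mean is 2/3 as well.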
The evaluation is performed using the DIVA Layout Analysis Evaluator tool. This tool is designed for document segmentation with multi-labeled pixel GT and has also been used for the ICDAR2017 Competition on Layout Analysis for Challenging Medieval Manuscripts. It fits our requirements well, as it takes care of ambiguous regions (cf. Fig. 3).
V-A Transfer Learning and Finetuning
As our dataset contains only images for training, it is natural to exploit other datasets for pretraining and then finetune the network parameters on our target dataset. For pretraining, we exploit both the ILSVRC dataset which contains million real-world images and the DIVA-HisDB dataset which contains images from historic documents.
When initializing the weights with ILSVRC pretraining, the convolutional layers just keep their weight matrices. However, since the original network architectures use fully-connected layers for classification and our networks are fully convolutional, we have to convert the fully connected layers to convolutional layers to benefit from the pretraining. The trick is to interpret these layers as convolutional layers with kernels that cover the entire input region (cf. ). Now, all we have to do is to reshape the corresponding weight matrices and we have an FCNN.
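The reshaping trick can be checked on a toy layer. The sketch below uses plain Python lists and hypothetical helper names; it assumes the fully connected layer sees its input flattened in channel, row, column order, which is an assumption about the memory layout rather than something stated in the text.

```python
def fc_forward(weights, x_flat):
    """Fully connected layer: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, x_flat)) for row in weights]


def fc_to_conv_kernels(weights, channels, height, width):
    """Reshape an (out, C*H*W) weight matrix into (out, C, H, W) kernels."""
    kernels = []
    for row in weights:
        it = iter(row)
        kernels.append([[[next(it) for _ in range(width)]
                         for _ in range(height)]
                        for _ in range(channels)])
    return kernels


def conv_valid(kernels, fmap):
    """'Valid' convolution of a C x H x W map with (out, C, kh, kw) kernels."""
    kh, kw = len(kernels[0][0]), len(kernels[0][0][0])
    out_h = len(fmap[0]) - kh + 1
    out_w = len(fmap[0][0]) - kw + 1
    return [[[sum(k[c][i][j] * fmap[c][y + i][x + j]
                  for c in range(len(k))
                  for i in range(kh)
                  for j in range(kw))
              for x in range(out_w)]
             for y in range(out_h)]
            for k in kernels]


# Toy layer: 2 output units on a 1 x 2 x 2 input.
W = [[1, 2, 3, 4],
     [0, -1, 0, 1]]
small = [[[1, 0],
          [2, 1]]]
flat = [v for channel in small for row in channel for v in row]
kernels = fc_to_conv_kernels(W, channels=1, height=2, width=2)

fc_out = fc_forward(W, flat)             # one score per unit
conv_out = conv_valid(kernels, small)    # same scores as 1 x 1 maps

# On a larger input, the same weights yield a spatial map of scores.
large = [[[1, 0, 1],
          [2, 1, 0]]]
conv_map = conv_valid(kernels, large)
```

On the original input size both layers give identical scores; on the wider input the convolutional form slides over the image and produces one score per position, which is what makes dense prediction possible.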
V-B Data Augmentation
To compensate for our limited training data, we use multiple data augmentation techniques to artificially increase our training data.
First, we use simple random cropping. Since the networks are too large to fit in the GPU memory with large images (cf. Section III), we crop patches of pixels from the images.
Second, we employ the data augmentation technique that was proposed by Szegedy et al. to train the Inception networks . This is a more sophisticated data augmentation technique which also adds some scaling and aspect ratio invariance to the images. We found these properties particularly useful for our heterogeneous dataset (cf. Section III). The method samples patches from the images uniformly at random with the following limitations. The area of the patch covers to of the whole image and its aspect ratio is between and . This patch is then scaled to pixels.
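A patch sampler of this kind can be sketched as follows. The function name and its parameters are hypothetical, and the ranges used in the example call (8% to 100% of the image area, aspect ratio between 3/4 and 4/3) are the values commonly associated with Inception-style training; they are assumptions here, not values taken from this text. Rescaling the patch to a fixed size is omitted.

```python
import math
import random


def sample_patch(img_h, img_w, area_range, ratio_range, rng=random, max_tries=10):
    """Sample a crop box (top, left, height, width) inside an img_h x img_w image.

    area_range  -- (min, max) fraction of the image area covered by the patch
    ratio_range -- (min, max) aspect ratio (width / height) of the patch
    Falls back to the whole image if no valid box is found in max_tries draws.
    """
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * img_h * img_w
        ratio = rng.uniform(*ratio_range)
        w = int(round(math.sqrt(area * ratio)))
        h = int(round(math.sqrt(area / ratio)))
        if 0 < w <= img_w and 0 < h <= img_h:
            top = rng.randint(0, img_h - h)
            left = rng.randint(0, img_w - w)
            return top, left, h, w
    return 0, 0, img_h, img_w


rng = random.Random(0)   # fixed seed so the example is repeatable
top, left, h, w = sample_patch(3000, 2000, (0.08, 1.0), (3 / 4, 4 / 3), rng=rng)
```

Because both the covered area and the aspect ratio vary from draw to draw, the network sees the same page content at many scales and proportions once the patches are resized to a common shape.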
In a last set of experiments, we binarize the training images, as this has been used frequently for segmentation tasks where the background color is more or less uniform . The images are binarized using adaptive thresholding and again cropped to pixels to fit in the GPU memory. We manually verify the representativeness of the samples and the areas of interest in the images.
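The text does not say which adaptive thresholding variant is used. As one simple possibility, the sketch below compares each pixel with the mean of its local neighbourhood; the function name, the window radius and the offset are illustrative assumptions.

```python
def adaptive_threshold(gray, radius=1, offset=0):
    """Mark a pixel as ink (True) if it is darker than its local mean minus `offset`.

    `gray` is a 2-D list of intensities; the local mean is taken over a
    (2 * radius + 1)^2 window that is clipped at the image border.
    """
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(gray[y][x] < sum(window) / len(window) - offset)
        out.append(row)
    return out


gray = [[200, 200, 200],
        [200,  50, 200],
        [200, 200, 200]]
ink = adaptive_threshold(gray)
```

Only the dark centre pixel falls below its neighbourhood mean, so it is the only pixel marked as ink. Whatever variant is used, the colour information of the page is discarded at this step.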
|Network|Pretraining|Data Augmentation|Mean IoU|
|DeepLab v2|ILSVRC|Random Cropping| |
In the following, we report the results of the experiments and attempt to give reasons for the outcomes.
For evaluation of the trained networks, we pass patches of pixels from the test images through the network in a sliding-window manner with an overlap of pixels. Predictions of pixels which are present in multiple patches are averaged. The scores achieved by the different networks are presented in Table II.
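The sliding-window evaluation with averaged overlaps can be sketched as below. The helper names are hypothetical, the patch size and overlap are left as parameters, and aligning the last window to the image border is an assumption about how the borders are handled.

```python
def window_starts(length, patch, stride):
    """Start offsets so that windows of size `patch` cover [0, length)."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)          # last window is aligned to the border
    return starts


def sliding_window_predict(image, predict, patch, overlap):
    """Average per-pixel scores of overlapping patches.

    `predict` maps a patch (list of rows) to an equally sized score map.
    """
    h, w = len(image), len(image[0])
    stride = patch - overlap
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]
    for top in window_starts(h, patch, stride):
        for left in window_starts(w, patch, stride):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            scores = predict(tile)
            for dy, score_row in enumerate(scores):
                for dx, s in enumerate(score_row):
                    total[top + dy][left + dx] += s
                    count[top + dy][left + dx] += 1
    return [[t / c for t, c in zip(tr, cr)] for tr, cr in zip(total, count)]


# Dummy "network" that returns the patch itself as its score map.
image = [[float(x + y) for x in range(7)] for y in range(5)]
result = sliding_window_predict(image, lambda tile: tile, patch=4, overlap=2)
```

With the identity function as a stand-in for the network, averaging the overlapping predictions reproduces the input, which is a quick sanity check that every pixel is covered and weighted correctly.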
Table II reveals that neither binarization of the images nor pretraining on DIVA-HisDB yields good results. In the case of training and testing on binary images, a lot of pixel information is already lost during preprocessing. While a colored pixel can have different values, it can only be black or white in binary images. Binarization may be helpful in cases where all the data is very similar, but for our diverse images, it is crucial to use colored images. The diversity of our dataset could also explain why pretraining on DIVA-HisDB resulted in a poor mean IoU of only . When trained on DIVA-HisDB, FCN-8s reaches a mean IoU of more than on the DIVA-HisDB test set, which is comparable to the state of the art. However, if these weights are used as initialization for training on our dataset, the network is unable to generalize to the new document types, because it is too strongly adapted to the homogeneous DIVA-HisDB images.
Pretraining the networks on the ILSVRC dataset yields overall good results, as the trained convolutional filters are very robust and generalize well thanks to the amount of training data.
The results indicate that current segmentation approaches and data augmentation techniques work well on the proposed complex dataset. The dataset contains images from multiple sources and a variety of different types of artifacts.
Another strong contribution of the paper is the introduction of the new dataset. While the number of pages may seem small, the task is to label every pixel, and hence the dataset is large enough. The results also show that, with a reasonable amount of augmentation, we are able to achieve great results.
Furthermore, it is worth mentioning the noise in the GT. The GT generation is explained in Section III. Although there is some noise in the GT pixels due to the binarization, it does not affect the evaluation significantly. Moreover, the main purpose of the proposed work is to show how well we can perform on a difficult dataset of historical documents. The GT of the provided dataset can be improved in the future, as discussed in Section VIII.
As far as the results are concerned, the FCNN-based approaches are well capable of segmenting the documents. While DeepLab v2 and PSPNet perform significantly better on real-world images than FCN-8s (cf. Section IV), FCN-8s performs best on our dataset. This result is consistent with a finding by Afzal et al., who found VGG-16 better suited for document image classification than ResNet-50. When analyzing the errors visually, we find that the ResNet-based networks have problems detecting fine pencil annotations. This relates to one of the failure modes discovered by Chen et al., where their best performing model “fails to capture the delicate boundaries of objects, such as bicycle and chair”.
With a data augmentation technique that fits our dataset well, we are able to reduce the error by half and achieve a final mean IoU score of . On visual inspection, our best model is almost perfect in detecting the handwritten annotations. Most errors come from noisy or incorrect GT (cf. Fig. 5).
Our results on this difficult dataset are one step towards a general-purpose segmentation method that could work with most documents.
VIII Conclusion and Future Work
In this work, we have presented a new and challenging dataset for document layout analysis in terms of semantic segmentation. Furthermore, we have trained and evaluated multiple combinations of different network architectures, weight initializations, and data augmentation strategies. The models trained in this paper do not include any domain-specific post-processing methods. Employing domain-specific knowledge would further improve the results, as has been shown in  recently.
An important direction for future work is the analysis of the generalization and scalability of our methods. While our dataset is already diverse, it would be interesting to see whether our approach also applies to documents from other libraries and countries. Furthermore, specialized methods could be derived from our classifier to work on specific types of books and to improve incrementally after a few pages.
As it is not feasible to get correct binarization for all the images in the dataset with automatic approaches, a possible improvement in the GT pixels could be performed by manually cleaning the binarized documents.
Finally, it would be good to integrate our approach into production systems. A possible scenario is that human experts can inspect all annotations found by the system, slightly correct them (enabling active learning and human-in-the-loop systems), and integrate the findings into higher-level research questions in the humanities, such as the change of annotation-behavior over the centuries.
We thank Insiders Technologies GmbH for providing the hardware to conduct the experiments described in this paper. We also want to thank Abhash Sinha and Shridhar Kumar for performing the experiments with PSPNet. Finally, we thank the following libraries for providing us with the digitized images for the database: Universitätsbibliothek der Humboldt-Universität zu Berlin; Universitätsbibliothek Kassel; Staatsbibliothek Berlin; and Staatsarchiv Marburg.
- [1] A. Asi, R. Cohen, K. Kedem, and J. El-Sana, “Simplifying the reading of historical manuscripts,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 826–830.
- [2] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CoRR, vol. abs/1512.03385, 2015.
- [6] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.
- [7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
- [9] A. Graves, “Supervised sequence labelling,” in Supervised Sequence Labelling with Recurrent Neural Networks, pp. 5–13, 2012.
- [10] M. Z. Afzal, J. Pastor-Pellicer, F. Shafait, T. M. Breuel, A. Dengel, and M. Liwicki, “Document image binarization using lstm: A sequence learning approach,” in Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, ser. HIP ’15. New York, NY, USA: ACM, 2015, pp. 79–84.
- [11] F. Simistira, M. Bouillon, M. Seuret, M. Würsch, M. Alberti, R. Ingold, and M. Liwicki, “Icdar2017 competition on layout analysis for challenging medieval manuscripts,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 1361–1370.
- [12] K. Chen, M. Seuret, J. Hennebert, and R. Ingold, “Convolutional neural networks for page segmentation of historical document images,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 965–970.
- [13] Y. Xu, W. He, F. Yin, and C.-L. Liu, “Page segmentation for historical handwritten documents using fully convolutional networks,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 541–546.
- [14] K. Zagoris, I. Pratikakis, A. Antonacopoulos, B. Gatos, and N. Papamarkos, “Distinction between handwritten and machine-printed text based on the bag of visual words model,” Pattern Recognition, vol. 47, no. 3, pp. 1051–1062, 2014, Handwriting Recognition and other PR Applications.
- [15] K. Chen, H. Wei, M. Liwicki, J. Hennebert, and R. Ingold, “Robust text line segmentation for historical manuscript images using color and texture,” in 2014 22nd International Conference on Pattern Recognition, Aug 2014, pp. 2978–2983.
- [16] K. Chen, M. Seuret, M. Liwicki, J. Hennebert, and R. Ingold, “Page segmentation of historical document images with convolutional autoencoders,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1011–1015.
- [17] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
- [18] M. Liwicki, V. Frinken, and M. Z. Afzal, “Latest developments of lstm neural networks with applications of document image analysis,” in Handbook of Pattern Recognition and Computer Vision. World Scientific, 2016, pp. 293–311.
- [19] M. Z. Afzal, S. Capobianco, M. I. Malik, S. Marinai, T. M. Breuel, A. Dengel, and M. Liwicki, “Deepdocclassifier: Document classification with deep convolutional neural network,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, pp. 1111–1115.
- [20] M. Z. Afzal, A. Kölsch, S. Ahmed, and M. Liwicki, “Cutting the error by half: Investigation of very deep cnn and advanced training strategies for document image classification,” arXiv preprint arXiv:1704.03557, 2017.
- [21] A. Kölsch, M. Z. Afzal, M. Ebbecke, and M. Liwicki, “Real-time document image classification using deep cnn and extreme learning machines,” arXiv preprint arXiv:1711.05862, 2017.
- [22] J. Pastor-Pellicer, S. España-Boquera, F. Zamora-Martínez, M. Z. Afzal, and M. J. Castro-Bleda, “Insights on the use of convolutional neural networks for document image binarization,” in International Work-Conference on Artificial Neural Networks. Springer, 2015, pp. 115–126.
- [23] C. Tensmeyer and T. Martinez, “Document image binarization with fully convolutional neural networks,” arXiv preprint arXiv:1708.03276, 2017.
- [24] J. Younas, M. Z. Afzal, M. I. Malik, F. Shafait, P. Lukowicz, and S. Ahmed, “D-star: A generic method for stamp segmentation from document images,” in Document Analysis and Recognition (ICDAR), 2017 14th IAPR International Conference on, vol. 1. IEEE, 2017, pp. 248–253.
- [25] F. Trier, M. Z. Afzal, M. Ebbecke, and M. Liwicki, “Deep convolutional neural networks for image resolution detection,” in Proceedings of the 4th International Workshop on Historical Document Imaging and Processing. ACM, 2017, pp. 77–82.
- [26] A. Ul-Hasan, M. Z. Afzal, F. Shafait, M. Liwicki, and T. M. Breuel, “A sequence learning approach for multiple script identification,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1046–1050.
- [27] Y. Fujii, K. Driesen, J. Baccash, A. Hurst, and A. C. Popat, “Sequence-to-label script identification for multilingual ocr,” arXiv preprint arXiv:1708.04671, 2017.
- [28] R. Ahmad, M. Z. Afzal, S. F. Rashid, M. Liwicki, and T. M. Breuel, “Scale and rotation invariant OCR for pashto cursive script using MDLSTM network,” in 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Nancy, France, August 23-26, 2015, 2015, pp. 1101–1105.
- [29] S. B. Ahmed, S. Naz, M. I. Razzak, S. F. Rashid, M. Z. Afzal, and T. M. Breuel, “Evaluation of cursive and non-cursive scripts using recurrent neural networks,” Neural Comput. Appl., vol. 27, no. 3, pp. 603–613, Apr. 2016.
- [30] X. Yang, M. E. Yümer, P. Asente, M. Kraley, D. Kifer, and C. L. Giles, “Learning to extract semantic structure from documents using multimodal fully convolutional neural network,” CoRR, vol. abs/1706.02337, 2017.
- [31] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold, “Diva-hisdb: A precisely annotated large dataset of challenging medieval manuscripts,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 471–476.
- [32] C. Clausner, C. Papadopoulos, S. Pletschacher, and A. Antonacopoulos, “The enp image and ground truth dataset of historical newspapers,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Aug 2015, pp. 931–935.
- [33] C. Papadopoulos, S. Pletschacher, C. Clausner, and A. Antonacopoulos, “The impact dataset of historical document images,” in Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, ser. HIP ’13. New York, NY, USA: ACM, 2013, pp. 123–130.
- [34] S. Pletschacher and A. Antonacopoulos, “The page (page analysis and ground-truth elements) format framework,” in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 257–260.
- [35] A. Kölsch, M. Z. Afzal, and M. Liwicki, “Multilevel context representation for improving object recognition,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 05, Nov 2017, pp. 10–15.
- [36] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in Neural Information Processing Systems, 2011, pp. 109–117.
- [37] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
- [38] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
- [39] M. Alberti, M. Bouillon, R. Ingold, and M. Liwicki, “Open Evaluation Tool for Layout Analysis of Document Images,” ArXiv e-prints, Nov. 2017.
- [40] Y. Zhong, K. Karu, and A. K. Jain, “Locating text in complex color images,” in Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1), ser. ICDAR ’95. Washington, DC, USA: IEEE Computer Society, 1995, pp. 146–.