Automatic layout analysis of historical documents is very challenging due to the tremendous variety of documents produced over the last centuries. If we restrict the scope to illuminated medieval manuscripts, the number of volumes concerned is still very large, but one can expect more similarities between them, at least regarding the types of elements found on the pages and their layout. Recognizing the text in medieval manuscripts has been a challenge for decades, but most systems focus on text line recognition, often in a single manuscript (Fischer et al., 2011, 2012). Systems for automatic layout analysis are usually tested on a very limited number of manuscripts, for example a single one for (Grana et al., 2009) and six for (Yang et al., 2017).
Recently, international evaluations have been organized for layout analysis of medieval manuscripts (Mehri et al., 2017), providing a large number of annotated pages but from a restricted number of manuscripts (4,436 images from 11 books for (Mehri et al., 2017)). As part of the HORAE research project (Stutzmann et al., 2019), which aims at studying the text of medieval devotional manuscripts, a large-scale annotated dataset was needed. In this paper, we describe the HORAE corpus, a corpus of annotated pages from books of hours collected for the study of their structure from both a layout and a textual point of view. We first describe how the books of hours were selected in libraries from different countries and the composition of the corpus. We then describe how a sample of pages was selected and annotated. Finally, we evaluate a state-of-the-art system for automatic layout analysis on the annotated pages. The dataset is freely accessible at https://github.com/oriflamms/HORAE/.
2. Description of the corpus
2.1. Books of hours
In the late Middle Ages, books of hours were owned and used by the laity as personal prayerbooks. They enjoyed a wide diffusion and more than ten thousand of them survive today in libraries, museums, and in private hands. They have been described as the medieval best-seller, but they show a great diversity both in the text and in the illustrations. Books of hours are a major source for the historical study of arts and painting since most of them are decorated with pictures (miniatures), decorated borders and initials. They are also crucial to understanding devotion and religious mindsets in medieval Europe because their numerous and very diverse texts are witnesses to customization practices and personal involvement in reading. The textual content, barely studied by scholars until now, is the object of the interdisciplinary research project HORAE (Hours: Recognition, Analysis, Edition) (6), encompassing handwritten text recognition (HTR), Natural Language Processing and Humanities. To process such a heterogeneous collection automatically and at a large scale, we recognized the need to create an annotated dataset of pages from multiple books of hours.
The corpus encompasses 500 fully digitized manuscripts (list as of 12 Oct. 2018: file 500_MSS.csv available at https://github.com/oriflamms/HORAE/). They have been manually selected as part of the HORAE project. The present list is based on a larger census of books of hours (Stutzmann, 2019) and restricted to digitized manuscripts fully available through the International Image Interoperability Framework (IIIF, https://iiif.io/). The number of manuscripts and images for each provider is given in Table 1. The main providers are the digital libraries BVMM (Bibliothèque virtuelle des manuscrits médiévaux) and Gallica, maintained respectively by Institut de Recherche et d’Histoire des Textes (IRHT-CNRS) and the French National Library.
Our goal is to develop a system based on machine learning to automatically analyze a large corpus of books of hours. To train the models, a set of annotated pages is needed. Usually, a random sample of pages is selected to be annotated. However, the images of pages in the full collection show a great variety in their appearance, due to the difference in manuscripts’ layout and to different conservation and digitization conditions (color scale, double-page, etc.). Therefore, in order to train a model with a high generalization capacity, we wanted to select a subset which would adequately represent this diversity.
Within a given manuscript, different types of pages share a similar layout, e.g. calendar pages, full-text pages, and illustrations. Within the corpus, some manuscripts have a similar layout and some are very different from all the others. These rare layouts are important in order to train a system with good generalization capacity on all image types, even the rarest ones. For this reason, it was not appropriate to select a random sample among the whole corpus. We defined a strategy to select a sample reflecting the diversity of the images in the corpus. This strategy consists of three steps: automatic page classification, clustering of the pages, and selection.
2.3.1. Automatic page classification
This step aims at identifying pages for which no text recognition will be performed and therefore no layout analysis is needed. All pages are first automatically classified into seven classes (binding, white page, calendar, miniature, miniature-and-text, text-with-miniature and full-page text) with a classifier based on deep neural networks (Boros et al., 2019). We remove pages classified as binding and white page. After this filtering step, 92,512 images are kept. In order to reduce the redundancy, we keep only two images of each class for each manuscript. At this stage, 5,738 different pages are selected from the full corpus of 107,227 pages.
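The filtering and per-manuscript sampling described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the input tuple layout and the function name `sample_pages` are hypothetical, and since the text does not say how the two images per class are chosen, the sketch simply assumes the first two encountered are kept.

```python
from collections import defaultdict

# Classes removed before sampling, as named in the text.
EXCLUDED_CLASSES = {"binding", "white page"}


def sample_pages(pages, per_class=2):
    """Filter and sample pages.

    pages: iterable of (manuscript_id, page_id, predicted_class) tuples
    (hypothetical layout). Returns the ids of the retained pages.
    """
    kept = defaultdict(list)
    for manuscript_id, page_id, page_class in pages:
        if page_class in EXCLUDED_CLASSES:
            continue  # no text recognition planned for these pages
        bucket = kept[(manuscript_id, page_class)]
        # Assumption: keep the first `per_class` pages seen for each
        # (manuscript, class) pair.
        if len(bucket) < per_class:
            bucket.append(page_id)
    return [page_id for bucket in kept.values() for page_id in bucket]
```

Any other choice of the two images (random, evenly spaced in the volume) would fit the same structure by changing the condition inside the loop.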
2.3.2. Clustering of the pages
The goal of this clustering step is twofold: first to group similar pages so that only one image is selected among them and second to detect pages that have a rare layout ("outliers").
In order to obtain a clustering of the 5,738 pages, we applied HDBSCAN (McInnes et al., 2017), a density-based hierarchical clustering algorithm, to all the images, each represented as a vector of raw pixel values from a 64 × 64 downsampled image with the 3 color channels concatenated (12,288 values in total per image). The clustering was performed over the pages with a min_cluster_size of 3 and a Euclidean metric. We tested different values for min_cluster_size. With a value of 4, every single image was labeled as an "outlier". Conversely, with a value of 2, almost all images were clustered in pairs. We therefore chose 3 as a good compromise between having too many clusters and no clusters at all.
The clustering algorithm HDBSCAN does at once the clustering and detection of the outliers (images of a rare type). It defined 141 different clusters and it labeled ca. 2,200 images as "outliers" (they have not been grouped with one of the 141 clusters). Both groups are used for the selection.
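A minimal sketch of this step, assuming the third-party `hdbscan` Python package (with NumPy) is installed. The helper names are hypothetical, and the channel ordering in `image_to_vector` (all red values, then green, then blue) is an assumption, since the text only states that the three channels are concatenated.

```python
def image_to_vector(rgb_image):
    """Flatten a 64 x 64 image, given as rows of (r, g, b) tuples,
    into a single vector of 64 * 64 * 3 = 12,288 values.

    Assumption: channels are concatenated plane by plane (R, then G, then B).
    """
    vector = []
    for channel in range(3):
        for row in rgb_image:
            for pixel in row:
                vector.append(pixel[channel])
    return vector


def cluster_pages(vectors, min_cluster_size=3):
    """Cluster page vectors with HDBSCAN using a Euclidean metric.

    Returns (labels, outlier_scores); label -1 marks unclustered pages.
    """
    # Imported lazily so that image_to_vector works without these packages.
    import numpy as np
    import hdbscan

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                metric="euclidean")
    labels = clusterer.fit_predict(np.asarray(vectors, dtype=float))
    return labels, clusterer.outlier_scores_
```

Calling `cluster_pages` with `min_cluster_size` set to 2, 3 and 4 in turn reproduces the kind of comparison described above, although the resulting cluster counts depend on the data.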
2.3.3. Selection step
We decided to select 600 images to be annotated. This number was estimated in relation to the amount of annotation effort we wanted to spend on this task (around 60 hours). We first kept the centroid of each cluster. These 141 images correspond to the most frequent layouts in the dataset. To deal with the rare layout types, we selected the 459 images with the highest "outlier score", a value computed by the clustering algorithm.
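The selection can be sketched as below. Two points are assumptions of this illustration: the "centroid" image of a cluster is taken to be the member closest to the cluster mean, and unclustered pages carry the label -1, as in common HDBSCAN implementations. The function name `select_pages` is hypothetical.

```python
import math


def select_pages(vectors, labels, outlier_scores, total=600):
    """Pick one representative per cluster, then fill the remaining
    budget with the unclustered pages that have the highest outlier score.

    Returns (representative_indices, outlier_indices).
    """
    clusters = {}
    for index, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(index)

    representatives = []
    for members in clusters.values():
        dim = len(vectors[members[0]])
        mean = [sum(vectors[i][d] for i in members) / len(members)
                for d in range(dim)]
        # Assumption: the cluster "centroid" image is the member whose
        # vector is closest to the cluster mean.
        closest = min(members, key=lambda i: math.dist(vectors[i], mean))
        representatives.append(closest)

    outliers = [i for i, label in enumerate(labels) if label == -1]
    outliers.sort(key=lambda i: outlier_scores[i], reverse=True)
    remaining = max(total - len(representatives), 0)
    return representatives, outliers[:remaining]
```

With 141 clusters and a budget of 600 images, the second list holds 600 - 141 = 459 outliers, matching the figures given above.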
|Processing step|Sample size (pages)|
|---|---|
|Full HORAE corpus (500 manuscripts)|107,227|
|Filtering page classes|92,512|
|Sample based on page classes|5,738|
2.4. Annotation process
The 600 selected images were annotated by three annotators using Transkribus (https://transkribus.eu/Transkribus/). They annotated the main structural elements: pages, decorated borders, text regions, text lines, miniatures, initials, and inline decorative elements. The manually annotated structures are rectangular shapes, except for text lines, which were automatically detected within text regions by Transkribus’ CITlab layout analysis tool, then corrected if necessary. The output of this annotation process is a set of PAGE XML files containing the coordinates and tags of all shapes; there is one PAGE XML file per annotated image.
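As an illustration of how such files can be read, the following sketch extracts the bounding box of every element that carries a `Coords` child with a `points` attribute, which is how recent PAGE XML schema versions encode shapes. It is an assumption of this sketch that the annotation tag is stored in the element's `custom` attribute; the exact encoding used in the released files should be checked against the dataset itself. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code does not depend on a
    specific PAGE schema version."""
    return tag.rsplit("}", 1)[-1]


def read_regions(page_xml):
    """Return a list of (element_name, custom_attribute, bounding_box)
    for every element that has a Coords child with a points attribute.

    bounding_box is (x_min, y_min, x_max, y_max).
    """
    root = ET.fromstring(page_xml)
    regions = []
    for element in root.iter():
        for child in element:
            if local_name(child.tag) != "Coords" or not child.get("points"):
                continue
            # points is a space-separated list of "x,y" pairs.
            points = [tuple(int(v) for v in pair.split(","))
                      for pair in child.get("points").split()]
            xs, ys = zip(*points)
            box = (min(xs), min(ys), max(xs), max(ys))
            # Assumption: the annotation tag, if any, is in `custom`.
            regions.append((local_name(element.tag),
                            element.get("custom", ""), box))
    return regions
```

Because text lines are polygons rather than rectangles, the bounding box computed here is only an approximation of their annotated shape.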
Annotators worked according to the following order and definitions:
Page: one for each actual page of the manuscript present in the image (one or two);
Miniature: all illustrations that are not part of the borders. We defined the borders as decorations that frame the text region, whether they enclose it completely or are only on some sides.
Border elements are divided into: illustrated_border for miniatures and meaningful representations; decorated_border for ornamental border decorations; border_text for text regions that are located in the borders. The distinction between illustrated and decorated borders is not always an easy one; our annotation guidelines distinguished between ornamental elements which depict a scene or identifiable character and those that do not. For example, a border element depicting a saint would be annotated as an illustrated_border, while one depicting flowers or random animals would not.
Initials are divided along the same principles as the border elements: simple_initial for those differing from the body of the text by their ink color and/or size (some images are in black and white); decorated_initial for initials that are decorated only with purely ornamental, not iconographic elements; and historiated_initial for the usually bigger initials whose decoration contains an iconographic element and depicts a scene or a character, carrying a meaning beyond the alphabetical letter. The distinction is similar to the one between decorated and illustrated borders.
Other decorations inside the body of the text are annotated with: the self-explaining line_filler tag; the music_notation tag for musical notation; ornamentation tag for decorations within the text that do not fit any of the other tags.
Because there is no ground truth against which to measure the correctness of the annotation of books of hours pages, we chose instead to evaluate the consistency of the annotations made by our three annotators. To do so, in addition to the 200 images each of them had to annotate, we added 10 images that all three annotated, in order to evaluate the inter-annotator agreement.
The following table presents the results of this evaluation using the Intersection-over-Union (IoU) metric. In the absence of ground truth, we compared each annotator’s boxes against the two others’.
The elements on which inter-annotator agreement is the lowest are the pages and the decorated borders. A visual examination of the relevant images shows that for these pages, the annotators had to choose between including some background in the page zone or excluding part of the actual page from that zone in order to draw a rectangular shape, and the choices they made in that regard explain the IoU score. For the decorated borders, the differences mainly come from whether or not some decorations stemming from initials were identified as border decorations.
The IoU scores are overall very satisfactory, and we can, therefore, assume that our annotated dataset makes for good training data.
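The pairwise comparison behind the inter-annotator agreement can be sketched as follows for rectangular annotations. This is a minimal illustration: it assumes that the boxes drawn by the different annotators for the same element have already been matched, a step the text does not detail, and the function names are hypothetical.

```python
from itertools import combinations


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def pairwise_agreement(boxes_by_annotator):
    """Mean IoU over all pairs of annotators for one element.

    boxes_by_annotator: dict mapping an annotator name to the box drawn
    for that element (at least two annotators are expected).
    """
    pairs = list(combinations(boxes_by_annotator.values(), 2))
    return sum(box_iou(a, b) for a, b in pairs) / len(pairs)
```

With three annotators, each element contributes three pairwise scores, so every annotator is compared against the two others.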
2.5. Corpus statistics
Among the 600 selected pages, some had been incorrectly classified, such as bookbinding, or images containing technical or bibliographical information, or signaling issues during the digitization process.
Since the selection process aimed at including infrequent images, it picked up more abnormal images that were not relevant for this structural annotation than a random selection would have. Moreover, we also excluded from this annotation process the calendar pages, opting to deal with them separately due to their tabular structure. The final corpus encompasses 557 annotated images444https://github.com/oriflamms/HORAE/.
The annotations are distributed as follows: 557 images, 797 pages, 843 text regions (including 51 border_text), 12,512 text lines, 284 miniatures, 892 decorated_borders, 118 illustrated_borders, 2,776 decorated_initials, 551 simple_initials, 22 historiated_initials, 1,112 line_fillers, 5 ornamentation, 4 music_notations.
The number of ornamental border zones is greater than the number of pages because the annotators drew one box for each decorated margin (left, right, upper, lower). The alternative, a single rectangular shape covering all decorated margins, would also have contained other elements, and we would have had to reconstruct the actual border by subtracting the overlapping miniature or text zones.
3. Automatic analysis
3.1. System description
The variety and diversity of historical documents and the small amount of annotated data available prevent us from using an off-the-shelf network designed for analyzing more conventional document images. Such documents call for a document analysis solution that is flexible and capable of generalization. We therefore opted for the dhSegment network (Ares Oliveira et al., 2018) to run an automatic analysis over the data presented in the previous section. The network has shown competitive results on different historical document processing tasks. It presents several advantages: it works with a small amount of training data, has a reduced training time, and addresses various tasks such as baseline extraction and layout analysis. In addition, the implementation of the approach and of different post-processing blocks is open-source.
The network is a Fully Convolutional Neural network and consists of a contracting path based on a ResNet-50 architecture and an expanding path that outputs full input resolution feature maps.
In order to analyze our data, two experiments are performed. Each experiment consists of line detection and a layout analysis using dhSegment. All the experiments are tested over the same dataset of 30 pages selected from the annotated dataset.
The two modules of the first experiment are trained over 220 images with 7 validation images. Both modules are trained for 30 epochs. The modules of the second experiment are trained for 60 epochs over 510 images with 17 validation images taken from the annotated data. For both experiments, the input images are not resized but patches of size 400 × 400 are used in order to run batch training.
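To make the patch-based setup concrete, the sketch below computes the top-left corners of 400 × 400 patches that tile an image of arbitrary size. It is only a deterministic illustration under our own assumptions, not a description of how dhSegment samples its training patches, and the function name `patch_origins` is hypothetical.

```python
def patch_origins(width, height, patch=400):
    """Top-left corners (x, y) of patches that cover the whole image.

    The last row and column are shifted inward so that every patch lies
    fully inside the image; images smaller than one patch yield a single
    origin at (0, 0).
    """
    def starts(size):
        if size <= patch:
            return [0]
        positions = list(range(0, size - patch, patch))
        positions.append(size - patch)  # final patch flush with the edge
        return positions

    return [(x, y) for y in starts(height) for x in starts(width)]
```

Working on fixed-size patches lets pages of very different resolutions be grouped into batches without resizing them.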
After both experiments, we apply a post-processing step that consists in creating rectangles around the predictions. First, a threshold binarization is applied to the prediction for each class; then, for each binary image, the contours of the connected components are detected and all the pixels of each bounding rectangle are assigned to the class.
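This post-processing can be sketched in plain Python for a single class map. The sketch assumes 4-connectivity and a "greater than or equal" comparison for the threshold, neither of which is specified in the text; the function names are hypothetical, and a real pipeline would rely on an image processing library instead.

```python
from collections import deque


def rectangles_from_probabilities(prob_map, threshold):
    """Binarize one class probability map and return the bounding box
    (x_min, y_min, x_max, y_max) of each 4-connected component."""
    height, width = len(prob_map), len(prob_map[0])
    mask = [[value >= threshold for value in row] for row in prob_map]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first traversal of one connected component.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes


def fill_rectangles(boxes, width, height):
    """Assign every pixel inside each bounding rectangle to the class."""
    filled = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                filled[y][x] = 1
    return filled
```

Filling the whole bounding rectangle means that an L-shaped or hollow component ends up covering pixels it did not originally contain.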
The performances for these semantic segmentation tasks are measured using a pixel-wise Intersection-over-Union (IoU).
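For reference, the usual definition of the pixel-wise IoU for one class can be written as below. The symbols are introduced here for illustration only: $P_c$ is the set of pixels predicted as class $c$ and $G_c$ the set of pixels annotated as class $c$. How the scores are aggregated over classes and pages is not specified in the text.

```latex
\mathrm{IoU}_c = \frac{\lvert P_c \cap G_c \rvert}{\lvert P_c \cup G_c \rvert}
```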
3.3.1. Line detection
For this task, the pixels of the documents can be assigned either to the text class or to the background. To obtain the binary masks, a step of thresholding is applied for each class to the feature maps. The threshold value has been optimized on the validation set of the first experiment (7 images) as presented in Table 4 and the threshold value was chosen.
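Optimizing the threshold on a validation set amounts to a simple search over candidate values, as in the sketch below. It assumes that the criterion is the mean pixel-wise IoU over the validation pages and that both empty masks count as perfect agreement; the candidate values and function names are hypothetical.

```python
def binary_iou(predicted, reference):
    """Pixel-wise IoU between two binary masks (lists of rows of 0/1)."""
    intersection = union = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            intersection += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def best_threshold(prob_maps, references, candidates):
    """Return the candidate threshold with the highest mean IoU.

    prob_maps: one probability map per validation page (rows of floats).
    references: the matching binary ground-truth masks.
    """
    def mean_iou(threshold):
        scores = []
        for prob_map, reference in zip(prob_maps, references):
            mask = [[1 if value >= threshold else 0 for value in row]
                    for row in prob_map]
            scores.append(binary_iou(mask, reference))
        return sum(scores) / len(scores)

    return max(candidates, key=mean_iou)
```

With only seven validation images the selected value can be sensitive to individual pages, so it is worth inspecting the IoU for each candidate rather than the maximum alone.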
As for post-processing, the connected components smaller than a given number of pixels are removed. After several experiments, we chose the value of pixels and we observed that the impact of changing the value on the results is low. Finally, the bounding rectangle of each connected component is detected.
The results obtained by the two models are presented in Table 5. Training with more data did not show a notable gain. This result indicates that the neural network could be made more complex and optimized to benefit from more data and improve the extraction results.
3.3.2. Layout analysis
For this second task, the pixels of the documents can be assigned to one of the following classes: decorated border, illustrated border, miniature, text region, ornamentation, historiated initial, decorated initial, simple initial, line-filler, and background. In this experiment, the threshold chosen is which is the best value obtained when optimizing this parameter on the validation set as presented in Table 4. As for the line detection task, the connected components smaller than the default value of pixels are removed and the bounding rectangles are detected.
The results for this layout analysis task are presented in Table 5. The obtained IoU is lower than for the line segmentation task due to the greater number of classes involved in this task. Training with more data shows more gain than for the first task; however, this gain is still low. We can expect that more complex neural networks would improve with more data.
Figures 5 and 6 show images that have been analyzed. In the first figure, the text lines are well detected and the boxes are similar to the expected ones. When considering the textual content of a page, it is reasonable to include the initials in the text lines. Therefore, during the annotation process, some initials have been annotated as part of text lines, as one can see in the top example. However, depending on the future application of the extracted text lines, it can be more appropriate to exclude them. This ambiguity has been partly corrected by the model since the majority of the initials are not included in the detected text lines.
The second figure shows the result of the layout analysis applied to two images. The results are also satisfactory; however, some problems appear when creating the bounding boxes. One can see the problem in the second example, where a big green rectangle has been drawn, including some text, instead of several small green rectangles around the decorated borders.
In this paper, we introduce HORAE, a new dataset of annotated pages selected from a large number of books of hours. The corpus has been selected to include a large variety of page types using clustering and outlier detection. The layout of the pages has been fully manually annotated both for the text and for the elements of decoration. A state-of-the-art automatic segmentation system based on a deep neural network has been trained and first reference results are reported. The segmentation results are already satisfactory but the size of the dataset allows for improvement, for example with more complex neural networks.
This work benefited from the support of the project HORAE ANR-17-CE38-0008 of the French National Research Agency (ANR) and from the project "Horae Pictavenses: origines et provenances des manuscrits poitevins étudiés dans le texte et l’image" funded by Equipex Biblissima.
- DhSegment: a generic deep-learning approach for document segmentation. In International Conference on Frontiers in Handwriting Recognition.
- Automatic page classification in a large collection of manuscripts based on the international image interoperability framework. In International Conference on Document Analysis and Recognition.
- Transcription alignment of Latin manuscripts using hidden Markov models. In Workshop on Historical Document Imaging and Processing.
- Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters 33 (7), pp. 934–942.
- Picture extraction from digitized historical manuscripts. In International Conference on Image and Video Retrieval.
- Heures : reconnaissance de l’écriture manuscrite, catégorisation automatique, éditions – horae. Note: https://anr.fr/Projet-ANR-17-CE38-0008, last accessed August, 2019.
- HDBSCAN: hierarchical density based clustering. The Journal of Open Source Software 2 (11).
- HBA 1.0: a pixel-based annotated dataset for historical book analysis. In Workshop on Historical Document Imaging and Processing.
- Embedding Projector: Interactive Visualization and Interpretation of Embeddings. In NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems.
- Integrated DH. rationale of the HORAE research project. In Digital Humanities.
- Résistance au changement ? les écritures des livres d’heures dans l’espace français (1200-1600). In ’Change’ in Medieval and Renaissance Scripts and Manuscripts. Proceedings of the 19th Colloquium of the Comité international de paléographie latine (Berlin, 16-18 September, 2015), Turnhout, pp. 101–120.
- Automatic Single Page-Based Algorithms for Medieval Manuscript Analysis. Journal on Computing and Cultural Heritage 10 (2), pp. 1–22.