ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records

03/08/2019, by Foteini Simistira Liwicki, et al.

We propose a Historical Document Reading Challenge on Large Chinese Structured Family Records, in short ICDAR2019 HDRC CHINESE. The objective of the proposed competition is to recognize and analyze the layout, and finally to detect and recognize the textlines and characters of a large historical document collection containing more than 20 000 pages, kindly provided by FamilySearch.


I Competition Protocol and Data

We invite all researchers and developers in the field of document layout analysis to register and participate in the new Historical Document Reading Challenge on Large Chinese Structured Family Records.

We propose 3 different tasks for this competition:

  • Task 1: Handwritten Character Recognition on extracted textlines

  • Task 2: Layout Analysis on structured historical document images

  • Task 3: Complete, integrated textline detection and recognition on a large dataset

I-A Dataset

The dataset is provided by FamilySearch (https://www.familysearch.org/) and consists of the following collections:

  • The test set consists of a total of 1 757 images selected from 12 separate books.

  • The training set consists of a total of 19 360 images selected from another set of 37 separate books.

FamilySearch-DB is a collection of Chinese manuscripts that have been chosen for the complexity of their layout, semantic structure, and font. All manuscripts are annotated using Aletheia [1], an advanced system for accurate and yet cost-effective ground truthing of large amounts of documents. The annotations of the manuscripts are available in PAGE XML format, a sophisticated XML schema that is a component of the PAGE (Page Analysis and Ground truth Elements) Format Framework [2].

I-B Task Description

In this competition, we propose 3 different tasks:

  • Task 1: Handwritten Character Recognition on Extracted Textlines

  • Task 2: Layout Analysis on structured historical document images

  • Task 3: Complete, integrated textline detection and recognition on a large dataset

I-B1 Handwritten Character Recognition on Extracted Textlines

The scope of this task is to recognize (OCR) the given extracted textlines and, if possible, to find the segmentation points of the characters. The advantage of the character competition is that, once the characters are segmented and recognized, we would be able to generate synthetic historical images. Training data will also be available in PAGE-XML format.

There will be at least 100 different characters to be recognized, each with at least 20 samples. The distribution of characters follows a typical natural distribution: some characters have more than one thousand instances, while thousands of characters have only a few instances. We plan to map the less frequent characters to the class label unknown.
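As an illustration, a minimal Python sketch of how such a mapping could be built from the training transcriptions is given below; the threshold of 20 samples and the label name "<unk>" are our assumptions, not part of the official tools.

from collections import Counter

def build_label_map(transcriptions, min_count=20):
    # Count every character over all ground-truth text lines and map
    # characters below the frequency threshold to a single unknown label.
    counts = Counter(ch for line in transcriptions for ch in line)
    return {ch: (ch if n >= min_count else "<unk>") for ch, n in counts.items()}

# Toy usage: rare characters collapse to "<unk>"
lines = ["張氏族譜", "張公諱某", "生於康熙"]
print(build_label_map(lines, min_count=2))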

I-B2 Layout Analysis

The scope of this task is to segment the page into different classes by assigning a different pixel value to each class. There are 2 annotated classes:
RGB = 0b00…1000 = 0x000008: text (foreground)
RGB = 0b00…0001 = 0x000001: non-text (background)
The training data will be available as pixel-labeled images. To avoid unfair penalties in the boundary regions, we add a value for boundary pixels:
RGB = 0b10…0000 = 0x800000: boundary pixel (to be combined with one of the classes, except background)
For example, a boundary text pixel is represented as:
boundary + text = 0x800008
Mislabeling between the foreground and background in the boundary region will not be penalized in the final evaluation (see Section II).
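For illustration, the following Python sketch decodes such a pixel-labeled ground truth image into a class map and a boundary mask; the helper name and the use of NumPy are our own assumptions, while the bit values are the ones listed above.

import numpy as np

TEXT = 0x000008        # text (foreground)
BACKGROUND = 0x000001  # non-text (background)
BOUNDARY = 0x800000    # boundary flag, combined with a class value

def decode_labels(rgb):
    # rgb is an (H, W, 3) uint8 array; pack R, G, B into one 24-bit value
    # to recover the encoded label, then split off the boundary bit.
    packed = (rgb[..., 0].astype(np.uint32) << 16) | \
             (rgb[..., 1].astype(np.uint32) << 8) | \
             rgb[..., 2].astype(np.uint32)
    boundary_mask = (packed & BOUNDARY) != 0
    class_id = packed & ~np.uint32(BOUNDARY)  # e.g. 0x800008 -> 0x000008
    return class_id, boundary_mask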

I-B3 Textline Detection and Recognition

The scope of this task is to detect and recognize (OCR) the textlines of a given document image. The training data will also be available in PAGE-XML format. The PAGE-XML file contains the location of each textline and its corresponding text.
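For orientation, a minimal Python sketch of reading textlines from a PAGE-XML file is shown below; the namespace URI corresponds to the 2013-07-15 PAGE schema, and the exact schema version used by the competition files may differ.

import xml.etree.ElementTree as ET

PAGE_NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def read_textlines(page_xml_path):
    # Return (polygon points, transcription) pairs for every TextLine element.
    root = ET.parse(page_xml_path).getroot()
    lines = []
    for line in root.iter("{%s}TextLine" % PAGE_NS["pc"]):
        coords_el = line.find("pc:Coords", PAGE_NS)
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", PAGE_NS)
        points = coords_el.get("points") if coords_el is not None else ""
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((points, text))
    return lines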

I-C Submission Types

We allow three different submission formats: an executable file (or a bash script), a VirtualBox image, or a Docker image:

  • Executable:

    • All dependencies should be in the same (sub)directory;

    • Provide a link for downloading the corresponding zip file.

  • VirtualBox Image:

    • Provide a download link for the VirtualBox image;

    • Provide instructions on how the method can be executed inside the VirtualBox.

  • Docker Image:

    • Provide the reference image name as hosted on Docker Hub (see https://hub.docker.com);

    • Provide instructions on how the method inside the Docker image can be executed.

II Evaluation Tools and Metrics

II-A Task 1: Handwritten Character Recognition on extracted textlines

The evaluation of Task 1 will be based on the edit distance between two text strings, i.e., the minimum number of operations (insertions, deletions, and substitutions) needed to transform one string into the other. More details can be found at: https://web.stanford.edu/class/cs124/lec/med.pdf
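For reference, a minimal Python sketch of this edit-distance computation is given below; the official evalTask1.py may normalize or aggregate the distances differently.

def edit_distance(ref, hyp):
    # Standard dynamic programming: minimum number of insertions,
    # deletions, and substitutions turning hyp into ref.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("張氏族譜", "張氏旅譜"))  # 1 (one substitution)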

The evaluation tool for this task is written in Python and takes two input arguments:

  • GT-Folder: the folder where the ground truth text files are stored.

  • Predicted-Folder: the folder where the predicted text files are stored.

Usage: python evalTask1.py GT-Folder Predicted-Folder

II-B Task 2: Layout Analysis on structured historical document images

The evaluation of Task 2 will be similar to that of our previous competition, and the evaluation tool is freely available as open source on GitHub (https://github.com/DIVA-DIA/DIVA_Layout_Analysis_Evaluator). More information about this evaluation tool can be found in [3].

The evaluation of the layout analysis at pixel level is based on the Intersection over Union (IU), as proposed in [4], as the ranking metric. The IU, also known as the Jaccard Index, is defined as:

\[
\mathrm{IU} = \frac{TP}{TP + FP + FN} \tag{1}
\]

where TP denotes the True Positives, FP the False Positives, and FN the False Negatives.

For each page, the IU is computed class-wise (background, text, don’t care regions) and then averaged. The final evaluation of a system is then obtained by averaging the IU of all pages of the dataset.
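A minimal Python sketch of this per-page, class-wise IU computation follows; how the official tool handles the don't-care (boundary) pixels is described above, and here we simply assume they have been excluded beforehand.

import numpy as np

def page_iu(gt, pred, classes):
    # gt and pred are integer class maps of equal shape;
    # classes lists the class ids to evaluate (e.g. background, text).
    ius = []
    for c in classes:
        tp = np.sum((gt == c) & (pred == c))
        fp = np.sum((gt != c) & (pred == c))
        fn = np.sum((gt == c) & (pred != c))
        denom = tp + fp + fn
        if denom > 0:
            ius.append(tp / denom)
    return float(np.mean(ius))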

In order to provide the user with a more exhaustive evaluation of the prediction quality, the tool outputs several other standard metrics, including F1-score, precision, and recall, both for each class and averaged over the classes. Additionally, a human-friendly visualization of the results is provided in the form of an output image obtained by overlaying the evaluated prediction onto the original image. This is useful for getting a quick estimate of the results and for detecting areas of improvement for the evaluated method.

II-C Task 3: Complete, integrated textline detection and recognition on a large dataset

The evaluation of Task 3 will be based on the following metrics:

  • insertedNodes total nodes inserted to transform one aligned representation into the other.

  • deletedNodes total nodes deleted.

  • substitutedNodes total nodes substituted.

  • insertedEdges total edges inserted.

  • deletedEdges total edges deleted.

  • totalNodes total number of nodes.

  • totalElements total number of elements in the aligned GT representation, without counting the ending graph edge.

  • totalErrors total errors counted.

  • errorRatio overall error ratio (see the sketch following this list).
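Assuming that totalErrors is the plain sum of the node and edge operations and that errorRatio normalizes it by totalElements, a toy Python sketch could look as follows; the official evalTask3.py may weight or combine the counts differently.

def error_ratio(inserted_nodes, deleted_nodes, substituted_nodes,
                inserted_edges, deleted_edges, total_elements):
    # Sum all edit operations on the aligned representations and
    # normalize by the number of ground-truth elements.
    total_errors = (inserted_nodes + deleted_nodes + substituted_nodes
                    + inserted_edges + deleted_edges)
    return total_errors / total_elements if total_elements else 0.0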

The evaluation tool for this task is written in Python and takes two input arguments:

  • GT-Folder the folder where the ground truth PAGE XML files are stored.

  • Predicted-Folder the folder where the predicted PAGE XML files are stored.

Usage: python evalTask3.py GT-Folder Predicted-Folder

Important note. The predicted XML files must have exactly the same schema/structure as the provided ground truth XML files. If a predicted XML file does not match the schema/structure of the corresponding ground truth XML file, it will not be considered and will instead be counted as an error. If any ground truth XML file is found to be invalid, it will not be evaluated (please report such ground truth files if you find them). Note that the order of the lines is important: they should be in the same reading order as in the given ground truth XML file.

The winner of this task will receive an award prize in USD provided by FamilySearch.

III Acknowledgements

We would like to thank the DIVA Group (https://diuf.unifr.ch/main/diva/) of the University of Fribourg, Switzerland, and especially Michele Alberti, for providing us with the open source evaluation tool for Task 2 (Layout Analysis on structured historical document images).

References

  • [1] C. Clausner, S. Pletschacher, and A. Antonacopoulos, "Aletheia - An advanced document layout and text ground-truthing system for production environments," in Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 2011, pp. 48–52.
  • [2] S. Pletschacher and A. Antonacopoulos, "The PAGE (Page Analysis and Ground-truth Elements) format framework," in Pattern Recognition (ICPR), 2010 20th International Conference on. IEEE, 2010, pp. 257–260.
  • [3] M. Alberti, M. Bouillon, M. Liwicki, and R. Ingold, "Open evaluation tool for layout analysis of document images," in International Workshop on Open Services and Tools for Document Analysis, 2017.
  • [4] U.-V. Marti and H. Bunke, "Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system," International Journal of Pattern Recognition and Artificial Intelligence, vol. 15, no. 01, pp. 65–90, 2001.