TreyNet: A Neural Model for Text Localization, Transcription and Named Entity Recognition in Full Pages

12/20/2019 ∙ by Manuel Carbonell, et al. ∙ Universitat Autònoma de Barcelona 27

In recent years, the consolidation of deep neural network architectures for information extraction in document images has brought large improvements in the performance of each of the tasks involved in this process: text localization, transcription, and named entity recognition. However, this process is traditionally performed with separate methods for each task. In this work we propose an end-to-end model that jointly performs handwritten text detection, transcription, and named entity recognition at page level, and that can benefit from features shared across these tasks. We exhaustively evaluate our approach on different datasets, discussing its advantages and limitations compared to sequential approaches.




1 Introduction

Information extraction from document images consists of transcribing textual contents and classifying them into semantic categories (i.e. named entities). It is a necessary process in digital mailroom applications for business documents, or in record linkage for historical manuscripts. Information extraction involves localizing, transcribing and annotating text, and varies from one domain to another. Despite the recent improvements in neural network architectures, efficient information extraction from unstructured or semi-structured document images remains a challenge, and human intervention is still required.


In the particular case of handwritten text interpretation, the performance of handwritten text recognition (HTR) is strongly conditioned by the accuracy of a previous segmentation step. Conversely, segmentation performance can be boosted if the words are recognized. This chicken-and-egg problem (namely Sayre’s paradox) can also be stated for other steps in the information extraction pipeline: transcription vs named entity recognition, or localization vs named entity categorization when there is an inherent positional structure in the document (e.g. census records, invoices or registration forms).

This fact motivates us to hypothesize that joint models may be beneficial for information extraction. In carbonell_jointner we studied the combination of handwritten text recognition with named entity recognition in text lines. Later on, we explored in carbonell_wml the interaction of text localization and recognition in full pages. In both works, we observed a benefit when leveraging the dependency of these pairs of tasks with a single joint model.

In this work we go a step further and explore the combination of the three tasks for information extraction in full pages by unifying the whole process in a single end-to-end architecture. We test our method on different scenarios, including datasets in which there is relevant bi-dimensional contextual information for the named entity tag, or in which there is an inherent syntactic structure in the document. Thus, we explore the benefits and limitations of an end-to-end model in comparison with architectures that integrate the different tasks of the pipeline as standalone components. We experimentally validate the different alternatives on different kinds of documents, considering in particular how relevant the geometric context is, how regular the layout is, and how densely named entities appear in the document.

As far as we know, this is the first method that performs end-to-end information extraction in full handwritten pages. Our joint model is able to exploit benefits from task interaction in cases where these are strongly interdependent. Another strength of the method is its versatility, as it can be used in a wide variety of information extraction scenarios. Finally, we also contribute with a baseline for full-page information extraction in semi-structured heterogeneous documents of the IEHHR competition dataset competition.

The paper is organized as follows. In section 2 we overview the related work. In section 3 we describe our joint model. In sections 4 and 5, we present the datasets, the experimental results and discuss the advantages and limitations of our joint model. Finally, in section 6 we draw the conclusions.

Figure 1: Overview of the proposed method. Convolutional features are extracted using FPN. The classification and regression branches calculate the positive boxes and the recognition branch predicts the transcription of the content of each box. Binary cross entropy, squared-sum and CTC losses are backpropagated through the whole model.

2 Related Work

Since information extraction implies localizing, transcribing and recognizing named entities in text, we review works that deal with each of these parts separately and with pairwise task unified models.

Text localization, which can be framed as an object detection problem, follows two main paradigms: one-stage and two-stage. In Uijlings13 a two-stage method is proposed that first generates a sparse set of candidate proposals and then classifies the proposals into the different classes and background. Regions with CNN features (R-CNN) DBLP:journals/corr/Girshick15 replaced the second stage with a CNN, improving on the previous methods. The next big improvement in performance and speed came with Faster-RCNN DBLP:journals/corr/Girshick15, where the concept of anchors was introduced. When speed is prioritized over accuracy, one-stage detection is the best option. Concretely, SSD DBLP:journals/corr/LiuAESR15 and YOLO DBLP:journals/corr/RedmonDGF15 have brought one-stage methods close to two-stage ones in precision while being much faster. In focal_loss, the lower precision of one-stage methods with respect to two-stage ones is attributed to class imbalance, and the focal loss is introduced to cope with this problem, achieving state-of-the-art performance in both accuracy and speed.

Regarding the transcription part, many HTR methods already perform joint segmentation and recognition at line level to cope with Sayre’s paradox. In this way, they can avoid segmentation at character or word level. However, this only partially solves the segmentation problem, because lines that are not properly segmented still affect the recognition. For this reason, some recent approaches propose to recognize text at paragraph level Bluche2017ScanAA, puigcerver2017multidimensional. Still, an inaccurate segmentation into paragraphs will affect the HTR performance.

Taking these considerations into account, a joint method that performs both tasks can cope with the noise in the predicted segmentation and obtain better transcriptions. In our previous work carbonell_wml we proposed a model that predicts text boxes together with their transcription directly from the full page, by applying RoI pooling to shared convolutional features. In this way we avoid the need for a perfect segmentation to get a good transcription at word level. It must be mentioned that in Wigington_2018_ECCV the benefits of end-to-end learning for full page text recognition are put in doubt, since the best transcription performance is achieved by detecting the start of the text line, segmenting it with a line follower and then transcribing it, using three separately trained networks. Nevertheless, no results are shown for the multi-task end-to-end trained model, so no definitive conclusion can be drawn.

There are several approaches for named entity recognition. In the scenario where we have error-free raw digitized text and the goal is to sequentially label entities, most approaches Lample2016NeuralAF, ma-hovy-2016-end, akbik-etal-2018-contextual use stacked long short-term memory network layers (LSTM) to recognize sequential word patterns and a conditional random field (CRF) to predict tags for each time step hidden state. Also, character-level word representations capture morphological and orthographic information that, combined with the sequential word information, achieves good results.

In the previously mentioned cases, error-free raw text is assumed for named entity recognition. When text is extracted from scanned documents, the situation changes. In spp_toledo a single convolutional neural network (CNN) is used to directly classify word images into different categories, skipping the recognition step. This approach, however, does not make use of the context surrounding the word to be classified, which might be critical to correctly predict named entity tags. In Rowtula18 and Toledo2019InformationEF a CNN is combined with a long short-term memory (LSTM) network to classify entities sequentially, thereby making use of the context and achieving good results. This is improved in carbonell_jointner and wigington2019 by joining the tasks of text recognition and named entity recognition, minimizing the Connectionist Temporal Classification (CTC) loss Graves:2006:CTC:1143844.1143891 for both. Still, in these works there is no bi-dimensional context pattern analysis. Very recently, an attention-based method guo2019eaten performs entity extraction in a very controlled scenario such as ID cards, where a static layout implies that it is not necessary to detect complex text bounding boxes. In summary, all these works suggest that it is promising to explore methodologies that integrate the three tasks in a unified model.

3 Methodology

As introduced before, our model extracts information in a unified way. First, convolutional features are extracted from the page image, and then, different branches analyze these features for the tasks of classification, localization, and named entity recognition, respectively. An overview of the architecture is shown in Figure 1.

3.1 Shared features

Since the extracted features must be used for very different tasks, i.e. localization, transcription and named entity recognition, we need a deeper architecture than the one used for each isolated task. According to recent work on object detection and text semantic segmentation DBLP:journals/corr/LinDGHHB16, a proper architecture to extract convolutional features from the image is Residual Network 18 (ResNet18) DBLP:journals/corr/HeZRS15. We have considered exploring deeper architectures or variations of ResNet18 to improve the final performance, yet the scope of this work is not to find the best feature extraction but to unify the whole information extraction process. ResNet18 consists of 5 residual convolutional blocks. Each of those encloses 2 convolutional layers, followed by a rectified linear unit activation and a residual connection. Table 1 shows the detailed list of blocks and configuration of the shared feature extractor.

Layer             Output shape    Kernel size   Kernels
res-conv-block 1  H/2 × W/2       3 × 3         64
res-conv-block 2  H/4 × W/4       3 × 3         64
res-conv-block 3  H/8 × W/8       3 × 3         128
res-conv-block 4  H/16 × W/16     3 × 3         256
res-conv-block 5  H/32 × W/32     3 × 3         512
Table 1: ResNet18 architecture used for feature extraction.

We have also tried to reduce the number of layers of ResNet18, but then training converged more slowly and the final validation error was higher than with the full ResNet18 architecture. We attribute this effect to the increase of noisy detections and false positives, in which the model confused relevant with non-relevant text (see Figure 5). Consequently, we chose an intermediate-depth model that can tackle such complex tasks at once. The output of the Feature Pyramid Network is a set of 5 downsampled feature maps with scales 8, 16, 32, 64, 128. Each of these is forwarded to the upcoming branches and their outputs are stacked in a single tensor, from which we later select the most confident predictions.

3.2 Classification branch

Concerning the detection and classification of objects in images, there are different approaches to predicting the probability of an object being present at a given location of the image. The two most used options are either to regress the intersection over union (IoU) of the predicted box with the ground truth box, or to predict the probability of each object class for each location with a separate branch, encoded as a one-hot vector, as in DBLP:journals/corr/LinDGHHB16. We have chosen this second option due to its performance on a wide variety of data sets. The architecture of this branch is shown in Table 2.

Layer         Output shape   Kernel size   Kernels
conv-block 1  H/d × W/d      3 × 3         256
conv-block 2  H/d × W/d      3 × 3         256
conv-block 3  H/d × W/d      3 × 3         256
conv-block 4  H/d × W/d      3 × 3         256
conv-block 5  H/d × W/d      3 × 3         1
Table 2: Classification and regression branch architectures, where the downsampling levels are d ∈ {8, 16, 32, 64, 128}.

We also explored using this branch as a named entity classifier. The motivation is to take context into account by predicting the presence of certain features in a neighbourhood of a point of the convolutional grid. The difficult part comes when attempting to capture dependencies between distant parts of the image, as a sequential approach does. More specifically, the classification branch (or objectness branch, in the case of a pure text localizer) is trained with the following cross-entropy loss function:

$$L_{cls}(p, y) = -\sum_{i}\sum_{c}\left[\, y_{i,c}\log p_{i,c} + (1 - y_{i,c})\log(1 - p_{i,c}) \,\right] \qquad (1)$$

where $p_{i,c}$ is the predicted probability of class $c$ at anchor $i$ and $y_{i,c}$ is the corresponding entry of the one-hot target.
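As a rough illustration of the classification objective, the following sketch computes a binary cross-entropy over one-hot targets in plain Python. The function name, the clipping constant `eps`, the averaging over all entries and the example classes are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all (anchor, class) entries.

    probs   -- list of rows, one per anchor, with predicted class probabilities
    targets -- list of rows with the matching one-hot (0/1) labels
    eps     -- clipping constant (an assumption) that avoids log(0)
    """
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, targets):
        for p, y in zip(p_row, y_row):
            p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# Two anchors and three hypothetical classes (e.g. name, location, other).
probs = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
loss = binary_cross_entropy(probs, targets)
```

In a pure text localizer the same function would be applied with a single objectness column per anchor.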


3.3 Regression branch

To predict the coordinates of the positive boxes, the regression branch receives the shared features and, after 4 convolutions with rectified linear unit (ReLU) activations, predicts the offset values of the predefined anchors. Formally:

$$\hat{b} = a + \delta$$

where $\hat{b}$ are the predicted box coordinates, $\delta$ are the predicted offsets and $a$ are the predefined anchor box values. The offset of the predefined anchors is regressed by minimizing the mean square error:

$$L_{reg}(\delta, t) = \frac{1}{N}\sum_{i=1}^{N}(\delta_i - t_i)^2$$

$t$ being the vector of target offsets of the anchors for each ground truth box. The anchors are generated as the combination of the ratios $1/2$, $1$, $2$ and the scales $1$, $2^{1/3}$, $2^{2/3}$, with a base size of 32 (9 anchors), as shown in Figure 2.

Figure 2: Anchors combining ratios $1/2$, $1$, $2$ and scales $1$, $2^{1/3}$, $2^{2/3}$, with a base size of 32 (9 anchors).
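The anchor construction can be sketched as follows. This is a minimal example that assumes RetinaNet-style scales ($2^0$, $2^{1/3}$, $2^{2/3}$), interprets each ratio as height over width, and applies purely additive offsets; the function names and these conventions are illustrative assumptions rather than the authors' code.

```python
def generate_anchors(base_size=32, ratios=(0.5, 1.0, 2.0),
                     scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Return anchors as (x1, y1, x2, y2) boxes centered at the origin.

    Each ratio is read as height / width (an assumption), and every anchor
    keeps the area (base_size * scale) ** 2.
    """
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            width = (area / ratio) ** 0.5
            height = width * ratio
            anchors.append((-width / 2, -height / 2, width / 2, height / 2))
    return anchors

def apply_offsets(anchor, offsets):
    """Shift an anchor by additive offsets, one per coordinate (assumed form)."""
    return tuple(a + d for a, d in zip(anchor, offsets))

anchors = generate_anchors()  # 3 ratios x 3 scales = 9 anchors per location
shifted = apply_offsets((0, 0, 10, 10), (1, -1, 2, 0))
```

At inference time these 9 templates would be replicated at every point of each pyramid level before the predicted offsets are applied.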

3.4 Feature pooling

Once we have predicted the class probabilities and the coordinate offsets for each anchor in each point of the convolutional grid, we select the boxes whose confidence score surpasses a given threshold, and remove the overlapping ones applying a non-maximal suppression algorithm. With the given box coordinates, we apply RoI pooling DBLP:journals/corr/Girshick15 to the convolutional features of the full page, but saving the input to allow error backpropagation to further branches. We use the 5 levels of the feature pyramid to calculate the box anchor offsets and the objectness values. Conversely, for computational reasons, we only keep the least downsampled features for the text recognition and named entity recognition branches, as we need the highest possible resolution for those tasks.
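The selection step described above (confidence thresholding followed by non-maximum suppression) can be sketched in plain Python. The threshold values, function names and example boxes are assumptions chosen for illustration; the paper does not state the values it uses.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_boxes(boxes, scores, score_thr=0.5, iou_thr=0.5):
    """Keep confident boxes, then drop those overlapping a better-scored one."""
    # Indices of boxes above the confidence threshold, best score first.
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = select_boxes(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the fourth box falls below the score threshold, so only the first and third boxes would be passed on to RoI pooling.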

3.5 Text recognition branch

As in carbonell_wml we build a recognition branch that predicts the text contained in each box. The architecture of this branch, shown in Table 3, consists of two convolutional blocks followed by a fully connected layer. The output of this layer is the probability of each character for each column of each one of the pooled features. From these predictions, we calculate the Connectionist Temporal Classification (CTC) loss Graves:2006:CTC:1143844.1143891. This loss, or Maximum Likelihood error, is calculated as the negative logarithm of the probability of a ground truth text sequence given the network outputs:

$$L_{CTC} = -\sum_{(x, z) \in S} \ln p(z \mid y)$$

$S$ being the set of training input-target pairs $(x, z)$ and $y$ the network sequence outputs for the input $x$.

To calculate this probability, we add the probabilities of all possible paths in the [time steps × alphabet length] matrix with the forward-backward algorithm. Repeated output characters not separated by a blank (non-character) are joined using the collapse function $\mathcal{B}$; for example, $\mathcal{B}$(hheeel-llo) = hello. This loss is added to the classification and regression losses to backpropagate them together for each gradient update.
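To make the collapse function and the path sum concrete, here is a small sketch. It deliberately replaces the forward-backward algorithm with a brute-force enumeration of all paths, which is only feasible for tiny inputs but shows the same quantity; the names `collapse` and `ctc_loss_bruteforce` and the use of `-` as the blank symbol are assumptions for this example.

```python
import itertools
import math

BLANK = "-"

def collapse(path, blank=BLANK):
    """Merge repeated characters, then remove blanks."""
    out = []
    prev = None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch  # a blank resets the repetition, so "l-l" yields "ll"
    return "".join(out)

def ctc_loss_bruteforce(probs, alphabet, target):
    """Negative log-probability of `target`, summed over every valid path.

    probs    -- one probability row per time step, aligned with `alphabet`
    alphabet -- string of symbols, including the blank
    """
    total = 0.0
    for path in itertools.product(range(len(alphabet)), repeat=len(probs)):
        text = collapse("".join(alphabet[i] for i in path))
        if text == target:
            total += math.prod(probs[t][i] for t, i in enumerate(path))
    if total == 0.0:
        return math.inf  # the target cannot be produced in this many steps
    return -math.log(total)

# Two time steps, alphabet {blank, a}, uniform outputs.
uniform = [[0.5, 0.5], [0.5, 0.5]]
loss_a = ctc_loss_bruteforce(uniform, "-a", "a")
```

With two uniform time steps, three of the four paths ("-a", "a-" and "aa") collapse to "a", so the summed probability is 0.75. A practical implementation would use the dynamic-programming recursion instead, since the number of paths grows exponentially with the sequence length.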

Layer            Output shape      Kernel size   Kernels
conv-block 1     pool H × pool W   3 × 3         256
conv-block 2     pool H × pool W   3 × 3         256
Fully connected  pool W            -             -
Table 3: Recognition branch architecture.

3.6 Semantic annotation branch

As we mentioned before, one possibility to assign a semantic tag to each word is to predict its class from the classification branch for each anchor. However, this would not capture the context, as the activations only rely on the convolutional feature maps of a neighborhood of each point. For this reason, we add this network branch to predict the semantic tags as a sequence from the ordered pooled features of each box. For simple layouts, such as single-paragraph pages, the pooled features, which correspond to text boxes in the page, are sorted in reading order (i.e. left to right and top to bottom) by projecting a continuation of the right side of the text box. Once we have the ordered pooled features, we pad them and apply two convolutions followed by a fully connected network, as in a standard named entity recognition architecture. Then, we minimize the cross entropy loss shown in equation 1 for each of the sequence values.
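The reading-order sorting can be approximated with a simple heuristic. The sketch below groups boxes into text lines by checking whether a box's vertical center falls inside the vertical band of the first box of a line, then sorts each line from left to right. This is a simplified stand-in for the projection procedure described above, not a reproduction of it, and the function name is hypothetical.

```python
def reading_order(boxes):
    """Return indices of (x1, y1, x2, y2) boxes in reading order."""
    # Visit boxes from top to bottom so lines are created in page order.
    by_top = sorted(range(len(boxes)), key=lambda i: boxes[i][1])
    lines = []
    for i in by_top:
        y_center = (boxes[i][1] + boxes[i][3]) / 2
        for line in lines:
            ref = boxes[line[0]]  # the first box of a line defines its band
            if ref[1] <= y_center <= ref[3]:
                line.append(i)
                break
        else:
            lines.append([i])  # no existing line matches: start a new one
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda i: boxes[i][0]))
    return ordered

# Two text lines with two words each, given in shuffled order.
boxes = [(60, 0, 100, 10), (0, 2, 50, 12), (0, 20, 40, 30), (50, 21, 90, 31)]
order = reading_order(boxes)
```

The pooled features would then be concatenated following `order` before being padded and fed to the sequential tagger. Skewed lines or multi-column layouts would need a more robust grouping rule.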

3.7 Receptive field calculation

Our approach assumes that each activation of a neuron in the deepest layers of a CNN depends on the values of a wide region of the input image, i.e. its receptive field. It is also important to notice that the closer a pixel is to the center of the field, the more it contributes to the output activation. This can be a useful property for documents where the neighboring words determine the tag of a given word, but it can also be a limitation when distant entities are related in a document. To calculate how much context is taken into account for each unit of the features that are fed to the RoI pooling layer, we must look at the convolutional kernel sizes $k$ and strides $s$ of each layer. In this way, as in Dumoulin2016AGT, we can calculate the receptive field size $r_{out}$ of a feature map from the receptive field size $r_{in}$ of the previous layer’s feature map:

$$r_{out} = r_{in} + (k - 1) \cdot j_{in}$$

where $j$ is the jump between adjacent units of a feature map, measured in input pixels, which increases in every layer by a factor of the stride:

$$j_{out} = j_{in} \cdot s$$

By using these expressions with our architecture (ResNet18 + FPN), we obtain a receptive field size of 1559 in the shared convolutional feature map. That means that, since the input images are 1250 × 1760, the values predicted for each unit mostly depend on the content of the whole page, giving more importance to the location corresponding to the receptive field center.
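The recurrence from Dumoulin2016AGT is easy to evaluate layer by layer. The sketch below applies it to a toy stack of (kernel size, stride) pairs; the stack is a made-up example and not the actual ResNet18 + FPN configuration, so it is not expected to reproduce the value of 1559 reported above.

```python
def receptive_field(layers):
    """Return (receptive field size, jump) after a stack of layers.

    layers -- iterable of (kernel_size, stride) pairs, ordered input to output
    """
    r, j = 1, 1  # one input pixel sees itself; adjacent pixels are 1 apart
    for k, s in layers:
        r = r + (k - 1) * j  # growth depends on the jump of the layer's input
        j = j * s            # the stride multiplies the jump for later layers
    return r, j

# Hypothetical stack: 3x3 stride 2, 3x3 stride 1, 3x3 stride 2.
toy_stack = [(3, 2), (3, 1), (3, 2)]
size, jump = receptive_field(toy_stack)
```

For this toy stack the receptive field grows from 1 to 3, then 7, then 11 pixels, while the jump ends at 4. The same loop applied to the full list of convolution and pooling layers of a backbone gives the receptive field of its last feature map.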

4 Datasets

One of the limitations when exploring learning approaches for information extraction is the scarcity of publicly available annotated datasets, probably due to the confidential nature of this kind of data. Nonetheless, we test our approach on three datasets. The number of pages, words and out-of-vocabulary (OOV) words, as well as the partitions, can be found in Table 4.

4.1 IEHHR

The IEHHR dataset is a subset of the Esposalles dataset competition that has been labeled for information extraction, and contains 125 handwritten pages with 1221 marriage records (paragraphs). Each record is composed of several text lines giving information of the husband, wife and their parents’ names, occupations, locations and civil states. On the sides of each paragraph we find the husband’s family name and the fees paid for the marriage. An example page is shown in Figure 5.

4.2 War Refugees

The War Refugees (WR) archives contain registration forms from refugee camps, concentration camps, hospitals and other institutions, from the first half of the 20th century. We have manually annotated the bounding boxes, transcriptions and entity tags of names, locations and dates. Due to data privacy we cannot share the images; instead, we show in Figure 3 a plot of the normalized bounding boxes of all annotated text, where the colors correspond to different tags. As we can observe, there is a strong pattern relating the text location and its tag, although it is not fixed enough for applying a template alignment method. The main difficulty of this dataset is to distinguish relevant from non-relevant text, which in most cases differs only in its location or in a nearby printed key description. Another challenge is the high proportion (82%) of out-of-vocabulary words, together with the high variability of the writing style and the mixture of printed and handwritten text.

Figure 3: Normalized bounding boxes of the tagged text of all training images in the WR dataset.

4.3 Synthetic GMB

We have generated a synthetic dataset (sGMB) to explore the limitations of our model, concretely in a standard named entity recognition task in which text is unstructured and the number of named entities within the text is low. For this purpose, we have generated synthetic handwritten pages with the text of the GMB dataset Bos2017GMB by using synthetic handwritten fonts and applying random distortions and noise to emulate realistic scanned documents. Although we are aware that it is easier to recognize synthetic documents than real ones, the difficulty here lies in the sequential named entity recognition task, especially because, contrary to the previous datasets, the text does not follow any structure. Part of a generated page can be seen in Figure 4. The code to generate the synthetic pages can be found in

Figure 4: A generated page from the Synthetic GMB (sGMB) dataset. A major difficulty is the sparsity of named entities with respect to the other words.

                 IEHHR   WR      sGMB
Pages   train    79      994     490
        valid    21      231     53
        test     25      323     50
Words   train    2100    2837    7010
        valid    878     731     1740
        test     1020    1033    4085
OOV     #        387     853     1372
        %        37      82      34
Table 4: Characteristics of the datasets used in our experiments.

5 Experiments

In this section we describe the experiments for each dataset.

5.1 Setup

In this work we propose a method for unifying the whole process of extracting information from full pages in a single model. Nevertheless, our approach has been evaluated using the following configurations:

  • A: Triple task model. The first method variation consists in using our proposed model to perform all tasks in a unified way. Thus, we use the classification branch to label words as explained in section 3.2, with no sequential layers but only convolutional ones.

  • B: Triple task sequential model. The second variation also performs the three tasks in a unified way, but by concatenating the pooled features in reading order and predicting the labels sequentially, as explained in section 3.6.

  • C: Detection + named entity recognition. In this case we treat the extraction of the relevant named entities as a detection and classification problem. Here, we ignore the recognition part and only backpropagate the classification and regression losses from their respective branch outputs. We also consider the sequential version of this approach, explained in section 3.6.

  • D: Detection + transcription. Here we combine in a unified model the tasks of localization and transcription, as seen in carbonell_wml, in contrast to an approach in which the two tasks are faced separately, where the recognition model would have to cope with inaccurate text segmentations. Here we aim to observe how precisely we can obtain text boxes and transcriptions, so that named entity recognition can be applied afterwards.

  • CNN classifier. Finally, we evaluate the variability of the cropped words among the different categories and the difficulty of annotating words separately. For this purpose, we train a CNN network, similar to the classification branch from our proposed method, that classifies words without using any shared features for recognition or localization. This means that this network does not benefit from context information.

Full source code for all experiments is publicly available at

5.2 Metrics

Different metrics have been used to evaluate the proposed methodology: one for text detection and named entity recognition, and another for transcription. To evaluate the performance in text localization, we use the mean Average Precision (mAP), the standard metric in object detection. Let $p$ be the precision, i.e. the number of true positives out of the total positive detections, and $r$ the recall, i.e. the number of true positives out of the total ground truth positives (true positives plus false negatives). We consider the recall-precision map $p(r)$, which maps each recall value to the precision obtained when the detection threshold is set to reach that recall. Then, the Average Precision is the value $\int_0^1 p(r)\,dr$, i.e. the area under the precision-recall graph. We also use this metric for evaluating the named entity recognition. To compare the classification performance between our full page model and the CNN for segmented words, we also calculate the F1-score:

$$F_1 = \frac{2\,p\,r}{p + r}$$

For the transcription score we use the Character Error Rate (CER), i.e. the number of insertions, deletions and substitutions needed to convert the output string into the ground-truth one, divided by the length of the ground-truth string. Formally:

$$CER = \frac{i + s + d}{n}$$

where $i$, $s$ and $d$ are the numbers of insertions, substitutions and deletions, and $n$ is the length of the ground-truth string.
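Both scores are straightforward to compute. The following sketch implements the F1-score and the CER with a standard Levenshtein edit distance; normalizing by the ground-truth length is the usual convention and is assumed here, and the function names are chosen for this example only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(target) + 1))  # distances from the empty prefix
    for i, s_ch in enumerate(source, start=1):
        curr = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            curr.append(min(prev[j] + 1,          # delete s_ch
                            curr[j - 1] + 1,      # insert t_ch
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def cer(prediction, ground_truth):
    """Edit distance normalized by the ground-truth length (assumed)."""
    return edit_distance(prediction, ground_truth) / len(ground_truth)

example_cer = cer("helo", "hello")  # one missing character out of five
example_f1 = f1_score(0.5, 1.0)
```

Note that when precision and recall are equal, as assumed for the isolated word classifier in the results, the F1-score reduces to that common value.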

5.3 Results

From the results shown in Table 5, we do not observe significant differences among the approaches in the IEHHR dataset. Indeed, recognizing named entities sequentially does not give better performance than directly using a classification branch on the shared features. This suggests that the local neighborhood information is enough to give correct predictions. Regarding the localization performance, task D (detection and transcription) shows slightly better performance than the triple task approaches, although the improvement is not significant. This suggests that the eavesdropping effect observed in DBLP:journals/corr/Ruder17a does not necessarily occur in every task combination, or that the architecture may be insufficient for this set of tasks with respect to the amount of data. However, the high named entity recognition performance of the triple task approach (case A) suggests that it is beneficial to combine the tasks of named entity recognition and localization. To analyze whether our method makes use of context or whether the content of the word alone is sufficient, we compare its performance to that of the CNN classifier for segmented words by means of the F1-score. In the case of the classifier we consider the total precision and recall to be the same value, since no word can be left without a prediction, and in such a case classification errors can be counted equally as false negatives or false positives. By doing so, we observe a much higher F1-score for the triple task model (93.5) than for the CNN (81.0). We attribute this difference to the beneficial use of the surrounding context of the words.

Regarding the WR dataset, our model (case A) successfully learned to distinguish different layouts, thereby predicting the correct location of the relevant entities and their bounding boxes. Also, the combination of localization and named entity recognition produced an eavesdropping effect, since the accuracy of isolated word classification (CNN classifier) is lower than that of the methods that use contextual information. Taking into account the characteristics of this dataset and the high proportion of out-of-vocabulary words (82%), we assume that predictions were mostly based on the layout or location of the words.

The Synthetic GMB dataset is used to evaluate the limitations of our joint model when facing unstructured documents. Even though performance in text localization and recognition is high, the performance of named entity recognition is low. The main reason is the lack of structure and the high sparsity of relevant entities. If we look at the example in Figure 4, we observe that the vast majority of the words are not named entities (their category is ’other’). In contrast, as shown in Figure 5, the real datasets are densely populated with different types of entities. Therefore, and as expected, the CNN model, which does not take context into account, obtains better results.

Method                IEHHR   WR      sGMB
Text localization (mAP)
A: triple task        0.97    0.976   0.994
B: triple task seq    0.972   0.973   0.994
C: det+ner seq        0.969   0.975   0.997
D: det+htr            0.974   0.981   0.996
Text recognition CER (%)
A: triple task        6.1     23      2.3
B: triple task seq    6.3     28      2.6
D: det+htr            6.5     27      2.5
Named entity recognition (mAP)
A: triple task        0.92    0.972   0.357
B: triple task seq    0.91    0.956   0.594
C: det+ner seq        0.91    0.972   0.560
Isolated Named Entity Recognition (F1)
CNN classifier        0.81    0.83    0.84
Table 5: Performance of the different method variations on each dataset.

6 Discussion and Conclusion

In this paper we have presented a unified neural model to extract information from semi-structured documents. Our method shows the strengths of the pairwise interaction of some of the tasks, such as localization and transcription, and also localization and named entity recognition when the spatial information or the neighbourhood (geometric context) of a text entity influences the value to predict. Nevertheless, comparing the performance of the triple task model variations with the separate approaches, we observe that a unified model can be limited in performance when one specific task is much harder than, and unrelated to, the others. In such a case, a separate approach would allow us to use specific techniques for this difficult unrelated task. For example, named entity recognition performance is limited by the fact that it is very difficult to generate semantically meaningful word embedding vectors (e.g. word2vec, glove) when the model input is a page image.

In summary, we conclude that a joint model is suitable for cases in which there is a strong task interdependence, but not for unstructured documents where the main difficulty is on one independent single task.


This work has been partially supported by the Spanish project RTI2018-095645-B-C21, the European Commission H2020 SME Instrument program, under Grant Agreement no. 849628, project OMNIUS, the grant 2016-DI-095 from the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya, the Ramon y Cajal Fellowship RYC-2014-16831 and the CERCA Programme / Generalitat de Catalunya.

Figure 5: Word localizations, transcriptions and semantic annotations on an unseen page of the IEHHR dataset. The model learns to detect and classify words based not only on their appearance but also on their context. The colors illustrate the different types of named entities.