End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network

12/07/2020, by Denis Coquenet et al.

Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph text recognition is traditionally achieved by two models: the first one for line segmentation and the second one for text line recognition. We propose a unified end-to-end model using hybrid attention to tackle this task. We achieve state-of-the-art character error rates at line and paragraph levels on three popular datasets: 1.90% for RIMES, 4.32% for IAM and 3.63% for READ 2016. The proposed model can be trained without using any segmentation label, contrary to the standard approach. Our code and trained model weights are available at https://github.com/FactoDeepLearning/VerticalAttentionOCR.

1 Introduction

Offline handwritten text recognition consists in recognizing the text in an image scanned from a document. An image of a word, a line or a paragraph of text, or even a full document is analyzed, and the sequence of characters that compose the text is expected as output. This paper focuses on paragraph text recognition based on an end-to-end, segmentation-free neural network. Indeed, while line segmentation and handwriting recognition have been studied for decades now, they remain challenging tasks. Moreover, they have rarely been studied and optimized together in one single trainable system.

Historically, early works have applied segmentation at character level and each character was then classified. Later on, segmentation was applied at word level first, then at line level. But whatever the segmentation level considered, the problem related to the definition of the individual entities to be segmented remains ambiguous. In this study, we go a step further and process a paragraph of text without any explicit segmentation step, neither during training nor during decoding.

To carry out the recognition of whole page images, current approaches still rely on a two-step process. In a first step, the document is segmented into lines of text, which are then recognized in a second step. The recognition is performed by applying an optical model, hence the name Optical Character Recognition (OCR). Taken separately, each of these two steps gives rather good results, but considered together they show three major drawbacks. Firstly, they require ground truth segmentation labels as well as transcription labels at line level, which are very costly to produce by hand. Secondly, this two-step approach accumulates the errors of each individual step: segmentation errors induce OCR errors, while the OCR stage produces its own errors. Finally, a two-step strategy implies that any modification of one stage leads to retraining both stages so as to optimize the performance of the whole model.

The most critical point of a two-step strategy is the need to provide ground truth segmentation labels, which requires a pixel-level definition of a handwritten line. Indeed, apart from the fact that providing segmentation annotations is costly, it raises the question of the definition of a line. Baseline, X-height or bounding boxes are examples of target labels for segmentation that have been frequently used in the literature, all with their pros and cons [Renton2018]. According to the line definition chosen, one also has to define a relevant metric to evaluate the quality of the segmentation, which remains an open problem. The Intersection over Union (IoU) and the mean Average Precision (mAP) are commonly used, but they are a matter of some debate: a high IoU or mAP does not guarantee good transcription results, measured for example by the character error rate. In the same vein, the cross-entropy loss used to train a segmentation system is not directly connected to the ultimate goal of the system: the transcription quality. Finally, let us emphasize that labels frequently vary from one annotator to another. This paper aims at providing a model freed from all of these constraints.

To get rid of those constraints, we suggest using a segmentation-free model that processes whole handwritten paragraphs using an attention process. In this model, recognition and implicit segmentation are learned in an end-to-end fashion, so as to optimize both processes together. Most of the contributions in the literature have successfully used neural networks for line segmentation as well as for text line recognition, reaching state-of-the-art results. In other application domains, attention neural networks have been successfully applied to many tasks such as translation [Bahdanau2014], speech recognition [Chorowski2015], image captioning [Xu2015] or even OCR applied at line level [Chowdhurry2018]. This leads us to think that both tasks (segmentation and recognition) could be handled by a single neural network with attention as a control block. In this paper, we propose an end-to-end model that performs character recognition through implicit line segmentation using an attention mechanism. In this study, we apply such a model at paragraph level.

In brief, we make the following contributions:

  • We propose the Vertical Attention Network: a novel encoder-decoder architecture using hybrid attention for text recognition at paragraph level.

  • The approach relies on an implicit line segmentation performed in the latent space of a deep model.

  • It outperforms the state of the art on RIMES, IAM and READ 2016 datasets compared to both line-level and paragraph-level approaches.

  • We compare this architecture favorably with a standard two-step approach based on line segmentation followed by character recognition.

  • We introduce a new efficient dropout strategy decreasing the character error rate by 0.5 points compared to standard ones.

This paper is organized as follows. Related works are presented in Section 2. Section 3 is dedicated to the presentation of the proposed architecture. Section 4 is devoted to the experimental environment; it provides, inter alia, a description of the datasets and the training and implementation details. Experiments and results are detailed in Section 5. We draw conclusions from this work in Section 6.

2 Related Works

In the literature, only very few works have been devoted to multi-line text recognition, and most studies have concentrated on isolated line recognition. We can classify these pioneering works into two categories: those using an explicit word/line segmentation, requiring segmentation and transcription labels, and those without any segmentation, requiring only transcription labels.

2.1 Approaches using explicit line segmentation

Explicit line segmentation approaches are two-step methods that sequentially detect the text lines of a paragraph and then proceed to their recognition. To our knowledge, [Chung2020] is the only reference that gathers segmentation and text line recognition as two separate networks in the same study. However, many works focus on one of these two tasks separately. Segmentation is generally handled by a Fully Convolutional Network (FCN) as in [Renton2018, DHSegment, ARUNet]. Regarding handwritten line recognition, many types of architectures have been studied: Multi-Dimensional Long-Short Term Memory (MDLSTM) [Voigtlaender2016], hybrid Convolutional Neural Network (CNN) and Bidirectional LSTM (BLSTM) [Wigington2017, Bruno2016] and, more recently, encoder-decoder with attention [Michael2019], Gated Convolutional Network (GCNN) [Coquenet2019] and Gated FCN (GFCN) [Yousef_line, Coquenet2020]. Except for attention-based character recognition models, which use the cross-entropy loss, training an OCR model at line level was made possible thanks to the Connectionist Temporal Classification (CTC) proposed in [CTC]. Indeed, it enables the alignment of output sequences of characters with input sequences of features (or pixels) of different and variable lengths.
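To make this alignment property concrete, here is a minimal PyTorch sketch of a CTC loss computation; the number of classes, the number of frames and the target length are toy values chosen for illustration, not taken from any of the cited systems.

```python
import torch
import torch.nn as nn

# Toy setup: 5 character classes plus the CTC blank (index 0).
num_classes = 6
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Per-frame log-probabilities for one line image: (T=40 frames, N=1 sample, C=6 classes).
log_probs = torch.randn(40, 1, num_classes, requires_grad=True).log_softmax(dim=2)

# Ground-truth transcription of 12 character labels (values 1..5, 0 is reserved for blank).
targets = torch.randint(1, num_classes, (1, 12))
input_lengths = torch.tensor([40])   # number of frames per sample
target_lengths = torch.tensor([12])  # number of characters per sample

# CTC marginalizes over every alignment of the 12 labels onto the 40 frames,
# so the frame and label sequences may have different and variable lengths.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```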

In addition to these works, one can notice some recent trends around the segmentation stage with two distinct approaches: the first one comes from object detection methods and the second one is based on start-of-line prediction.

[Carbonell2019, Carbonell2020, Chung2020] focus on models that follow the object detection approach: they are based on the prediction of word (or line) bounding boxes. They use a Region Proposal Network (RPN) combined with a non-maximum suppression process and Region Of Interest (ROI) pooling to obtain bounding boxes for each word or line of the input image. OCR is then applied on those boxes to predict the output text. In [Carbonell2019, Carbonell2020], the authors propose such a model and introduce a multi-task end-to-end architecture working at word level. In [Chung2020], the focus is on a modular approach. The authors propose a pipeline with four modules: a passage identification module finds the handwritten text areas, then an object-detection-based word-level segmentation is applied and words are merged into lines afterwards. Finally, OCR is used on those lines, combined with a language model.

Other works focus on predicting start-of-line coordinates and heights. In [Moysset2017], a CNN+MDLSTM is used to predict start-of-line references and an MDLSTM is used for text line recognition with a dedicated end-of-line token. This is particularly useful in the context of multi-column texts. In [Wigington2018, Wigington2019], a VGG-11-based CNN is used as a start-of-line predictor. Then, a recurrent process predicts the next position based on the current one until the end of the line, generating a normalized line. A CNN+BLSTM is finally used as the OCR on those lines. Some examples with start-of-line, line segmentation and line transcription labels are needed to pretrain the different subnetworks individually; the network is then able to work with transcription labels only. [Wigington2019] is similar to [Wigington2018] but can handle transcriptions without line breaks.

2.2 Segmentation-free approaches

Unlike explicit line segmentation approaches, segmentation-free approaches output the transcription result of a whole paragraph, performing recognition without a prior segmentation step. Among the segmentation-free approaches, one can notice two trends: a first one based on recurrent attention mechanisms and a second one exploiting the two-dimensional nature of the task in a one-step, non-recurrent process.

To our knowledge, [Bluche2016, Bluche2017] are the only works applying attention mechanisms to the task of multi-line text recognition, achieving a kind of implicit line/character segmentation. These works propose encoder-decoder architectures with attention blocks at line and character levels, respectively. Both architectures use a CNN+MDLSTM encoder and an MDLSTM-based attention module. The encoder produces feature maps from the input image while the attention module recurrently generates line or character representations by applying a weighted sum between the attention weights and the features. Finally, the decoder outputs character probabilities from this representation. These architectures require pretraining on line-level images, but they do not need line breaks to be included in the transcription labels.

Only two other works, [Schall2018] and [Yousef2020], present segmentation-free approaches, focusing on the two-dimensional aspect of the task to output predictions without a recurrent process. In [Schall2018], the authors propose a two-dimensional version of the CTC: the Multi-Dimensional Connectionist Classification (MDCC). Using a Conditional Random Field (CRF), ground truth transcription sequences are converted into a two-dimensional model (a 2D CRF) able to represent multiple lines. A line separator label is introduced in addition to the CTC blank label. The CRF graph enables jumping from one line to the next, whatever the position in the current line. As with the CTC, repeated labels count only once. An MDLSTM-based network is used to generate probabilities in two dimensions, preserving the spatial nature of the input image. Pretraining is also used beforehand, this time at word level, with the standard CTC.

Finally, in [Yousef2020] the authors focus on learning to unfold the input paragraph image, i.e., to transform it into a single text line. The system is trained to concatenate the text lines into a single large line before character recognition takes place. This is mainly carried out with bilinear interpolation layers combined with an FCN encoder. This transformation network enables the use of the standard CTC loss and processes the image in a single step. Moreover, it needs neither pretraining nor line breaks in the transcriptions, and it achieves state-of-the-art results.

The approaches proposed in [Schall2018] and [Bluche2017] are the first attempts reported in the literature towards end-to-end handwritten paragraph recognition, but they have remained below the state-of-the-art results on the IAM dataset [IAM]. Both [Bluche2016] and [Yousef2020] present interesting segmentation-free approaches reaching competitive results.

The recurrent process of [Bluche2016] enables modeling dependencies between lines, but it is computationally expensive due to the MDLSTM layers, which can lead to high training and prediction times. The fully convolutional model of [Yousef2020] allows highly parallel computation, but its recurrence-free process cannot model dependencies between lines. In addition, it involves a large number of parameters, around 16.4 million.

In this work, we suggest combining the advantages of both approaches to design a fast, efficient and lightweight end-to-end model for paragraph recognition. The proposed model is based on an encoder-decoder architecture using attention, as in [Bluche2016], but we use an FCN encoder and an attention module without recurrent layers so as to reduce the computation time while keeping the number of parameters low.

3 Architecture

We propose an end-to-end model, called the Vertical Attention Network (VAN), following the encoder-decoder principle with an attention module. The overall model is presented in Figure 1. We wanted the encoder stage to be modular enough to be plugged into different architectures dedicated to the recognition of text lines, paragraphs or documents without any adaptation. To this end, we chose a Fully Convolutional Network encoder that can deal with input images of variable height and width, possibly containing multiple lines of text. The attention module is the main control block: it recurrently produces attention weights that determine where to select the features for the recognition of the next text line. It also detects the end of the paragraph, which ends the whole process. Finally, the decoder stage produces character probabilities for each frame of the line features. We now describe each module in detail.

Figure 1: Architecture overview.

3.1 Encoder

The encoder can be seen as an FCN feature extractor. It takes as input an image of size H × W × C, H, W and C being respectively the height, the width and the number of channels (C=1 for a grayscale image, C=3 for an RGB image). It outputs feature maps f of size H_f × W_f × C_f, with H_f = H/32 and W_f = W/8.

As shown in Figure 2(a), the encoder is made up of a large composition of Convolution Blocks (CB) and Depthwise Separable Convolution Blocks (DSCB) in order to enlarge the context that serves the final decision layer. Indeed, the receptive field is 961 pixels in height and 337 pixels in width.

CB and DSCB include Diffused Mix Dropout (DMD) layers which correspond to a new dropout strategy we propose. This strategy is detailed in section 3.4.

A CB is a succession of two convolutional layers, followed by an Instance Normalization layer. A third convolutional layer is applied at the end of this block. Each convolutional layer uses 3×3 kernels and is followed by a ReLU activation function; zero padding is introduced to remove the kernel edge effect. While the first two convolutional layers have a 1×1 stride, the third one has a 1×1 stride for CB_1, a 2×2 stride for CB_2 to CB_4 and a 2×1 stride for CB_5 and CB_6. This divides the height by 32 and the width by 8, so as to reduce the memory requirement. DMD is applied at three possible locations, just after the activation of each of the three convolutional layers. A representation of such a Convolution Block is shown in Figure 2(b).

DSCB is presented in Figure 2(c); it differs from CB in two respects. On the one hand, the standard convolutional layers are replaced by Depthwise Separable Convolutions (DSC) [DSC]. The aim is to reduce the number of parameters while keeping the same level of performance. On the other hand, the third convolutional layer now has a fixed 1×1 stride. In this way, the shape is preserved until the last layer of the encoder.

Residual connections with an element-wise sum operator are used between blocks whenever possible, that is to say when the shape is unchanged. This strengthens the parameter updates of the first layers of the network during back-propagation.
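As an illustration of the blocks described above, here is a minimal PyTorch sketch of a Convolution Block; the channel sizes, the dropout placement and the exact ordering of operations are simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Sketch of a CB: two stride-1 convolutions, instance normalization,
    then a strided third convolution that reduces the spatial resolution."""

    def __init__(self, in_c, out_c, third_stride=(2, 2)):
        super().__init__()
        self.conv1 = nn.Conv2d(in_c, out_c, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(out_c, out_c, kernel_size=3, stride=1, padding=1)
        self.norm = nn.InstanceNorm2d(out_c)
        # The third convolution uses a (2, 2), (2, 1) or (1, 1) stride depending on the block.
        self.conv3 = nn.Conv2d(out_c, out_c, kernel_size=3, stride=third_stride, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.conv1(x))   # a Diffused Mix Dropout layer may be applied here
        x = self.act(self.conv2(x))   # ... or here
        x = self.norm(x)
        x = self.act(self.conv3(x))   # ... or here (one location is chosen per execution)
        return x

# Example: a (1, 1, 480, 800) grayscale paragraph image is halved in height and width.
x = torch.zeros(1, 1, 480, 800)
print(ConvBlock(1, 32)(x).shape)  # torch.Size([1, 32, 240, 400])
```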

(a) FCN Encoder overview.
(b) Convolution Block (CB) definition.
(c) Depthwise Separable Convolution Block (DSCB) definition.
Figure 2: FCN Encoder definition. 2(a) presents an overview while 2(b) and 2(c) respectively detail the Convolution Block and the Depthwise Separable Convolution Block. The specified dimensions relate to the output of the corresponding layer.

3.2 Attention

The purpose of the attention module is to recursively select, in the feature maps f, the features that represent the current line of text. Therefore, it implicitly learns how to segment a document into lines in the latent space. The attention module relies on soft attention, as proposed by Bahdanau in [Bahdanau2014], and more specifically on hybrid attention: the attention weights are computed from both content and location information. Since our goal is to generate line features iteratively, the attention only operates along the vertical axis.

The attention module is thus organized in the following way: it takes as inputs the feature maps f, the decoder state h_{t-1} from the previous time step and the previous attention weights α_{t-1}, and it predicts the current line features l_t as well as the probability p_t of continuing or stopping the line prediction process.

3.2.1 Line features prediction

The attention module first begins with an initialization phase:

  • Given that the features width is variable due to the nature of the encoder, an AdaptiveMaxPooling layer is used to get a fixed width of 100. Then, a densely connected layer collapses the horizontal dimension. The remaining vertical representation is denoted f'.

  • The initial attention weights α_0 are set to zero.

Then, for each time step, the attention process performs the following steps (biases are omitted in the formulas for readability).

  • We define a coverage vector c_t keeping track of all previous attention weights, clamped between 0 and 1:

    c_t = clamp(Σ_{k=1}^{t-1} α_k, 0, 1)

  • The coverage vector c_t and the previous attention weights α_{t-1} are concatenated into a single tensor gathering all the needed location information.

  • A specific Attention Convolution Block (ACB) is applied to this location tensor. The ACB consists of a one-dimensional convolutional layer with a kernel size of 15, a stride of 1 and zero padding, followed by an instance normalization layer.

  • The attention weights α_t are then computed by applying a softmax over the vertical axis to scores that combine, through densely connected layers, the collapsed features f', the location information produced by the ACB and the previous decoder hidden state h_{t-1}. This multi-scale information contains both local and global information. Indeed, information from the features and the previous attention weights can be considered as local since it is position-dependent, while information from the decoder hidden state can be seen as global since it relates to all the predictions made so far.

  • Finally, the line features are computed as a weighted sum between the original features and the attention weights:

    l_t = Σ_{i=1}^{H_f} α_t,i · f_i

    We emphasize that, although the attention module produces a vertical focus, the line recognizer keeps a broad view of the input signal due to the size of the receptive field. This makes the method robust to inclined or non-straight lines. A code sketch of this attention step is given after this list.
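The sketch below illustrates one step of this vertical hybrid attention in PyTorch; the channel sizes, the exact projections and the way the three information sources are combined are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerticalAttention(nn.Module):
    """One step of vertical hybrid attention (a sketch; sizes are assumptions)."""

    def __init__(self, c_feat=256, c_dec=256, c_attn=16):
        super().__init__()
        # Collapse the horizontal axis of the features: fixed width 100, then a dense layer.
        self.h_pool = nn.AdaptiveMaxPool2d((None, 100))
        self.h_collapse = nn.Linear(100, 1)
        # Attention Convolution Block: 1D conv (kernel 15) over the vertical axis + instance norm.
        self.acb = nn.Sequential(
            nn.Conv1d(2, c_attn, kernel_size=15, padding=7),
            nn.InstanceNorm1d(c_attn),
        )
        # Dense projections of the (local) features, (local) location info and (global) decoder state.
        self.proj_feat = nn.Linear(c_feat, c_attn)
        self.proj_dec = nn.Linear(c_dec, c_attn)
        self.score = nn.Linear(c_attn, 1)

    def forward(self, f, alpha_prev, coverage, h_dec):
        # f: (B, C_f, H_f, W_f); alpha_prev, coverage: (B, H_f); h_dec: (B, C_dec)
        f_vert = self.h_collapse(self.h_pool(f)).squeeze(-1)        # (B, C_f, H_f)
        loc = torch.stack([coverage, alpha_prev], dim=1)            # (B, 2, H_f)
        loc = self.acb(loc)                                         # (B, c_attn, H_f)
        e = torch.tanh(self.proj_feat(f_vert.transpose(1, 2))       # (B, H_f, c_attn)
                       + loc.transpose(1, 2)
                       + self.proj_dec(h_dec).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=1)         # (B, H_f)
        # Weighted sum over the vertical axis -> line features of shape (B, C_f, W_f).
        line_feat = (f * alpha[:, None, :, None]).sum(dim=2)
        coverage = torch.clamp(coverage + alpha, 0, 1)
        return line_feat, alpha, coverage

# Toy usage: features of shape (B=1, C_f=256, H_f=15, W_f=100).
f = torch.zeros(1, 256, 15, 100)
attn = VerticalAttention()
line_feat, alpha, cov = attn(f, torch.zeros(1, 15), torch.zeros(1, 15), torch.zeros(1, 256))
print(line_feat.shape, alpha.shape)  # torch.Size([1, 256, 100]) torch.Size([1, 15])
```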

3.2.2 End-of-paragraph detection

The nature of the task leads to an unpredictable number of text line predictions: we do not know, at prediction time, the number of lines in the paragraph image. The simplest way to handle this problem is to define a constant number of iterations, large enough to cover the maximum number of lines of a paragraph in the dataset, as in [Bluche2016]. The model then always iterates this fixed number of times and stops. This will be referred to as the fixed-stop approach.

A more elegant way to solve this problem is to learn when to stop predicting a new line transcription, i.e., to detect when the whole paragraph has been processed. This will be called the learned-stop approach. This end-of-paragraph detection is performed at each iteration by computing the probability p_t of stopping or continuing the prediction. This probability is computed from both the decoder state and the multi-scale information. The goal is to gather the essential information into a single one-dimensional tensor. In a first step, the multi-scale information is summed up into a one-dimensional tensor in the following way:

  • A one-dimensional convolutional layer with a kernel size of 5, a stride of 1 and zero padding is applied on the multi-scale information.

  • AdaptiveMaxPooling is used to reduce the height to a fixed value of 15.

  • A densely connected layer collapses the vertical dimension, leading to a one-dimensional tensor.

Then, the produced tensor is concatenated with the decoder hidden state.

Finally, p_t is obtained by applying a densely connected decision layer, followed by a softmax activation, to this concatenated tensor.
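A minimal sketch of this decision branch is given below; the channel sizes and the use of a two-class softmax are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StopDecision(nn.Module):
    """Sketch of the learned-stop decision head (channel sizes are assumptions)."""

    def __init__(self, c_attn=16, c_dec=256):
        super().__init__()
        self.conv = nn.Conv1d(c_attn, c_attn, kernel_size=5, padding=2)  # over the vertical axis
        self.pool = nn.AdaptiveMaxPool1d(15)            # fixed vertical size of 15
        self.collapse = nn.Linear(15, 1)                # collapse the vertical dimension
        self.decision = nn.Linear(c_attn + c_dec, 2)    # continue / stop logits

    def forward(self, multi_scale, h_dec):
        # multi_scale: (B, c_attn, H_f); h_dec: (B, c_dec)
        m = self.pool(self.conv(multi_scale))           # (B, c_attn, 15)
        m = self.collapse(m).squeeze(-1)                # (B, c_attn)
        logits = self.decision(torch.cat([m, h_dec], dim=1))
        return torch.softmax(logits, dim=1)             # probabilities [continue, stop]
```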

Training and prediction processes related to both stopping strategies are discussed below in section 5.1.2.

3.3 Decoder

The decoder is made up of a single LSTM layer followed by a one-dimensional convolutional layer projecting its output onto N+1 channels, N being the size of the character set. This last layer produces a posteriori probabilities of each character and of the CTC blank label for each frame of the line features. Hidden states are initialized with zeros.
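The following sketch shows a decoder of this kind in PyTorch; the hidden size, the way the LSTM state is carried over between lines and the 1×1 kernel of the projection are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: one LSTM over the frames of the line features,
    then a 1x1 convolution producing N+1 per-frame class probabilities."""

    def __init__(self, c_feat=256, n_chars=79):
        super().__init__()
        self.lstm = nn.LSTM(c_feat, c_feat, batch_first=True)
        self.proj = nn.Conv1d(c_feat, n_chars + 1, kernel_size=1)  # +1 for the CTC blank

    def forward(self, line_feat, state=None):
        # line_feat: (B, C_f, W_f) -> the LSTM expects (B, W_f, C_f)
        out, state = self.lstm(line_feat.transpose(1, 2), state)
        log_probs = self.proj(out.transpose(1, 2)).log_softmax(dim=1)  # (B, N+1, W_f)
        return log_probs, state
```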

3.4 Diffused Mix Dropout

Dropout is commonly used as a regularizer during training to avoid over-fitting. We introduce Diffused Mix Dropout, a new dropout strategy. Mix Dropout aims at taking advantage of the two main modes proposed in the literature, namely standard dropout [Dropout] and spatial (or 2d) dropout [SpatialDropout]. A Mix Dropout (MD) layer randomly applies one of these two dropout modes, thus benefiting from both implementations within a single layer. Diffused Mix Dropout (DMD) consists in applying an MD layer at a single location, randomly chosen among a set of pre-selected locations. In the model, we use DMD with dropout probabilities of 0.5 and 0.25 for the standard and 2d modes respectively, both modes being equally likely at each execution. The benefit of this dropout strategy is discussed in Section 5.4 through experiments.
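The sketch below illustrates the idea of Mix Dropout and of the diffuse option in PyTorch; the block structure and the choice of locations are illustrative assumptions, not the exact implementation of the paper.

```python
import random
import torch
import torch.nn as nn

class MixDropout(nn.Module):
    """Randomly applies either standard dropout or 2D (spatial) dropout (a sketch)."""

    def __init__(self, p_standard=0.5, p_2d=0.25):
        super().__init__()
        self.dropout = nn.Dropout(p_standard)
        self.dropout2d = nn.Dropout2d(p_2d)

    def forward(self, x):
        if random.random() < 0.5:   # both modes are equally likely
            return self.dropout(x)
        return self.dropout2d(x)

class BlockWithDMD(nn.Module):
    """Illustration of the diffuse option: a single MixDropout, applied after one
    randomly chosen convolution of the block at each forward pass (a sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        self.md = MixDropout()

    def forward(self, x):
        drop_at = random.randrange(3)   # pre-selected locations: after each of the 3 convs
        for i, conv in enumerate(self.convs):
            x = torch.relu(conv(x))
            if self.training and i == drop_at:
                x = self.md(x)
        return x
```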

4 Experimental conditions

This section is dedicated to the presentation of the experimental conditions: datasets, preprocessing, data augmentation strategy, computed metrics and the training details are described.

4.1 Datasets

In this paper, we show that the Vertical Attention Network outperforms state-of-the-art results on three popular handwriting datasets, regardless of whether it is compared to single line or whole paragraph approaches: RIMES [RIMES], IAM [IAM] and READ 2016 [READ2016].

4.1.1 Rimes

RIMES is a popular handwriting dataset composed of gray-scale images of French handwritten text produced in the context of mail-writing scenarios. The images have a resolution of 300 dpi. In the official split, there are 1,500 pages for training and 100 pages for evaluation. To be comparable with other works, we kept the last 100 training images for validation, as usually done. Segmentation and transcription are provided at paragraph, line and word levels. We used the first two segmentation levels in this work.

4.1.2 Iam

We used the IAM handwriting dataset, which is made of handwritten copies of text passages extracted from the LOB corpus. It corresponds to gray-scale images of English handwriting with a resolution of 300 dpi. This dataset provides segmentation at page, paragraph, line and word levels with the corresponding transcriptions. In this work, we used the line and paragraph levels with the commonly used but unofficial split detailed in Table I.

4.1.3 Read 2016

READ 2016 was proposed in the ICFHR 2016 competition on handwritten text recognition. This dataset is composed of a subset of the Ratsprotokolle collection used in the READ project. Images are in color and represent Early Modern German handwriting. READ 2016 provides segmentation at page, paragraph and line levels. We ignored lines with a null transcription in the ground truth, leading to a small difference with the official split. We removed the character "" from the ground truth since it is not a real character.

Dataset     Level      Training  Validation  Test   Charset size
RIMES       Line       10,532    801         778    100
            Paragraph  1,400     100         100
IAM         Line       6,482     976         2,915  79
            Paragraph  747       116         336
READ 2016   Line       8,349     1,040       1,138  89
            Paragraph  1,584     179         197
Table I: Dataset splits into training, validation and test sets, and the number of characters in each alphabet.

Sample images from these datasets are shown in Figure 3. As we can see, the number of lines per paragraph can vary a lot: from 2 to 18 for RIMES, from 2 to 13 for IAM and from 1 to 26 for READ 2016. RIMES exhibits more layout variability than the other datasets: a single paragraph image can contain multiple indents and variable interline heights. The IAM layout is more structured and regular. READ 2016 stands apart since its paragraph images can contain only the page number, a few words or a large number of lines. In addition, these images are noisier, in particular because of bleed-through from handwriting on the back of the pages. In Section 5.1.1, we show that the VAN is robust enough to handle such different datasets without adapting the model to each of them.

Figure 3: Left to right: Images from the IAM, RIMES and READ 2016 datasets at paragraph level.

4.2 Preprocessing

The following preprocessing is applied identically to every dataset: we downscale the input images by a factor of 2 through bilinear interpolation. We thus work with images at a resolution of 150 dpi for IAM and RIMES, for example. For the VAN, we zero-pad the input images to reach a minimum height of 480 px and a minimum width of 800 px when necessary. This ensures that the minimum feature width is 100 and the minimum feature height is 15, as required by the model and described previously.
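A minimal sketch of this preprocessing is given below; padding to the bottom and right sides is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def preprocess(image, min_h=480, min_w=800):
    """Sketch of the preprocessing described above: downscale by a factor of 2 with
    bilinear interpolation, then zero-pad to a minimum of 480 x 800 pixels."""
    # image: (C, H, W) float tensor
    x = F.interpolate(image.unsqueeze(0), scale_factor=0.5,
                      mode="bilinear", align_corners=False)
    _, _, h, w = x.shape
    pad_h, pad_w = max(0, min_h - h), max(0, min_w - w)
    # F.pad takes (left, right, top, bottom) for the last two dimensions.
    x = F.pad(x, (0, pad_w, 0, pad_h), value=0.0)
    return x.squeeze(0)

# Example: a 700 x 1200 grayscale page is downscaled then padded to the minimum size.
print(preprocess(torch.rand(1, 700, 1200)).shape)  # torch.Size([1, 480, 800])
```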

4.3 Data Augmentation

In order to reduce over-fitting and to make the model more robust to fluctuations, we set up a data augmentation strategy applied at training time only. We used the following augmentation techniques: resolution modification, perspective transformation, elastic distortion, random projective transformation (from [Yousef2020]), dilation and erosion, brightness and contrast adjustment, and sign flipping. Each transformation has a probability of 0.2 of being applied. They are applied in the given order and can be combined, except for perspective transformation, elastic distortion and random projective transformation, which are mutually exclusive.
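The sketch below illustrates this augmentation policy (a per-transform probability of 0.2 and the mutual exclusion of the geometric transforms); the transform functions themselves are placeholders, not actual implementations.

```python
import random

def augment(image, transforms, exclusive_group, p=0.2):
    """Apply each transform with probability p, in order; at most one transform of
    the mutually exclusive geometric group is allowed per image (a sketch)."""
    allowed_geometric = random.choice(exclusive_group)
    for t in transforms:
        if t in exclusive_group and t is not allowed_geometric:
            continue  # skip the other mutually exclusive geometric transforms
        if random.random() < p:
            image = t(image)
    return image

# Toy usage with identity placeholders standing in for the real transforms.
resolution, perspective, elastic, projective, dilation = [(lambda x: x) for _ in range(5)]
augmented = augment("image", [resolution, perspective, elastic, projective, dilation],
                    exclusive_group=[perspective, elastic, projective])
```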

4.4 Metrics

In order to evaluate the quality of the recognition, we use the Character Error Rate (CER) and the Word Error Rate (WER). Both are based on the Levenshtein distance d between the ground truth y and the prediction ŷ. To avoid errors in the shortest lines having more impact on the metric than errors in the longest ones, we normalize by the total length of the ground truth:

CER = ( Σ_{i=1}^{K} d(ŷ_i, y_i) ) / ( Σ_{i=1}^{K} |y_i| )

where K is the number of images in the dataset. The WER formula is exactly the same, but at word level instead of character level.
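As an illustration, the sketch below computes the CER and WER in this way; it relies on the third-party editdistance package for the Levenshtein distance, which is a tooling assumption rather than what the authors used.

```python
import editdistance  # pip install editdistance

def cer(predictions, ground_truths):
    """CER as described above: total edit distance divided by the total
    ground-truth length, so that long lines weigh more than short ones."""
    total_dist = sum(editdistance.eval(p, g) for p, g in zip(predictions, ground_truths))
    total_len = sum(len(g) for g in ground_truths)
    return total_dist / total_len

def wer(predictions, ground_truths):
    """Same normalization, but computed on word sequences instead of characters."""
    return cer([p.split() for p in predictions], [g.split() for g in ground_truths])

print(cer(["helo wrld"], ["hello world"]))  # 2 edits / 11 characters ~ 0.18
```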

For the evaluation of the segmentation task presented as a comparative approach, we use two metrics: IoU and mAP. The segmentation is applied at pixel level with two classes, namely text and background. The IoU is defined as the intersection of text-classified pixels divided by their union, between the ground truth and the prediction. We compute the global IoU over a set of images by weighting each image IoU by its number of pixels. We compute the mAP of an image as the average of the AP computed for IoU thresholds between 50% and 95%, with a step of 5%. Image mAPs are weighted by the number of pixels of the images to give the global mAP of a set.

Other metrics such as the number of parameters, the training time or the prediction time are useful to compare models. In the following experiments, models are trained for two days. The training time is computed as the time needed to reach 90% of the convergence; this is a more relevant value since tiny fluctuations can occur after numerous epochs.

4.5 Training details

We used the PyTorch framework with the apex package to enable mixed-precision training, thus reducing the memory consumption. We used the Adam optimizer with the same initial learning rate for all experiments. Trainings are performed on a single Tesla V100 GPU (32 GB). Models have been trained with a mini-batch size of 16 for the line-level model and a mini-batch size of 8 for the segmentation model and for the VAN.

4.6 Additional information

  • We use best path decoding to get the final predictions from the character probabilities lattice.

  • We do not use any language model, lexicon constraint nor any other post-processing.

  • We use exactly the same hyperparameters from one dataset to another. Moreover, the models are not adapted to each dataset: the last layer is the only difference, since the datasets do not have the same character set size.

5 Experiments

In a first part, we evaluate the Vertical Attention Network for paragraph recognition and show that it outperforms state-of-the-art results on each dataset. Then, we compare the VAN with a standard two-step approach (line segmentation followed by recognition) and demonstrate its superiority on many criteria. We also show the results of the FCN encoder applied to isolated text lines to see the positive contribution of the VAN compared to the single line recognition approach. Finally, we highlight the beneficial effect of our new dropout strategy compared to standard ones.

5.1 Paragraph-level experiments with the Vertical Attention Network

In this section, we first provide a comparison with the state of the art on the RIMES, IAM and READ 2016 datasets. Then, we present experiments conducted on the IAM paragraph dataset with the VAN to study different training strategies.

5.1.1 Comparison with state-of-the-art paragraph-level approaches

The results presented in this section are given for the VAN, with pretraining and using the learned-stop strategy. The following comparisons are made with approaches under similar conditions, i.e. without the use of a language model and at paragraph level.

Comparative results with state-of-the-art approaches on the RIMES dataset are given in Table II. The VAN outperforms other approaches on the test set with a CER of 1.90% and a WER of 8.83%.

Architecture                      CER (%)  WER (%)  CER (%)  WER (%)  # Param.
                                  valid    valid    test     test
[Bluche2016] CNN+MDLSTM*          2.5      12.0     2.9      12.6
[Wigington2018] RPN+CNN+BLSTM                       2.1      9.3
Ours                              1.74     8.72     1.90     8.83     2.7 M
*: with line-level attention
Table II: Recognition results of the VAN and comparison with paragraph-level state-of-the-art approaches on the RIMES dataset.

Table III shows the results compared to the state of the art on the IAM dataset. As one can see, the VAN again achieves new state-of-the-art results, with a CER of 4.32% and a WER of 16.24% on the test set. One can also notice that we use more than 6 times fewer parameters than [Yousef2020]: 2.7 M compared to 16.4 M.

Architecture                        CER (%)  WER (%)  CER (%)  WER (%)  # Param.
                                    valid    valid    test     test
[Bluche2017] CNN+MDLSTM*                              16.2
[Bluche2016] CNN+MDLSTM**           4.9      17.1     7.9      24.6
[Carbonell2019] RPN+CNN+BLSTM (a)                     13.8     15.6
[Chung2020] RPN+CNN+BLSTM                             8.5
[Wigington2018] RPN+CNN+BLSTM                         6.4      23.2
[Yousef2020] GFCN                                     4.7               16.4 M
Ours                                3.04     12.69    4.32     16.24    2.7 M
* with character-level attention
** with line-level attention
(a) Results are given at page level.
Table III: Comparison of the VAN with state-of-the-art approaches at paragraph level on the IAM dataset.

To our knowledge, there are no results reported in the literature on the READ 2016 dataset at paragraph or page level. The recognition results are presented in Table IV. Our approach reaches a CER of 3.63% and a WER of 16.75%.

Architecture  CER (%)     WER (%)     CER (%)  WER (%)  # Param.
              validation  validation  test     test
Ours          3.75        18.61       3.63     16.75    2.7 M
Table IV: VAN results on the READ 2016 dataset at paragraph level.

Figure 4 details the processing of an image from the RIMES validation set with a complex layout. The images, from top to bottom, represent the attention weights of the 5 iterations, each one predicting a line. The intensity of the weights is encoded by the transparency of the red color. Given that attention weights are only computed along the vertical axis, the intensity is the same for all pixels at the same height. The attention weights are rescaled to fit the original image height; indeed, they are originally computed at the feature height, which is 32 times smaller. Line text predictions are given for each iteration below the image. As one can see, the VAN has learned the reading order, from top to bottom: the attention weights clearly focus on the text lines following this reading order. They focus mainly on one line of features, with smaller weights on the adjacent lines. One can note that, sometimes, the focus is not perfectly centered on the text line. This may be due to the rescaling, but it can also be normal behaviour given the large size of the receptive field, which allows the model to handle slightly inclined lines. The second iteration demonstrates this phenomenon very well, with only one prediction error on an inclined line. Furthermore, one can notice that the attention is less sharp when the layout is more complex between two successive lines, as in the third image, but this does not disturb the prediction process.

Je me permet de vous écrire cas je vMux
augmentes mes quantités de CD vierges
j’ai commandé 50 CD et aduellement je voudrai
en commander 100.
Merci d’avance
Figure 4: Attention weights visualization on a sample of the RIMES validation set. Transcription predictions are given for each line and errors are shown in bold

As we have seen, the proposed Vertical Attention Network achieves new state-of-the-art results on three different datasets without any specific adaptation of the architecture or the training hyperparameters for each dataset. Moreover, it is robust to the variability of line location and inclination in the input image.

5.1.2 Learning when to stop

In this experiment, we compare the two stopping strategies mentioned in Section 3.2.2, namely fixed-stop and learned-stop approaches.

The fixed-stop approach consists in iteratively predicting a fixed number of lines during the prediction phase. This number is chosen arbitrarily, large enough to match the datasets used. During training, the loss is defined as:

L_fixed = Σ_{t=1}^{L} L_CTC(ŷ_t, y_t)

i.e., the sum of the CTC losses between the ground truth of the L lines of the image and the L text line predictions. Since paragraphs in the same mini-batch do not have the same number of lines, the additional predictions are aligned with null transcriptions. This enables the network to learn to predict only blanks once all the lines of the image have been processed, the goal being to avoid predicting the same line more than once.

The learned-stop approach implies another way to stop the prediction process: the model involves an additional module whose purpose is to determine, at each iteration, whether the image has been fully processed or not, and therefore whether to keep predicting or not. This leads to the addition of a cross-entropy loss L_CE, applied to the decision probabilities p_t, on top of the CTC loss. The corresponding ground truth δ_t is deduced from the line breaks. The final loss is then:

L = Σ_t ( L_CTC(ŷ_t, y_t) + λ · L_CE(p_t, δ_t) )

where λ is set to 1.

Fixed-stop and learned-stop training and prediction processes are respectively detailed in Algorithm 1 and Algorithm 2; the steps that are specific to the learned-stop approach are marked as such.

input: image X, ground truth transcription y made of L lines
f ← encoder(X)
α_0 ← 0 ; h_0 ← 0 ; loss ← 0
for t = 1 to L do
       l_t, α_t, p_t ← attention(f, α_{t-1}, h_{t-1})
       probs_t, h_t ← decoder(l_t, h_{t-1})
       loss ← loss + CTC(probs_t, y_t)
       if t < L then δ_t ← continue else δ_t ← stop        (learned-stop only)
       loss ← loss + λ · CE(p_t, δ_t)                      (learned-stop only)
backward(loss)
Algorithm 1: Training process.
input: image X
f ← encoder(X)
α_0 ← 0 ; h_0 ← 0 ; t ← 0 ; text ← « »
p_0 ← continue
while p_t = continue do          (fixed-stop: while t < fixed number of iterations)
       t ← t + 1
       l_t, α_t, p_t ← attention(f, α_{t-1}, h_{t-1})
       probs_t, h_t ← decoder(l_t, h_{t-1})
       text ← text + decode(probs_t)
return text
Algorithm 2: Prediction process.

Comparative results between these two methods are given in Table V. For both approaches, the encoder and the convolutional layer of the decoder are pretrained on isolated line images of the same dataset, i.e., their weights are initialized with the corresponding weights learned with a text line recognition model, as in the experiments reported in Section 5.3. To evaluate the efficiency of the learned-stop approach, we define mean_diff as the average of the absolute differences between the actual number of lines L_i in an image and the number of predicted lines L̂_i. For the K images of the dataset:

mean_diff = (1/K) Σ_{i=1}^{K} | L_i − L̂_i |

As we can see, we reach rather equivalent CER (4.35% and 4.32%) and WER (16.48% and 16.24%) for both stopping strategies. Moreover, the prediction time is not impacted since data formatting, tensor initialization and encoder-related computations take much longer than the recurrent process, which is made up of only a few layers at each iteration.

Stop method CER (%) WER (%) Train. time Pred. time mean_diff
Fixed 4.35 16.48 0.70 d 36 ms 21.32
Learned 4.32 16.24 0.57 d 38 ms 0.05
Table V: Comparison between fixed-stop and learned-stop methods with the VAN on the test set of the IAM dataset.

In terms of training time, as noticeable in Figure 5, the convergence is a bit faster for the learned-stop method, which is guided by the additional cross-entropy loss. However, both approaches eventually reach similar CTC loss values, reflected by their almost identical CER.

Figure 5: CTC training loss curves comparison for the VAN for both fixed-stop and learned-stop approaches on the IAM dataset.

Figure 6 compares the evolution of mean_diff for both approaches on the validation set of the IAM dataset. As we can see, the model using the learned-stop method quickly learns to detect when it has predicted all the lines of the image. The fixed-stop approach leads to a plateau, by definition. This results in a mean_diff of 0.05 for the learned-stop approach and of 21.32 for the fixed-stop approach on the test set of IAM.

Figure 6: Comparison of the evolution of the mean difference between the actual number of lines in the image and the number of predicted lines on the validation set of IAM dataset for both fixed-stop and learned-stop approaches.

A simple trick that can be applied to the fixed-stop method is to stop the prediction of lines when only blank labels are predicted, which means that all lines have been predicted. It also prevents the model from going over an already processed line again after some iterations.

With the learned-stop approach, the model successfully learns both tasks: it predicts text line transcriptions with a state-of-the-art CER of 4.32% and it determines when to stop with high precision, since the mean_diff on the test set is only 0.05. This means that, on average, one line is omitted or processed twice every 20 paragraph images.

5.1.3 Impact of pretraining

In this experiment, we show the impact of the pretraining on the VAN performance. For this purpose, we train the model with two configurations: in the first one, the model is trained from scratch and, in the second one, the model is initialized with pretrained weights for the encoder and for the convolutional layer of the decoder. Both trainings follow the learned-stop approach.

Figure 7 shows the evolution of the CTC loss during training, with and without pretraining, on the IAM dataset. We can clearly notice that, as expected, the model trained from scratch takes much longer to converge. Indeed, in this case, each module has to learn both the reading order and the optical model of the characters, whereas the pretrained model simply has to learn the reading order. It has to be noted that training from scratch remains feasible in a reasonable training time; the model converges in about 2 days.

Figure 7: CTC training loss curves comparison for the VAN, with and without pretraining, on the IAM dataset.

Results on the IAM test set for this experiment are given in Table VI. The model learned from scratch did not completely converge at the end of the 2-day training resulting in a higher CER of 6.96% compared to 4.32% for the pretrained model.

Pretraining  CER (%)  WER (%)  Training time
No           6.96     25.39    1.77 d
Yes          4.32     16.24    0.57 d
Table VI: Impact of pretraining on the VAN. Results are given on the test set of the IAM dataset.

This experiment shows that the VAN can be trained from scratch, without using any segmentation label. However, this comes at the cost of a longer training time and leads to an increase of about 2.6 points of CER, which nevertheless remains competitive with the state of the art.

So far, we have provided strong results in favor of the Vertical Attention Network by achieving state-of-the-art performance on three datasets. We have also provided arguments and experimental results showing that the Vertical Attention Network can be trained from scratch, without the need for any segmentation label at all. In the following section, we compare the Vertical Attention Network more thoroughly with a standard two-step approach.

5.2 Comparison with the standard two-step approach

In this section, we compare the VAN with the standard two-step approach on the IAM dataset. In this respect, we introduce two new models: a first model performs line segmentation and a second one is in charge of OCR at line level.

The line segmentation model follows a U-net-shaped architecture and is based on our FCN encoder. The encoder feature maps are successively upsampled to match the feature map shapes of CB_5, CB_4, CB_3, CB_2 and CB_1 (Figure 2(a)). This upsampling process is handled by Upsampling Blocks (UB). A UB consists of a DSC layer followed by a transposed DSC layer and an instance normalization layer; it also includes DMD layers. Each UB output is concatenated with the feature maps of its corresponding CB. A final convolutional layer outputs two feature maps, classifying each pixel of the original image as text or background.

The OCR model for text line images is illustrated in Figure 8. It is made up of the FCN encoder, followed by an AdaptiveMaxPooling layer that collapses the vertical dimension. A final convolutional layer predicts the probabilities of the characters and of the CTC blank label.

Figure 8: Line text recognition architecture overview.
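A minimal sketch of the recognition head of this line-level model is given below; the number of feature channels is an assumption.

```python
import torch
import torch.nn as nn

class LineOCRHead(nn.Module):
    """Sketch of the line-level recognition head: collapse the vertical axis of the
    encoder features, then predict per-frame character + blank probabilities."""

    def __init__(self, c_feat=256, n_chars=79):
        super().__init__()
        self.v_collapse = nn.AdaptiveMaxPool2d((1, None))          # height -> 1
        self.proj = nn.Conv1d(c_feat, n_chars + 1, kernel_size=1)  # +1 for the CTC blank

    def forward(self, f):
        # f: (B, C_f, H_f, W_f) encoder features of a line image
        x = self.v_collapse(f).squeeze(2)          # (B, C_f, W_f)
        return self.proj(x).log_softmax(dim=1)     # (B, N+1, W_f) for the CTC loss
```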

For the segmentation part, training is carried out at pixel level, with the ground truth derived from line bounding boxes, using the cross-entropy loss. The OCR model is trained with the CTC loss. We used a mini-batch size of 8 for the segmentation task and of 16 for the OCR.

We now detail the two steps of this approach. In a first step, paragraphs are segmented into lines:

  • Ground truth bounding boxes are modified in order to avoid overlaps: their height is divided by 2.

  • A paragraph image is given as input to the network.

  • A 2-class pixel segmentation (text or background) is output by the network.

  • Adjacent text-labeled pixels are grouped into contours.

  • Bounding boxes are created as the smallest rectangles containing each contour; their height is then multiplied by 2 (see the sketch after this list).

  • The input image is cropped using those bounding boxes to generate the line images.
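A minimal OpenCV-based sketch of these post-processing steps is given below; the contour extraction flags and the clamping details are assumptions made for illustration.

```python
import cv2
import numpy as np

def extract_line_images(paragraph_img, text_mask):
    """Sketch of the steps listed above: group text-labeled pixels into contours,
    build bounding boxes, restore their height (divided by 2 in the ground truth),
    and crop the input image."""
    mask = (text_mask > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Undo the height division used to avoid overlapping ground-truth boxes.
        cy = y + h / 2
        h = h * 2
        boxes.append((x, int(cy - h / 2), w, int(h)))
    # Order the lines from top to bottom before cropping.
    boxes.sort(key=lambda b: b[1])
    return [paragraph_img[max(0, y):y + h, x:x + w] for (x, y, w, h) in boxes]
```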

In a second step, the OCR model, trained on the IAM dataset at line level, is applied on those predicted lines. The segmented lines are ordered by their vertical position (from top to bottom) and line predictions are concatenated to compute the CER and the WER at paragraph level.

The performances of both tasks, taken separately and together, are shown in Table VII. As one can see, the results are good for each task taken separately: we get an IoU of 81.51% and a mAP of 85.09% for the segmentation task, and a CER of 4.95% and a WER of 18.73% for the OCR. However, when the output of the segmentation is used as input for the OCR, the CER more than doubles. Indeed, line segmentation errors induce prediction errors of the OCR.

Architecture IoU (%) mAP (%) CER (%) WER (%) # Param.
Line seg. model 81.51 85.09 1.8 M
OCR on lines 4.95 18.73 1.7 M
Two-step approach 81.51 85.09 10.20 34.36 1.8+1.7 M
Table VII: Results of the two-step approach on the test set of IAM.

We can now compare the Vertical Attention Network to the two-step approach; the comparison on the IAM test set is summarized in Table VIII. First, one can notice that the VAN reaches a far better CER: 4.32% compared to 10.20%. The prediction time is computed as the average time needed to process a paragraph image of the test set. As one can see, even though the segmentation step involves no recurrence, the two-step approach requires much more prediction time due to the formatting of the OCR input, including the extraction of the bounding boxes from the original images; moreover, it accumulates the prediction times of the two models involved. Despite its recurrent process, the total prediction time of the VAN is shorter than that of the two-step approach since one iteration is very fast. Moreover, the VAN can be trained with or without line-level pretraining, the latter case requiring only paragraph transcriptions. In addition, it involves fewer parameters, notably because the two-step approach requires two models. Except for the training time, which is a bit higher for the VAN, it only provides advantages compared to the two-step approach.

Architecture CER (%) WER (%) # Param. Training Prediction
time time
Two-step 10.2 34.36 1.8+1.7 M 0.03 + 0.59 d 749 + 28 ms
VAN 4.32 16.24 2.7 M 0.59 + 0.57 d 38 ms
Table VIII: Comparison of the two-step approach with the Vertical Attention Network, results are given for the test set of the IAM dataset.

5.3 Line level analysis

In this section, we compare the results obtained with the text line OCR model (Figure 8) to the state-of-the-art approaches evaluated in similar conditions i.e. at line level and without language model. The aim is to highlight the contribution of the VAN processing full paragraph images instead of isolated lines. Indeed, both models have the same encoder module.

Table IX shows state-of-the-art results on the RIMES dataset at line level. We report competitive results with a CER of 3.19% on the test set, compared to 2.3% for [Puigcerver2017]. It has to be noted that [Voigtlaender2016] uses a 4-gram word language model (LM) and that [Puigcerver2017] does not use exactly the same dataset split for training and validation. In conclusion, we can highlight the performance of the VAN at paragraph level, which achieves a CER of 1.90% on the test set; this corresponds to a CER decrease of 1.29 points compared to our line-level model and of 0.4 points compared to the best line-level result reported in the literature.

Architecture                     CER (%)  WER (%)  CER (%)  WER (%)  # Param.
                                 valid    valid    test     test
[Voigtlaender2016] 2D-LSTM + LM                    2.8      9.6
[Puigcerver2017] CNN+BLSTM (b)   2.2      9.6      2.3      9.6      9.6 M
Ours (line level)                2.17     8.87     3.19     10.25    1.7 M
Ours (VAN)                       1.74     8.72     1.90     8.83     2.7 M
(b) This work uses a slightly different split (10,203 lines for training, 1,130 for validation and 778 for test).
Table IX: Comparison with the state of the art on the line-level RIMES dataset.

The comparison with state-of-the-art results on the IAM dataset is presented in Table X. [Voigtlaender2016] cannot be directly compared with the other results since it uses a 3-gram word LM and a 10-gram character LM, but it gives an idea of their impact on the performance. Compared to the others, we reach state-of-the-art results with a CER of 4.95% on the test set. [Yousef_line] and [Michael2019] reach similar results, but the former uses a model with a large number of parameters compared to ours and the latter uses a more complex architecture with a recurrent, attention-based process and a hybrid loss. It should be noted, however, that [Puigcerver2017, Michael2019, Yousef_line] use the same dataset split, which is slightly different from ours. Indeed, since our experiments involve pretraining at line level, we could not use that split: some of its training and validation lines are extracted from the same paragraph image, for example. On the IAM dataset, the VAN at paragraph level also reaches a better CER than our line-level model, with 4.32% compared to 4.95%.

Architecture                           CER (%)     WER (%)     CER (%)  WER (%)  # Param.
                                       validation  validation  test     test
[Voigtlaender2016] CNN+MDLSTM + LM     2.4         7.1         3.5      9.3      2.6 M
[Puigcerver2017] CNN+BLSTM (c)         3.8         13.5        5.8      18.4     9.3 M
[Yousef_line] GFCN (c)                 3.3                     4.9               > 10 M
[Michael2019] Seq2seq (CNN+BLSTM) (c)                          4.87
Ours (line level)                      3.32        13.60       4.95     18.73    1.7 M
Ours (VAN)                             3.04        12.69       4.32     16.24    2.7 M
(c) These works use a slightly different split (6,161 lines for training, 966 for validation and 2,915 for test).
Table X: Comparison with the state of the art at line level on the IAM dataset.

The results on the READ 2016 dataset are gathered in Table XI. It should be noted that [READ2016] (RWTH) uses a 10-gram character LM. We reach a state-of-the-art CER of 4.28% on the test set, compared to 4.66% for [Michael2019], which takes second place. As we can see, the FCN OCR performs well even on a dataset more complex than RIMES or IAM. Once more, the paragraph-level VAN exceeds the results of the line-level approaches, with a CER of 3.63%.

Architecture                       CER (%)     WER (%)     CER (%)  WER (%)  # Param.
                                   validation  validation  test     test
[Michael2019] Seq2seq (CNN+BLSTM)                          4.66
[READ2016]* CNN+MDLSTM + LM                                4.8      20.9
[READ2016]** CNN+RNN                                       5.1      21.1
Ours (line level)                  4.45        21.13       4.28     19.71    1.7 M
Ours (VAN)                         3.75        18.61       3.63     16.75    2.7 M
* results from RWTH
** results from BYU
Table XI: Comparison with state-of-the-art line-level recognizers on the READ 2016 dataset.

In conclusion, we can summarize our results by highlighting the superiority of the Vertical Attention Network on the RIMES, IAM and READ 2016 datasets. The Vertical Attention Network achieves better results processing full paragraph images than line-level state-of-the-art approaches; it even outperforms the FCN using the same encoder. Multiple factors can explain this result. Segmentation ground truth annotations of text lines are prone to variations from one annotator to another, bringing a complexity that is not present in the VAN: since the model implicitly learns to segment the lines, it does not have to adapt to pre-formatted lines. It also benefits from a larger context (through its large receptive field) and uses it to focus on the information that is useful for recognition. In addition, the VAN decoder contains an LSTM layer that may have a positive impact by using a large context when producing the output character sequence, acting as a language model.

A key element to reach such results with a deep network is to use efficient regularization strategies. We discuss the new dropout strategy we propose in the following paragraph.

5.4 Dropout strategy

We use dropout to regularize the network training and thus avoid over-fitting. We defined Diffused Mix Dropout (DMD), as presented in Section 3.4, to improve the results of the model. We carried out some experiments to highlight the contribution of DMD over the commonly used standard and 2d dropout layers. The experiments are performed with the line text recognition model on the IAM dataset; results on the test set are shown in Table XII. The columns from left to right correspond respectively to the number of dropout layers per block (CB and DSCB), the type of dropout layer used, the associated dropout probabilities, the use of the diffuse option (applying only one, or all, of the dropout layers per block at each execution), the CER and the WER.

          #  type      p          diffused  CER (%)  WER (%)
                                            test     test
Baseline  3  mix       0.5/0.25   yes       4.95     18.73
(1)       3  standard  0.5        yes       5.16     19.34
(2)       3  2d        0.25       yes       5.33     20.00
(3)       1  mix       0.5/0.25             5.29     19.97
(4)       1  standard  0.5/0.25             5.51     20.53
(5)       1  2d        0.5/0.25             5.66     21.42
(6)       3  mix       0.5/0.25   no        6.69     23.99
(7)       3  mix       0.16/0.08  no        6.61     23.53
Table XII: Dropout strategy analysis.

In (1) and (2), Mix Dropout layers are respectively replaced by standard and 2d dropout layers, preserving their corresponding dropout probability. Using Mix Dropout leads to an improvement of 0.21 points of CER compared to standard dropout and of 0.38 compared to 2d dropout.

In (3), only one Mix Dropout layer is used, placed after the first convolution of each block, leading to a higher CER, with a difference of 0.34 points. In (4) and (5), we are in the same configuration as (3), i.e., with only one dropout layer per block; Mix Dropout is replaced by standard dropout in (4) and by 2d dropout in (5), resulting in CER increases of 0.22 and 0.37 points compared to (3). This shows once more the positive impact of Mix Dropout layers in another configuration.

Finally, in (6) and (7), Mix Dropout layers are applied at all three positions at each execution, contrary to the baseline, which uses only one dropout layer per execution. While (6) keeps the same dropout probabilities, (7) divides them by 3. In both cases, the associated CERs are higher than the baseline, with 6.69% for (6) and 6.61% for (7).

We can conclude that our dropout strategy leads to an improvement of at least 0.56 points of CER compared to (4) and (5), which use neither Mix Dropout nor the diffuse option.

5.5 Discussion

As we have seen, the VAN outperforms the state of the art on multiple datasets at paragraph level without adjusting any hyperparameter for each of them. Moreover, it takes inputs of variable sizes, so it could handle whole page images without any modification. The CER could also easily be lowered by using a standard n-gram language model, as for traditional text line recognition models. Thus, this network should be considered for processing single-column text documents. As a matter of fact, as for [Bluche2016] and presumably [Yousef2020], the model is designed for, and limited to, single-column multi-line text documents with relatively horizontal text lines. The next step would be to focus on processing images with more complex layouts, such as multi-column text images.

6 Conclusion

In this paper, we proposed the Vertical Attention Network: a novel end-to-end, segmentation-free, encoder-decoder architecture using hybrid attention. It is able to handle paragraph images of variable sizes, and we demonstrated its efficiency on the RIMES, IAM and READ 2016 datasets, where it outperforms both paragraph-level and line-level state-of-the-art approaches. Its implicit line segmentation process enables the recognition of complex layouts, including slightly inclined lines. In addition, it does not need pretraining and could easily be used for whole-page text recognition. Unlike standard two-step architectures, the Vertical Attention Network can be trained without any line segmentation label. The resulting unified model finally reaches a better CER than the two-step approach, with a shorter prediction time and a lighter architecture in terms of parameters. The proposed new dropout strategy, based on Diffused Mix Dropout layers, leads to an improvement of 0.56 points of CER for our model at line level.

Acknowledgments

The present work was performed using computing resources of CRIANN (Regional HPC Center, Normandy, France). This work was financially supported by the French Defense Innovation Agency and by the Normandy region.

References