Automatic document understanding, and more specifically the layout analysis of historical documents, is still an active area of research. This task consists of splitting a document into different regions according to their content. It can be very challenging due to the variety of documents. In this paper, we focus on text line segmentation in historical documents.
Some recent works rely on pre-trained weights. Pre-training has shown many advantages, such as decreasing the training time and improving the model’s accuracy. However, these weights are often learned on natural scene images (ImageNet dataset) and then applied to document images. Since document images differ considerably from natural scene images, in this paper we question the benefit of such a pre-training stage.
Our main contributions are as follows. We propose Doc-UFCN, a U-shaped Fully Convolutional Network for text line detection. We show that this model outperforms a state-of-the-art method (dhSegment) while having fewer parameters and a reduced prediction time. In addition, we show that pre-training on document images instead of natural scene images can improve the results even with little training data.
II Related works
In recent years, interest in the analysis of historical documents has been boosted by competitions on text line detection, baseline detection, layout analysis and writer identification. Several models and systems have successfully tackled historical document analysis problems such as text line segmentation or layout analysis.
Oliveira et al. recently proposed a convolutional neural network with an encoder pre-trained on ImageNet. This method has shown promising results on various tasks with little training data, and the training time is significantly reduced thanks to the pre-trained encoder. This model differs from Doc-UFCN since its encoder follows the ResNet-50 architecture and is pre-trained on natural scene images. Our encoder is smaller than dhSegment’s, has far fewer parameters and is fully trained on document images. However, both models have similar decoders that reuse the encoding feature maps during the decoding step.
Barakat et al. proposed a fully convolutional network for text line detection. Their architecture consists of successive convolution and pooling layers in the encoding stage and upsampling layers in the decoding stage. Unlike us, they only use low-level feature maps during the decoding step, upsampling them many times before combining them. This architecture has shown good results on Arabic handwritten pages but requires binarized input images. Mechi et al. presented an adaptive U-Net architecture for text line segmentation. Their encoder also consists of convolution and pooling layers. During the decoding step, successive standard convolutions, transposed convolutions and sigmoid layers are applied. For our model, we also chose to use standard convolutions followed by transposed convolutions for the upsampling step. This yields the same resolution at both ends of the network.
Grüning et al. proposed a more complex two-stage architecture to detect baselines in historical documents. First, a hierarchical neural network (ARU-Net) is applied to detect the text lines. This ARU-Net is an extended version of the U-Net architecture. A spatial attention network is incorporated to deal with the various font sizes found in pages. In addition, they added residual blocks to the U-Net architecture, which makes it possible to train deeper neural networks while reaching higher results. Second, they apply successive steps to cluster superpixels and build baselines.
Yang et al. designed a multimodal fully convolutional network for layout analysis. They take advantage of the text content as well as the visual appearance to extract the semantic structures of document images. This method has shown high Intersection-over-Union scores but requires more complex data annotations. Indeed, for each document image, a pixel-wise labeled image as well as the textual content are needed. Our model is based on the core architecture of this network. The dilated convolutions in the encoder provide broader context information and more accurate results. Renton et al. also demonstrated the advantages of using such convolutions instead of standard ones. Their fully convolutional network is composed of successive dilated convolutions that increase the receptive field. They are followed by one last standard convolution that outputs the labeled images.
Finally, Moysset et al. proposed a recurrent neural network to segment text paragraphs into text lines. This network differs from the systems presented above and from ours because it has recurrent layers. It also differs in that ground truth lines are not represented as bounding boxes: the paragraph itself is represented as a succession of line and interline labels.
Our goal is to analyze the impact of the pre-training step on the line segmentation task. To this end, the proposed architectures are analyzed with and without pre-training. In this section, we detail two state-of-the-art architectures: dhSegment and Yang’s. We then present our model Doc-UFCN, which is inspired by the core architecture proposed by Yang, and give the implementation details.
III-A Comparison of architectures
In this section, we detail the two architectures, dhSegment and Yang’s, and explain the choices we made to design our own model.
dhSegment is the state-of-the-art method for multiple layout analysis tasks on historical documents. It has shown various advantages, such as working with little training data and a reduced training time. In addition, the code to train and test the model is open-source (https://github.com/dhlab-epfl/dhSegment), so the model can easily be trained under the same conditions as ours for a fair comparison.
dhSegment’s architecture is presented in Figure 1(a). This model is deeper than Yang’s and can have up to 2048 feature maps. The encoder is composed of convolution (light blue and orange in Figure 2a) and pooling layers. This encoder is first pre-trained on natural scene images, and both the encoder and decoder are then trained on document images. The decoder is quite similar to the one used by Yang and consists of successive blocks composed of one standard convolution and one upscaling layer.
III-A2 Yang et al.
Yang’s model is a multimodal fully convolutional network. It takes into account the visual and textual contents for the segmentation task. It has shown good performance on synthetic and real datasets of modern document images. The code to train the model is also open-source (http://personal.psu.edu/xuy111/projects/cvpr2017_doc.html).
Yang’s model is presented in Figure 1(b). It is made of 4 parts: an encoder (red blocks in Figure 1(b)), a first decoder outputting a segmentation mask, a second decoder for the reconstruction task and a bridge (red arrows) used for the textual content. The Text Embedding Map and the bridge are used to encode the textual content of the images and then to add the text information to the visual one before the last convolution. To have a fair comparison with dhSegment, only the visual content is used. Therefore, the Text Embedding Map, the bridge and the second decoder for the reconstruction task are removed.
III-B Description of Doc-UFCN
Recent systems can have long inference times, which can have significant financial and ecological impacts. For instance, dhSegment takes up to 66 days to detect the lines of the whole Balsac corpus (almost 2 million pages) on a GeForce RTX 2070. We therefore want to study the impact of the pre-trained parts on the segmentation results while keeping a small network and a reduced prediction time. To design our model, we chose to use the core of Yang’s network since it has a reduced number of parameters and contains no pre-trained parts. Our architecture is thus a Fully Convolutional Network (FCN) composed of an encoder (red blocks in Figure 1(c)) followed by a decoder (blue blocks) and a final convolution layer. Using an FCN without any dense layer has several advantages. First, it greatly reduces the number of parameters since there are no dense connections. In addition, it allows the network to deal with variable input image sizes and to preserve the spatial information.
To keep a light model, the second decoder used by Yang is not used in our architecture.
III-B1 Contracting path
The contracting path (encoder) consists of 4 dilated blocks. The dilated blocks are slightly different from those presented by Yang et al. since they consist of 5 consecutive dilated convolutions. Using dilated convolutions instead of standard convolutions enlarges the receptive field and gives the network more context information. Each block is followed by a max-pooling layer except for the last one.
III-B2 Expanding path
The goal of the expanding path (decoder) is to reconstruct the input image with a pixel-wise labeling at the original input image resolution. This deconvolution is usually done using transposed convolutions or upscaling. As suggested by Mechi et al., we decided to replace the unpooling layers of Yang’s model by transposed convolutions in order to keep the same resolution for both the input and the output. Therefore, the decoding path is composed of 3 convolutional blocks, each consisting of a standard convolution followed by a transposed convolution. In addition, the features computed during the encoding step are concatenated with those of the decoding stage (purple arrows in Figure 1(c)).
III-B3 Last convolution
III-C Implementation details
We now present the implementation details of our model.
III-C1 Input image size
Since our model is inspired by Yang et al., we decided to use the same input image size. We thus resized the input images and their corresponding label maps into smaller images of size 384×384 px, adding padding to keep the original image ratio. This reduces the training time without losing too much information. We also tested another input size to see the impact of this choice (see Section VI-E).
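As a rough illustration of this resizing step, the sketch below computes the scaled size of a page and the padding needed to fit it into a 384×384 px square while keeping its aspect ratio. The function name `fit_with_padding` and the choice to center the page inside the square are assumptions made for the example; the text only states that padding is added.

```python
def fit_with_padding(width, height, target=384):
    """Return the resized (width, height) and the (left, top, right, bottom)
    padding needed to reach a target x target square.

    Hypothetical helper: centering the page is an assumption, not a detail
    given in the paper.
    """
    # Scale so that the longest side matches the target size.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Distribute the remaining pixels on both sides of the shorter dimension.
    pad_w = target - new_w
    pad_h = target - new_h
    left, top = pad_w // 2, pad_h // 2
    return (new_w, new_h), (left, top, pad_w - left, pad_h - top)


size, padding = fit_with_padding(2000, 3000)
print(size, padding)  # (256, 384) (64, 0, 64, 0)
```

The same scale and offsets would have to be applied to the label maps so that the labels stay aligned with the resized page.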
III-C2 Dilated block
As stated before, all the dilated blocks are composed of 5 consecutive dilated convolutions with dilation rates d = 1, 2, 4, 8 and 16. The blocks respectively have 32, 64, 128 and 256 filters. Each convolution has a 3×3 kernel, a stride of 1 and an adapted padding to keep the same tensor shape throughout the block. All the convolutions of the blocks are followed by a Batch Normalization layer, a ReLU activation and a Dropout layer with a probability p_dilated.
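The "adapted padding" can be made concrete with the standard output-size formula for a dilated convolution. The sketch below is a minimal illustration, not the authors' code: the helper names `conv_output_size` and `same_padding` are hypothetical. It shows that, for a 3×3 kernel with a stride of 1, a padding equal to the dilation rate keeps the spatial size unchanged.

```python
def conv_output_size(n, kernel=3, stride=1, padding=0, dilation=1):
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def same_padding(kernel=3, dilation=1):
    """Padding that preserves the input size for stride 1 and an odd kernel."""
    return dilation * (kernel - 1) // 2


for d in (1, 2, 4, 8, 16):
    p = same_padding(kernel=3, dilation=d)
    # With a 3x3 kernel the required padding equals the dilation rate.
    print(d, p, conv_output_size(384, kernel=3, stride=1, padding=p, dilation=d))
```

Each line printed by the loop reports a padding equal to `d` and an output size of 384, so the five convolutions of a block can be chained without changing the tensor shape.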
III-C3 Convolutional block
The convolutional blocks are used during the decoding step. The expanding path is composed of 3 convolutional blocks and each block is composed of a standard convolution followed by a transposed convolution. The blocks respectively have 128, 64 and 32 filters. Each standard convolution has a 3×3 kernel, a stride of 1 and a padding of 1. Each transposed convolution has a 2×2 kernel and a stride of 2. As for the dilated blocks, all the standard and transposed convolutions are followed by a Batch Normalization layer, a ReLU activation and a Dropout layer with a probability p_conv.
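To see why three transposed convolutions with a 2×2 kernel and a stride of 2 restore the input resolution, the sketch below traces the spatial size through the network. It assumes that each of the three max-pooling layers halves the resolution (2×2 pooling with a stride of 2), which the text does not state explicitly; the helper names are hypothetical.

```python
def transposed_conv_output_size(n, kernel=2, stride=2, padding=0,
                                output_padding=0, dilation=1):
    """Output size of a transposed convolution along one spatial dimension."""
    return ((n - 1) * stride - 2 * padding
            + dilation * (kernel - 1) + output_padding + 1)


def resolution_trace(size=384, n_pool=3, n_up=3):
    """Spatial size after each pooling and each transposed convolution."""
    trace = [size]
    for _ in range(n_pool):
        size //= 2  # assumption: 2x2 max-pooling with a stride of 2
        trace.append(size)
    for _ in range(n_up):
        size = transposed_conv_output_size(size)  # 2x2 kernel, stride 2
        trace.append(size)
    return trace


print(resolution_trace())  # [384, 192, 96, 48, 96, 192, 384]
```

Under this assumption, each decoder stage matches the resolution of one encoder stage, which is what makes the concatenation of encoder and decoder features possible.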
III-C4 Last convolution
The last convolution layer is parametrized as follows: c (number of classes) filters, a 3×3 kernel, a stride of 1 and a padding of 1. It is followed by a softmax layer that computes the class conditional probabilities of each pixel.
As a post-processing step, we apply the same operations as those applied by dhSegment: pixels with a confidence score higher than a threshold t are kept and connected components with fewer than min_cc pixels are removed.
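The post-processing described above can be sketched in pure Python as follows. This is an illustrative implementation rather than the dhSegment code: the function name `post_process`, the use of 8-connectivity and the list-of-lists representation are assumptions.

```python
from collections import deque


def post_process(probs, t=0.7, min_cc=50):
    """Threshold a probability map and drop small connected components.

    probs is a 2D list of text line probabilities. The returned 2D list of
    booleans marks the pixels kept as text line. 8-connectivity is an
    assumption of this sketch.
    """
    h, w = len(probs), len(probs[0])
    mask = [[probs[y][x] > t for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            # Breadth-first search to collect one connected component.
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Remove components smaller than min_cc pixels.
            if len(component) < min_cc:
                for cy, cx in component:
                    mask[cy][cx] = False
    return mask
```

In practice an optimized labeling routine from an image processing library would be preferable on full-size pages; the loop above only makes the two operations explicit.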
The models have been tested on 4 datasets for the line segmentation task. Table I summarizes these datasets.
| Dataset | Mean IoU (%) | Precision (%) | Recall (%) | F1-score (%) |
The Balsac dataset consists of 913 images extracted from 74 registers selected among 44742 registers in total. The images represent pages of acts written in French and are annotated at line level. Two example images are shown in Figure 3.
The Horae dataset consists of 557 annotated pages of books of hours. These pages have been selected among 500 manuscripts as they represent the variety of layouts and contents. The pages have been annotated at different levels and with various classes such as simple initials, decorated initials or ornamentations. Figure 1 shows two annotated pages for text line segmentation selected from two different manuscripts.
The READ-BAD dataset is composed of 2036 annotated archival images of documents and has been used during the cBAD: ICDAR2017 competition. The images have been extracted from 9 archives and the dataset is split into Simple and Complex subsets. Each image has its corresponding ground truth in PAGE XML format. For the line segmentation task, we used the bounding boxes of the TextLine objects as labels.
The last dataset, DIVA-HisDB, contains 120 annotated pages extracted from 3 different manuscripts. Each manuscript has 20 training, 10 validation and 10 testing images.
V Comparison to state-of-the-art
We applied Doc-UFCN to the 4 datasets. In addition, we trained the dhSegment architecture under the same conditions. In the following, we only compare our model to dhSegment since our architecture is too similar to Yang’s for a comparison to be meaningful. This section details the training procedures and shows the results obtained.
Our model is implemented in PyTorch. We trained it with an initial learning rate of 5e-3, the Adam optimizer and the cross entropy loss. The weights are initialized using Glorot initialization. In addition, we used mini-batches of size 4 to reduce the training time. We tested different dropout probabilities and decided to keep the model with p_dilated = p_conv = 0.4 since it yielded the highest average performance on the validation set. The model is trained for a maximum of 200 epochs and early stopping is used to stop training when the model converges. In the end, we keep the model with the lowest validation loss.
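The model selection rule described above (early stopping, then keeping the weights with the lowest validation loss) can be sketched on a list of recorded validation losses. The `patience` value is hypothetical, since the text does not give the stopping criterion, and `select_best_epoch` is a helper written for this example only.

```python
def select_best_epoch(val_losses, patience=10, max_epochs=200):
    """Return (epoch, loss) of the lowest validation loss seen before
    early stopping is triggered.

    patience is an assumed hyper-parameter: training stops once the
    validation loss has not improved for that many epochs.
    """
    best_epoch, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            # New best model: this is the checkpoint that would be saved.
            best_epoch, best_loss = epoch, loss
        elif best_epoch is not None and epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss


losses = [1.0, 0.8, 0.9, 0.7, 0.75, 0.72, 0.71, 0.6]
print(select_best_epoch(losses, patience=3))   # (3, 0.7)
print(select_best_epoch(losses, patience=10))  # (7, 0.6)
```

The two calls show the trade-off: a short patience stops at epoch 3 and never reaches the better model found at epoch 7.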
We also trained dhSegment on our data with the same splits for a maximum of 60 epochs since the model is pre-trained and converges faster than ours. We used mini-batches of size 4 and trained on patches of shape 400×400 px. The initial learning rate is 5e-5 and we chose to use a ResNet50 as pre-trained encoder. Early stopping is also used and the best model obtained during training is selected.
Both models have the same post-processing step with the same hyper-parameters. After testing thresholds within a range from 0.5 to 0.9, we kept t = 0.7 since it shows the best results on the validation set, allowing the expected pixels to be predicted as text lines and rejecting those belonging to the background. Lastly, the small connected components with fewer than min_cc = 50 pixels are discarded. Several values have also been tested for this parameter, but it had little impact on the results.
We trained the two networks on the four datasets and now report the scores obtained by both of them. Most of the existing methods are evaluated using the Intersection-over-Union (IoU) metric. The IoU measures the average similarity between the predicted and the ground truth pixels. Alberti et al. designed a tool to evaluate the performance of a model by calculating the IoU, precision, recall and F-measure. It gives more information about the model’s performance at pixel level than the IoU alone.
Therefore, to evaluate the models, we computed various pixel level metrics. We first report the Intersection-over-Union (IoU) as well as the Precision (P), Recall (R) and F1-score (F) in Table II. To be comparable, the images predicted by dhSegment are resized to 384×384 px before computing the metrics. In addition, the values are only presented for the text line class (the background is not considered here).
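For reference, the four pixel-level metrics can be computed for the text line class from a predicted mask and a ground truth mask as sketched below. This is a simplified single-image version written for illustration, not the evaluation tool of Alberti et al.; the function name `pixel_metrics` is an assumption.

```python
def pixel_metrics(pred, truth):
    """IoU, precision, recall and F1-score of the text line class.

    pred and truth are 2D lists of 0/1 (or booleans) of the same shape,
    where 1 marks a text line pixel.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # text line pixel correctly predicted
            elif p:
                fp += 1  # background predicted as text line
            elif t:
                fn += 1  # text line pixel missed
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(pixel_metrics(pred, truth))
```

In this toy example there are 2 true positives, 1 false positive and 1 false negative, giving an IoU of 0.5 and a precision, recall and F1-score of 2/3 each.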
The results obtained by our method are often better than those obtained by dhSegment. On the Balsac dataset, our model outperforms dhSegment by up to 6 percentage points for the F1-score metric. This is due to a better separation of close text lines, which are often predicted as one single line by dhSegment. Our model separates these lines where dhSegment fails. It also produces smoother and more accurate contours.
So far, our model has shown better performance than dhSegment while having no pre-trained encoder. Another interesting point is that our model is much lighter than dhSegment. It has only 4.1M parameters to be learned whereas dhSegment has 32.8M parameters, including 9.36M that have to be fully trained. This leads to a reduced prediction time. Indeed, our model is up to 16 times faster than dhSegment, as shown in Table III.
| Dataset | Mean prediction time | Ratio |
Predictions made on a GeForce RTX 2070 8G GPU.
| Dataset | Mean IoU (%) | Precision (%) | Recall (%) | F1-score (%) |
We have shown that pre-training on natural scene images is not required to obtain good results on document images. It is sometimes even worse than using a different model without any pre-trained part. We now want to see whether pre-training on document images instead of natural scene images can have a positive impact on the performance. Therefore, in addition to the previous experiments, we trained dhSegment and our model on a mixture of all the datasets presented before. This dataset is denoted in the following as the Multiple document dataset. The resulting split is shown in Table V.
These generic models have then been tested on each dataset. The results are reported in Table IV. We also fine-tuned the models on each single dataset. To do so, we continued the training of our model for 80 epochs and of dhSegment for 40 epochs.
Without any fine-tuning, our architecture is almost always better than dhSegment’s. Our architecture has a lower precision, indicating that the model sometimes labels background pixels as text line pixels. However, its recall is higher than dhSegment’s, which indicates that more of the expected text line pixels are found. This matters more for us since it means that fewer characters are missed. Figure 4 shows the results of the two models for an image from the Horae dataset.
Fine-tuning on each single dataset is not required to get good results with either model. With our architecture, only the model trained on Balsac benefited from this fine-tuning. For the three other datasets, fine-tuning did not improve the results since the best model obtained remains the one before re-training.
These results show that our model is better than dhSegment whatever the dataset, with and without fine-tuning. Adding this pre-training step to our model has improved the results, mostly on the READ datasets. This impact is smaller on Balsac, mainly because this dataset represents 43 % of the Multiple document dataset. DIVA-HisDB is also less impacted by the pre-training. This is due to the small quantity of training data it has and the high complexity of the pages.
VI Ablation study
We ran additional experiments with our model in order to see the impact of some components as well as external factors such as the size of the training set or the input image size. Table VI summarizes the results obtained and the next sections describe the tested configurations.
| Dataset | Version | IoU (%) | P (%) | R (%) | F (%) |
| | BN + Drop1 | 83.40 | 94.16 | 87.95 | 90.87 |
| | BN + Drop2 | 84.33 | 92.49 | 90.49 | 91.42 |
| | BN + Drop1 | 63.98 | 84.60 | 74.76 | 83.17 |
| | BN + Drop2 | 63.95 | 78.38 | 80.45 | 84.93 |
| | BN + Drop1 | 66.34 | 81.64 | 79.14 | 78.08 |
| | BN + Drop2 | 64.03 | 81.76 | 75.60 | 76.66 |
| | BN + Drop1 | 51.87 | 86.73 | 56.58 | 68.74 |
| | BN + Drop2 | 54.40 | 83.62 | 61.97 | 73.16 |
| | BN + Drop1 | 74.24 | 91.35 | 79.81 | 85.09 |
| | BN + Drop2 | 75.71 | 92.14 | 80.88 | 86.09 |
VI-A Batch Normalization
As stated in the paper that introduced it, Batch Normalization has a great impact on the convergence speed during training but can also impact the results. Indeed, our model converged more than twice as fast with Batch Normalization. In addition, as shown in Table VI, Batch Normalization has a real impact on the F1-score, in particular for Horae, READ-Complex and DIVA-HisDB. Beyond the quantitative results, we observed that the visual results are also improved with Batch Normalization. It helps to separate close regions but also to join regions that would otherwise be separated. In addition, the contours of the predicted regions are often more accurate and smoother.
We tested two configurations with dropout layers. The first one (Drop1) consists of applying a dropout with p_dilated = p_conv = 0.4 only after the dilated blocks. The second one (Drop2) consists of applying the same dropout after every convolution of the model and not only after the last one of the dilated blocks. The application of dropout layers has a positive impact on the performance most of the time. Even though the first configuration gives better results on the Horae and READ-Simple datasets, the impact is greater with the second configuration.
To implement the model, we chose to use a modified version of the dilated block proposed by Yang et al. to have more context information to predict the text lines. To justify our choice of dilation rates, we tested 4 different configurations on the Balsac dataset. We tested blocks with only one convolution and a dilation rate of 1 (1) and blocks with a dilation rate of 16 (16). We also tested blocks with 5 convolutions with different rates (1, 1, 1, 1, 1 and 1, 2, 4, 8, 16). The results obtained are presented in Table VII.
| Dilation | IoU (%) | P (%) | R (%) | F (%) |
| 1, 1, 1, 1, 1 | 79.93 | 92.02 | 85.77 | 88.57 |
| 1, 2, 4, 8, 16 | 83.79 | 94.80 | 87.86 | 91.11 |
The results with the last configuration are better than any of the others since the receptive field is much larger and the model has more context to predict the text lines. Figure 5 shows the receptive field growth through the network. The receptive field with the dilation rate (16) corresponds to the one of Yang’s model since the dilated convolutions are not successive. Using dilated convolutions instead of standard ones strongly impacts the receptive field size (1000 pixels instead of 200), which gives the model more context to predict the text lines and yields higher performance.
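The growth of the receptive field can be checked with the usual layer-by-layer recursion. The sketch below covers the encoder only and assumes 2×2 max-pooling with a stride of 2 between blocks, which is not stated in the text; the helper names `receptive_field` and `encoder_layers` are hypothetical.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of layers.

    layers is an iterable of (kernel, stride, dilation) tuples, applied in
    order along one spatial dimension.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between two outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


def encoder_layers(dilations, n_blocks=4):
    """Encoder made of n_blocks dilated blocks of 3x3 convolutions,
    each block except the last being followed by a pooling layer."""
    layers = []
    for block in range(n_blocks):
        layers += [(3, 1, d) for d in dilations]
        if block < n_blocks - 1:
            layers.append((2, 2, 1))  # assumption: 2x2 max-pooling, stride 2
    return layers


print(receptive_field(encoder_layers([1, 2, 4, 8, 16])))  # 938
print(receptive_field(encoder_layers([1, 1, 1, 1, 1])))   # 158
```

Under these assumptions, the encoder alone reaches a receptive field of 938 pixels with the rates 1, 2, 4, 8, 16 against 158 pixels with standard convolutions, which is of the same order as the values quoted above. The exact figures depend on the pooling and decoder layers, which are not modeled here.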
VI-D Training set size
In addition to the ablation study, we analyzed the impact of the training set size on the performance. To do so, we trained our model on 4 subsets of the Balsac training set and report the results in Table VIII.
| Number of images | IoU (%) | P (%) | R (%) | F (%) |
The more training data, the higher the IoU. However, this progression does not have the same effect on the precision metric. The model trained with 365 images even has a higher precision than the one trained with 731 images. Moreover, we see that training on only 90 images (12 % of the training set) gives quite good results, which are even better than those obtained by dhSegment when trained on the whole dataset.
VI-E Input image size
As we wanted to follow the model proposed by Yang et al., we decided to train our models on images resized to 384×384 px. However, we want to see the impact of this choice on our results. Therefore, we trained one model on Balsac and one on DIVA-HisDB with images resized to 768×768 px. Table IX shows that training on larger images slightly improves the results. However, this impact is bigger when the training set contains a lot of images. The Balsac dataset contains 731 training images and is more impacted than the DIVA-HisDB dataset, which contains only 60 training images.
| Dataset | Size | IoU (%) | P (%) | R (%) | F (%) |
In this paper, we introduced a new model, Doc-UFCN, to detect text lines in historical document images. This model takes advantage of a lot of context information thanks to the dilated convolutions, whereas most of the existing methods only use standard ones. Moreover, it does not use any pre-trained weights learned on natural scene images but has shown better performance than the state-of-the-art model.
We showed that there is no need to use heavy pre-trained encoders like ResNet. Using a different architecture like ours can give better results while being lighter than dhSegment, working with fewer training images and having a reduced prediction time. We also showed that pre-training a simple architecture on a few document images improves the line detection. A huge amount of data is not needed to obtain a good pre-trained network.
Future work will consist of evaluating our model on other tasks such as the act segmentation of Balsac pages and the layout analysis of Horae images.
- (2017-11) Open Evaluation Tool for Layout Analysis of Document Images. In 2017 14th IAPR ICDAR, Kyoto, Japan, pp. 43–47.
- (2011-09) Historical document layout analysis competition. In 2011 International Conference on Document Analysis and Recognition, pp. 1516–1520.
- (2018) dhSegment: a generic deep-learning approach for document segmentation. In Frontiers in Handwriting Recognition (ICFHR), 2018 16th International Conference on, pp. 7–12.
- (2018-08) Text line segmentation for challenging handwritten document images using fully convolutional network. In 2018 16th ICFHR, pp. 374–379.
- (2019) HORAE: an annotated dataset of books of hours. In Proceedings of the 5th International Workshop on Historical Document Imaging and Processing, HIP ’19, pp. 7–12.
- (2009-06) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2017-11) cBAD: ICDAR2017 competition on baseline detection. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 01, pp. 1355–1360.
- (2019-09) cBAD: ICDAR2019 competition on baseline detection. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1494–1498.
- (2017) READ-BAD: a new dataset and evaluation scheme for baseline detection in archival documents. CoRR abs/1705.03311.
- (2018) A two-stage method for text line detection in historical documents. CoRR abs/1802.03345.
- (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167.
- (2013-08) ICDAR 2013 competition on writer identification. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1397–1401.
- (2019-09) Text line segmentation in historical document images using an adaptive U-Net architecture. In 2019 ICDAR, pp. 369–374.
- (2015-08) Paragraph text segmentation into lines with recurrent neural networks. In 2015 13th ICDAR, pp. 456–460.
- (2015-08) ICDAR 2015 competition on text line detection in historical documents. In 2015 13th ICDAR, pp. 1171–1175.
- (2018-05) Fully convolutional network with dilated convolutions for handwritten text line segmentation. International Journal on Document Analysis and Recognition (IJDAR).
- (2015) U-Net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597.
- (2016-10) DIVA-HisDB: a precisely annotated large dataset of challenging medieval manuscripts. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 471–476.
- (2017) Learning to extract semantic structure from documents using multimodal fully convolutional neural network. CoRR abs/1706.02337.