Tables are widely present in a great variety of documents such as administrative documents, invoices, scientific papers, reports, or archival documents among others; and, therefore, techniques for table analysis are instrumental to automatically extract relevant information stored in a tabular form from several sources [Couasnon14]. The first step in table analysis is table detection — that is, determining the position of the tables in a document — and such a step is the basis to later determine the internal table structure and, eventually, extract semantics from the table contents [Couasnon14].
Table detection methods for born-digital documents, such as readable PDFs or HTML documents, employ the metadata included in those documents to guide the analysis by means of heuristics [Oro09]. However, table detection in image-based documents, like scanned PDFs or document images, is a more challenging task for three reasons: high intra-class variability (there are several table layouts that, in addition, are highly dependent on the context of the documents); low inter-class variability (other objects that commonly appear in documents, such as figures, graphics or code listings, are similar to tables); and the heterogeneity of document images [Embley06]. These three issues make it difficult to design rules that generalise to a variety of documents, and this has led to the adoption of machine learning techniques and, more recently, deep learning methods.
Currently, deep learning techniques are the state-of-the-art approach to deal with computer vision tasks [Pyimagesearch], and this is also the case for table detection in document images [tablebank, Schreiber17, Kerwat18]. The most accurate models for table detection have been constructed using fine-tuning [Razavian14], a transfer learning technique that consists in re-using a model trained in a source task, where a lot of data is available, in a new target task, where data is usually scarce. In the context of table detection, the fine-tuning approach has been applied because table detection datasets are small and do not contain the number of images required to train deep learning models from scratch. In spite of its success, this approach has the limitation of applying transfer learning from natural images, a domain distant from document images. This makes it necessary to apply techniques, like image transformations [Gilani17], that make document images look like natural images.
In this work, we present the benefits of applying transfer learning for table detection from a close domain thanks to the LaTeX part of the TableBank dataset [tablebank], a dataset that consists of approximately 200K labelled images of academic documents containing tables — a number big enough to train deep learning models from scratch. Namely, the contributions of this work are the following:
We analyse the accuracy of four of the most successful deep learning algorithms for object detection (namely, Mask-RCNN, RetinaNet, SSD and YOLO) in the context of table detection in academic documents using the TableBank dataset.
Moreover, we present a comprehensive study where we compare the effects of fine-tuning table detection models from a distant domain (natural images) and a closer domain (images of academic documents from the TableBank dataset), and demonstrate the advantages of the latter approach. To this aim, we employ the 4 aforementioned object detection architectures and 7 heterogeneous table detection datasets containing a wide variety of document images.
Finally, we show the benefits of using models trained for table detection on document images to detect other objects that commonly appear in document images such as figures and formulas.
As a by-product of this work, we have produced a suite of models that can be employed via a set of Jupyter notebooks (documents for publishing code, results and explanations in a form that is both readable and executable) [jupyter] that can be run online using Google Colaboratory [colab] — a free Jupyter notebook environment that requires no setup and runs entirely in the cloud avoiding the installation of libraries in the local computer. In addition, the code for fine-tuning the models is also freely available. This allows the interested readers to adapt the models generated in this work to detect tables in their own datasets. All the code and models are available at the project webpage https://github.com/holms-ur/fine-tuning.
The rest of this paper is organised as follows. In the next section, we provide a brief overview of the methods employed in the literature to tackle the table detection task. Subsequently, in Section 3, we introduce our approach to train models for table detection using fine-tuning, as well as the setting that we employ to evaluate such an approach. Afterwards, we present the obtained results along with a thorough analysis in Section 4, and the tools that we have developed in Section 5. Finally, the paper ends with some conclusions and further work.
2 Related work
Since the early 1990s, several researchers have tackled the task of table detection in document images using mainly two approaches: rule-based techniques and data-driven methods. The former focus on defining rules to determine the position of lines and text blocks to later detect tabular structures [Hirayama95, Jianying99, Zanibbi04]; whereas the latter employ statistical machine learning techniques, like hidden Markov models [Costa09], a hierarchical representation based on the MXY tree [Cesari02] or feature engineering together with SVMs [Kasar13]. However, both approaches have drawbacks: rule-based methods require the design of handcrafted rules, which do not usually generalise to several kinds of documents; and machine learning methods require manual feature engineering to decide the features of the documents that are fed to machine learning algorithms. These problems have been recently alleviated by using deep learning methods.
Nowadays, deep learning techniques are the state-of-the-art approach to deal with table detection. The reason is twofold: deep learning techniques are robust across different document types; and they do not need handcrafted features, since they automatically learn a hierarchy of relevant features using convolutional neural networks (CNNs) [Goodfellow16]. Initially, hybrid methods combining rules and deep learning models were suggested; for instance, in [Hao16] and [Borges17], CNNs were employed to decide whether regions of an image suggested by a set of rules contained a table. In contrast, the main approach currently followed consists in adapting general deep learning algorithms for object detection to the problem of table detection. Namely, the main algorithm applied in this context is Faster R-CNN [fasterrcnn], which has been directly employed using different backbone architectures [tablebank, Schreiber17, Kerwat18], and combined with deformable CNNs [Siddiqui18] or with image transformations [Gilani17]. Other detection algorithms such as YOLO [yolov3] or SSD [ssd] have also been employed for table detection [Kerwat18, Huang19], but they achieve worse results than the methods based on the Faster R-CNN algorithm. Nevertheless, training deep learning models for table detection is challenging due to the considerable number of images that are necessary for this task: until recently, the biggest dataset of document images containing tables was the Marmot dataset with 2,000 labelled images [marmot], far from the datasets employed by deep learning methods, which consist of several thousands, or even millions, of images [ILSVRC15, pascalvoc].
In order to deal with the problem of limited amount of data, one of the most successful methods applied in the literature is transfer learning [Razavian14], a technique that re-uses a model trained in a source task in a new target task. This is the approach followed in [Schreiber17, Siddiqui18, Gilani17], where they use models trained on natural images to fine-tune their models for table detection. However, transfer learning methods are more effective when there is a close relation between the source and target domains, and, unfortunately, there is only a vague relation between natural images and document images. This issue has been faced, for instance, by applying image transformations to make document images as close as possible to natural images [Gilani17].
Another option to tackle the problem of limited data consists in acquiring and labelling more images, a task that has been undertaken for table detection in the TableBank project [tablebank] — a dataset that consists of 417K labelled images of documents containing tables. The TableBank dataset opens the door to apply transfer learning to not only construct models for table detection in different kinds of documents, but also to detect other objects, such as figures or formulas, that commonly appear in document images. This is the goal of the present work.
3 Materials and methods
In this section, we explain the fine-tuning method, as well as the object detection algorithms, datasets and evaluation metrics used in this work.
3.1 Fine-tuning
Transfer learning allows us to train models using the knowledge learned by other models instead of starting from scratch. The idea on which transfer learning techniques are based is that CNNs are designed to learn a hierarchy of features. Specifically, the lower layers of CNNs focus on generic features, while the final ones focus on specific features for the task they are working with. As explained in [Razavian14], transfer learning can be employed in different ways, and the one employed in this work is known as fine-tuning. In this technique, the weights of a network learned in a source task are employed as a basis to train a model in the destination task. In this way, the information learned in the source task is used in the destination task. This approach is especially beneficial when the source and target tasks are close to each other.
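The weight re-use at the core of fine-tuning can be illustrated with a minimal, framework-free sketch. It is not the code used in this work: the function name `init_for_fine_tuning`, the layer names and the representation of each layer as a flat list of weights are hypothetical choices made only for illustration. Layers whose size matches the source model start from the source weights, whereas the remaining layers (typically the task-specific output layer) are randomly initialised before training on the target task.

```python
import random


def init_for_fine_tuning(source_weights, target_sizes, seed=0):
    """Build the initial weights of a target model from a source model.

    source_weights: dict mapping layer name -> list of floats (hypothetical format).
    target_sizes:   dict mapping layer name -> number of weights in the target model.
    Returns the initial target weights and the names of the re-used layers.
    """
    rng = random.Random(seed)
    target_weights, reused = {}, []
    for name, size in target_sizes.items():
        source = source_weights.get(name)
        if source is not None and len(source) == size:
            # Same layer and same size: start from the knowledge of the source task.
            target_weights[name] = list(source)
            reused.append(name)
        else:
            # New or resized layer (e.g. the output layer): small random values.
            target_weights[name] = [rng.gauss(0.0, 0.01) for _ in range(size)]
    return target_weights, reused


# Toy example: a source model with a 20-class head and a target model
# with a single class (table). All values are made up.
source = {"conv1": [0.5, -0.2, 0.1], "conv2": [0.3, 0.7], "head": [0.9] * 20}
target_sizes = {"conv1": 3, "conv2": 2, "head": 1}
weights, reused = init_for_fine_tuning(source, target_sizes)
print(reused)  # ['conv1', 'conv2']
```

In a real framework the same idea is usually expressed by loading a checkpoint while skipping the layers whose shape depends on the number of classes.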
In our work, we study the effects of fine-tuning table detection algorithms from a distant domain (natural images from the Pascal VOC dataset [pascalvoc]) and a close domain (images of academic documents from the TableBank dataset). To this aim, we consider the following object detection algorithms.
3.2 Object detection algorithms
Object detection algorithms based on deep learning can be divided into two categories [Girs, yolov3]: two-phase algorithms, whose first step is the generation of proposals of “interesting” regions that are classified using CNNs in a second step; and one-phase algorithms, which perform detection without explicitly generating region proposals. For this work, we have employed algorithms of both types. In particular, we have used the two-phase algorithm Mask R-CNN, and the one-phase algorithms RetinaNet, SSD and YOLO.
Mask R-CNN [fasterrcnn] is currently one of the most accurate algorithms based on a two-phase approach, and the latest version of the R-CNN family. We have used a library implemented in Keras [matterport] for training models with this algorithm.
RetinaNet is a one-phase algorithm characterised by using the focal loss, which focuses training on a sparse set of difficult examples and prevents the large number of easy negatives from overwhelming the detector during training. We have used another library implemented in Keras [kerasRetina] for training models with this algorithm.
SSD [ssd] is a simple detection algorithm that completely eliminates proposal generation and encapsulates all computation in a single network. In this case, we have used the MXNet library [MxNET] for training the models.
YOLO [yolov3] frames object detection as a regression problem where a single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Although there are several versions of YOLO, the main ideas are the same for all of them. We have used the Darknet library [yolodarknet] for training models with this algorithm.
The aforementioned algorithms have been trained for detecting tables in a wide variety of document images by using the datasets presented in the following section.
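To make the role of the focal loss in RetinaNet more concrete, the following sketch compares it with the standard cross-entropy for a single binary prediction. It is a scalar illustration, not the implementation of the library employed in this work; the function names are hypothetical, and the default values gamma = 2 and alpha = 0.25 are the ones commonly reported for the focal loss and are assumed here.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for a predicted probability p and a label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (background predicted with high confidence) versus
# a hard negative (background wrongly predicted as an object).
easy_negative = focal_loss(p=0.01, y=0)
hard_negative = focal_loss(p=0.90, y=0)
print(easy_negative, hard_negative)
```

The factor (1 - p_t) ** gamma makes the contribution of well-classified examples almost vanish, so the many easy background regions of a page do not dominate the training signal.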
3.3 Benchmarking datasets
For this project, we have used several datasets, see Table 1. Namely, we have employed three kinds of datasets: the base datasets (which are used to train the base models), the fine-tune datasets for table detection, and the fine-tune dataset for detecting other objects in document images. The reason to consider several table detection datasets is that there are several table layouts that are highly dependent on the document type, and we want to prove that our approach can be generalised to heterogeneous document images.
| Datasets | #Train Images | #Test Images | Type of images |
|---|---|---|---|
| Pascal VOC | 16,551 | 4,952 | Natural images |
| ICDAR13 | 178 | 60 | Documents obtained from Google search |
| UNLV | 302 | 101 | Technical reports, business letters, newspapers and magazines |
3.3.1 Base Datasets
In this work, we have employed two datasets of considerable size for creating the base models that are later employed for fine-tuning.
The Pascal VOC dataset [pascalvoc] is a popular project designed to create and evaluate algorithms for image classification, object detection and segmentation. This dataset consists of natural images which have been used for training different models in the literature. Thanks to the trend of releasing models to the public, we have employed models already trained with this dataset to apply fine-tuning from natural images to the context of table detection.
The TableBank dataset [tablebank] is a table detection dataset built with Word and LaTeX documents that contains 417K labelled images. For this project, we only employ the LaTeX images (199,183 images) since the Word images contain some errors in the annotations. In contrast to the Pascal VOC dataset, for which trained models were already available, we have trained the models for the TableBank dataset from scratch.
3.3.2 Fine-tuning Datasets
We have used several open datasets for fine-tuning; however, most table detection datasets only release the training set. Hence, in this project, we have divided the training sets into two sets (75% for training and 25% for testing) for evaluating our approach. The dataset splits are available on the project webpage, and the employed datasets are listed as follows.
ICDAR13 is one of the most famous datasets for table detection and structure recognition. This dataset is formed by documents extracted from Web pages and email messages. It was prepared for a competition focused on the task of detecting tables, figures and mathematical equations from images. The dataset is comprised of PDF files, which we converted to images to be used within our framework. The dataset contains 238 images in total: 178 were used for training and 60 for testing.
ICDAR17 is, like ICDAR13, a dataset prepared for a competition. The dataset consists of 1,600 images in total, where we can find tables, formulas and figures. The training set consists of 1,200 images, while the remaining 400 images are used for testing. This dataset has been employed three times in our work: for the detection of tables (from now on, we will call this dataset ICDAR17), for the detection of figures (from now on, ICDAR17FIG) and for the detection of formulas (from now on, ICDAR17FOR).
ICDAR19 is, as in the previous cases, a dataset proposed for a competition. The dataset contains two types of images: modern documents and archival ones with various formats. In this work, we have only taken the modern images (797 images in total, 599 for training and 198 for testing).
The invoices dataset is a proprietary collection of PDF files containing invoices from several sources. The PDF files had to be converted into images. This set has 515 images in the training set and 172 in the testing set.
Marmot [marmot] is a dataset that shows a great variety in language type, page layout, and table styles. Over 1,500 conference and journal papers were crawled for this dataset, covering various fields and ranging from publications of 1970 to those of 2011. In total, 2,000 pages in PDF format were collected. The dataset is composed of Chinese pages (from now on MarmotChi) and English pages (from now on MarmotEn). The MarmotChi dataset was built from over 120 e-Books with diverse subject areas provided by the Founder Apabi library, and no more than 15 pages were selected from each book; this dataset contains 993 images in total, of which 744 were used for training and 249 for testing. The MarmotEn dataset was crawled from the Citeseer website; this dataset contains 1,006 images in total, of which 754 were used for training and 252 for testing.
UNLV is comprised of a variety of documents which includes technical reports, business letters, newspapers and magazines. The dataset contains a total of 2,889 scanned documents, of which only 403 contain a tabular region. We only used the images containing a tabular region in our experiments: 302 for training and 101 for testing.
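The 75%/25% division of a released training set can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_dataset`, the random seed, the file names and the rounding rule (truncation) are hypothetical, and the splits actually used in this work are the ones published on the project webpage.

```python
import random


def split_dataset(image_names, train_fraction=0.75, seed=0):
    """Shuffle the image names reproducibly and split them into train and test sets."""
    names = sorted(image_names)            # fixed starting order
    random.Random(seed).shuffle(names)     # reproducible shuffle (assumed seed)
    n_train = int(len(names) * train_fraction)
    return names[:n_train], names[n_train:]


# Toy example with 238 hypothetical file names (the size of ICDAR13).
images = [f"page_{i:04d}.png" for i in range(238)]
train, test = split_dataset(images)
print(len(train), len(test))  # 178 60
```

Fixing the seed and sorting the names beforehand makes the split independent of the order in which the files are listed on disk.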
Using the aforementioned algorithms and datasets, we have trained several models that have been evaluated using the following metrics.
3.4 Performance measure
In order to evaluate the constructed models for the different datasets, we employed the same metric used in the ICDAR19 competition for table detection [icdar19]. Considering that the ground truth bounding box is represented by GTP, and that the bounding box detected by an algorithm is represented by DTP, the overlap between them is measured by the intersection over union (IoU), given by:
$$\mathrm{IoU}(GTP, DTP) = \frac{\mathrm{area}(GTP \cap DTP)}{\mathrm{area}(GTP \cup DTP)}$$
IoU(GTP, DTP) represents the overlapped region between the ground truth and detected bounding boxes, and its value lies between zero and one.
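For axis-aligned bounding boxes, the IoU between a ground truth box and a detected box can be computed as in the following minimal sketch. The box format (x_min, y_min, x_max, y_max) and the function name `iou` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


gtp = (0, 0, 100, 100)  # ground truth table
dtp = (0, 0, 100, 50)   # detection covering only the upper half
print(iou(gtp, dtp))    # 0.5
```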
Now, given an IoU threshold T, we define the notions of True Positive at T, TP@T, False Positive at T, FP@T, and False Negative at T, FN@T. TP@T is the number of ground truth tables that have a major overlap (IoU ≥ T) with one of the detected tables. FP@T is the number of detected tables that do not have such an overlap with any of the ground truth tables. FN@T is the number of ground truth tables that do not have such an overlap with any of the detected tables. From these notions, we can define the Precision at T, P@T, Recall at T, R@T, and F1-score at T, F1@T, as follows:
$$\mathrm{P@T} = \frac{\mathrm{TP@T}}{\mathrm{TP@T} + \mathrm{FP@T}}, \qquad \mathrm{R@T} = \frac{\mathrm{TP@T}}{\mathrm{TP@T} + \mathrm{FN@T}}, \qquad \mathrm{F1@T} = \frac{2 \cdot \mathrm{P@T} \cdot \mathrm{R@T}}{\mathrm{P@T} + \mathrm{R@T}}$$
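The counts TP@T, FP@T and FN@T, and the scores derived from them, can be sketched for a single page as follows. The sketch assumes that the IoU values between every ground truth table and every detected table have already been computed, and it uses a simple greedy one-to-one matching; both the matching strategy and the function name `precision_recall_f1` are assumptions, and the official evaluation tool of the competition may differ in its details.

```python
def precision_recall_f1(iou_matrix, n_detected, threshold):
    """Compute P@T, R@T and F1@T.

    iou_matrix[i][j] is the IoU between ground truth table i and detected table j.
    """
    matched = set()  # detections already assigned to a ground truth table
    tp = 0
    for row in iou_matrix:
        best_j, best_iou = None, threshold
        for j, value in enumerate(row):
            # Keep the best unmatched detection whose IoU reaches the threshold.
            if j not in matched and value >= best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fn = len(iou_matrix) - tp   # ground truth tables without a match
    fp = n_detected - tp        # detections without a match
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: two ground truth tables and three detections (made-up IoU values).
ious = [[0.95, 0.10, 0.00],
        [0.00, 0.65, 0.00]]
scores_at_06 = precision_recall_f1(ious, n_detected=3, threshold=0.6)
scores_at_09 = precision_recall_f1(ious, n_detected=3, threshold=0.9)
```

In the toy example, both tables are found at T = 0.6, whereas at T = 0.9 only the first detection is precise enough to count as a true positive.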
Finally, the final score is given by the weighted average of the F1 values, WAvgF1:
$$\mathrm{WAvgF1} = \frac{\sum_{i} T_i \cdot \mathrm{F1@}T_i}{\sum_{i} T_i}$$
In the above formula, the T_i are the IoU thresholds considered; since results with higher IoUs are more important than those with lower IoUs, we use the IoU threshold as the weight of each F1 value to get a definitive performance score for convenient comparison. Using these metrics, we have obtained the results presented in the following section.
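The weighted average of the F1 values can be sketched as follows. The function name `weighted_average_f1` is hypothetical, the set of thresholds (0.6, 0.7, 0.8 and 0.9) is an assumption, and the F1 values are made up for illustration; they are not results of this work.

```python
def weighted_average_f1(f1_by_threshold):
    """Average of the F1@T values, using each IoU threshold T as its weight."""
    total_weight = sum(f1_by_threshold)  # sum of the thresholds (dict keys)
    weighted_sum = sum(t * f1 for t, f1 in f1_by_threshold.items())
    return weighted_sum / total_weight


# Made-up F1 values for the assumed thresholds.
f1_scores = {0.6: 0.95, 0.7: 0.93, 0.8: 0.90, 0.9: 0.80}
wavg_f1 = weighted_average_f1(f1_scores)
print(round(wavg_f1, 3))  # 0.887
```

Since the weights grow with the threshold, a drop in F1@0.9 lowers the final score more than the same drop in F1@0.6.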
4 Results
In this section, we conduct a thorough study of our approach, see Tables 2–5. Each table corresponds to the results obtained with one object detection algorithm: Table 2 contains the results that have been obtained using Mask R-CNN; Table 3, the results of RetinaNet; Table 4, the results of SSD; and, finally, Table 5 contains the results of YOLO. The tables are divided into three parts: the first row contains the results obtained for the TableBank dataset, the next 9 rows correspond to the results obtained with the models trained by fine-tuning from natural image models, and the last 9 rows correspond to the models fine-tuned from the TableBank models. All the models built in this work were trained using the default parameters of each deep learning framework, and using NVIDIA K80 GPUs.
We start by analysing the results for the TableBank dataset, see the first row of the tables. Each model has its strengths and weaknesses, and depending on the context we might prefer different models. The overall best model, that is, the model with the highest WAvgF1 score, is the Mask R-CNN model; the other three models perform similarly to one another. If we focus on detecting as many tables as possible (R@0.6) and not detecting other artifacts as tables (P@0.6), the best model is YOLO. Finally, if we are interested in accurately detecting the regions of the tables (F1@0.9), the best model is RetinaNet. The strength of the SSD model is that it is faster than the others.
Let us focus now on the table detection datasets. In the case of models fine-tuned using natural images, the algorithms that stand out are RetinaNet and YOLO, see Tables 2 to 5 and Figure 1. Similarly, the models that achieve higher accuracies when fine-tuning from the TableBank dataset are YOLO and RetinaNet, see Figure 2. As can be seen in Tables 2 to 5, fine-tuning from a close domain produces more accurate models than fine-tuning from an unrelated domain. However, the effects on each algorithm and dataset differ greatly. The algorithm that benefits the most for table detection is Mask R-CNN, which improves by up to 60% in some cases and by 42% on average. RetinaNet improves by 11% on average, YOLO by 9%, and SSD shows the least improvement, only 5%.
Finally, if we consider the results for the datasets containing figures and formulas, the improvement is not as remarkable as in the detection of tables. In this case, the algorithm that takes the greatest advantage of this technique is YOLO, since in both cases it improves by up to 10%. RetinaNet improves the detection of formulas by 10%, while that of figures barely improves. SSD and Mask R-CNN show the least improvement, and are even penalised in some cases.
As we have shown in this section, fine-tuning from the TableBank dataset can boost table detection models. However, no single model outperforms the rest, see Figures 1 and 2. Therefore, we have released a set of tools to employ the trained models, and also to construct new models by fine-tuning them on custom datasets.
5 Tools for table detection
Using one of the generated detection models with new images is usually as simple as invoking a command with the path of the image (and, probably, some additional parameters). However, this requires the installation of several libraries and the usage of a command line interface, and this might be challenging for non-expert users. Therefore, it is important to create simple and intuitive interfaces that might be employed by different kinds of users; otherwise, they will not be able to take advantage of the object detection models.
To disseminate our detection models, we have created a set of Jupyter notebooks that allow users to detect tables in their images. Jupyter notebooks [jupyter] are documents for publishing code, results and explanations in a form that is both readable and executable; and they have been widely adopted across multiple disciplines, both for their usefulness in keeping a record of data analyses and for allowing reproducibility. The drawback of Jupyter notebooks is that they require the installation of several libraries. This problem has been overcome in our case by providing our notebooks in Google Colaboratory [colab], a free Jupyter notebook environment that requires no setup and runs entirely in the cloud, avoiding the installation of libraries on the local computer. The notebooks are available at https://github.com/holms-ur/fine-tuning.
In addition, in the same project webpage, due to the heterogeneity of document images containing tables, we have provided all the weights, configuration files, and necessary instructions to fine-tune any of the detection models created in this work to custom datasets containing tables.
6 Conclusion and Further work
In this work, we have shown the benefits of using fine-tuning from a close domain in the context of table detection. In addition to the accuracy improvement, this approach avoids overfitting and solves the problem of having a small dataset. Moreover, we can highlight that apart from the Mask R-CNN algorithm, other algorithms such as YOLO and RetinaNet can achieve a good performance in the table detection task.
Since table detection is the first step towards table analysis, we plan to use this work as a basis for determining the internal structure of tables, and, eventually, extracting the semantics from table contents. Moreover, we are also interested in extending these methods to detect forms in document images.