Line-items and table understanding in structured documents

by Martin Holeček, et al.

Table detection and extraction have been studied in the context of documents like scientific papers, where tables are clearly outlined and stand out from the visual document structure. We study this topic in the more challenging domain of layout-heavy business documents, particularly invoices. Invoices present the novel challenge of tables that often lack outlines, either in the form of borders or surrounding text flow, and that have ragged columns and widely varying data content. We also show that different structural information can be extracted from different table-like structures. We present a comprehensive representation of a page using a graph over word boxes, positional embeddings and trainable textual features, and rephrase table detection as a text box labeling problem. We work on a new dataset of invoices using this representation and propose multiple baselines to solve this labeling problem. We then propose a novel neural network model that achieves strong, practical results on the presented dataset, and analyze the model performance and the effects of graph convolutions and self-attention in detail.


I Introduction

The table detection and table extraction problems were already introduced in the ICDAR 2013 competition, where the goal was to extract cell structure from a dataset of mostly scientific tables [1].

We decided to investigate the problem on invoices, where the aim is different. Even table detection needs to be thought of in the context of document understanding, because invoices are inherently documents with textual information structured into tables. Graphical borders and edges are often present, but they cannot be relied on for detection, because there is no general layout and very often there are no borders at all. Moreover, the target is to detect only the table with the so-called ’line-items’ (detailed items of the total payable amount) and otherwise to extract only specific pieces of information (to find a ’field’). Simply said, not every table inside an invoice should be detected and reported (see the sample invoice in Figure 3).

In commercial applications this problem is usually tackled using a layout system that detects the layout and extracts the table from the position where it usually happens to be, or by employing another classification module, which selects the right table from several proposals. That increases the number of modules in the architecture. Our data consist of tables without clearly defined borders between columns, headers and rows, and therefore the classification needs to be thought of as understanding, because cells with multiple lines of text (aligned like other rows and thus misleading for any heuristic) are present. Our goal is to have a trainable system that could leverage the commonalities present in the data without human support.

II Previous Work

The plethora of methods previously used for the task is hard to summarize or compare, since the algorithms have been evaluated on different datasets and each has its strengths, weak spots and quirks. However, we found none of them well suited for working with invoices, since invoices in general have no fixed layout, language, captions, delimiters or fonts. Invoices vary across countries, companies and departments, and change over time. In order to read an invoice you need to understand it.

In literature there are examples of table detection using heuristics [2], using layouts [3], regular expressions [4], or leveraging the presence of lines in tables [5, 6, 7, 8], or using clustering [9]. A great survey can be found in [10].

Tables have also been searched for in HTML [11, 12], in free text [13] and in scientific articles, the latter with a method based on matching captions with content [14].

Machine learning methods and deep neural networks were also employed in several papers. The work [15] aims at scientific documents using finely tailored methods stacked atop each other. Reference [16] uses the Fast R-CNN architecture with a novel idea of a Euclidean distance feature to detect tables (which was compared to Tesseract). Reference [17] also uses a (pretrained) Fast R-CNN and an FCN semantic segmentation model for the table extraction problem. In [8] the detection problem is approached bottom-up using the Hough transform, and extraction is solved with Markov networks and features from the cell positions. Reference [18] uses convolutions over the number (and sizes) of spaces in a line. A deep CNN approach was investigated in [19], which combined CNNs for detecting rows and columns, dilated convolutions, CRFs and saliency maps; the authors also developed a web crawler to extend their dataset. We tried and failed to get working results using the YOLO architecture [20] on our dataset. Because it emerged after the development of R-CNNs, Fast R-CNNs and Mask R-CNNs (which are used in some table detection works), it was a natural experiment to run.

For document understanding, a graph representation of a document was examined in [21, 22], while finding similar documents and reusing their gold standards was done in [23].

III Methodology

We define our target as creating a model for document understanding with relevant information detection and classification. The basic unit of our attention will be a word in a document with its area and possibly other features (see the PDF format text data organization [24] for an example). In this text, we will simply call them ’wordboxes’.

By line-item table detection we understand a method that classifies each wordbox in a document as being part of the line-item content or not. To understand the rest of the document, boost the learning process and give other structured information a meaningful class, we have employed various classes. Specifically, the output of the network for each wordbox is a combination of one or more classes that describe the meaning and context of the word. The classes are acquired from annotations and, as it turns out, we are dealing with a multilabel problem with 35 classes in total, examples being the total amount or the recipient address.

To conclude, based on that model’s line-item prediction, various methods like clustering, outlier detection or convex hull could be employed to either detect the table or extract its contents, which is beyond the scope of a single article.

III-A Metrics and evaluation

The scores we will observe during evaluation on test and validation splits are:

  • Scores on line-item wordbox classification.
    In [1], a content-oriented metric was defined for table detection on the character level, each character being either in the table or out of the table. For us the basic unit is a wordbox, so we define our metric per word as the score of the table body wordbox class classification.

  • For other classes we will be looking at micro scores from positive classes, because the counts of positive samples are outnumbered by the negative samples (in total, the dataset contains positive classes).

We chose the micro aggregation rule because it gives more weight to bigger documents (in the number of wordboxes), which we consider more difficult for both humans and machines.

Apart from the model variations and baselines, we have not evaluated other methods from the referenced papers on our dataset, because they did not fit our aim (to detect only some tables and to have as few trainable modules as possible). So in our case, the baseline to compare against is just a logistic regression over the model features.
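To illustrate why micro aggregation favours bigger documents, the following minimal sketch pools counts over documents before scoring and compares that with a per-document average. It assumes an F1-type score purely for illustration (the text above does not fix the metric here), and the function names and counts are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_score(per_document_counts):
    """Pool the counts over all documents first, then compute a single score."""
    tp = sum(c[0] for c in per_document_counts)
    fp = sum(c[1] for c in per_document_counts)
    fn = sum(c[2] for c in per_document_counts)
    return f1_from_counts(tp, fp, fn)


def macro_score(per_document_counts):
    """Score every document separately, then average the scores."""
    scores = [f1_from_counts(*c) for c in per_document_counts]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) wordbox counts: one large and one small document.
counts = [(90, 5, 5), (1, 1, 1)]
micro = micro_score(counts)
macro = macro_score(counts)
```

With these made-up counts the large document dominates the micro score, while the macro score is pulled down by the small document, which is the behaviour the aggregation choice relies on.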

III-B The data and their acquisition process

The data were acquired as a result of the work of annotation and review teams, with automated preprocessing and error-finding algorithms. We have decided to focus our attention on ’easy tables’ first. We define an easy table as a table whose header (if present) is a row and for which the following holds for each column (and row): the column (row) can be selected by a rectangle in such a way that no text from another column (row) significantly (by more than one or two characters) overlaps the rectangle. Columns and rows, while not important for table detection as a task, are important for our preprocessing stage, which we describe next. First, all annotations are snapped to the boxes of words that are located inside the original annotations, with the border case being an intersection, which we allow to be at most the size of a median character in the dataset. Other intersecting annotations are not changed. Then, if any annotations of the same class overlap, we adjust their edges to lie halfway in between so that they no longer overlap. Finally, table bodies are found as the union of overlapping rows and columns. Table header annotations are processed like any other rows. Before these rules were applied, the system reported various errors of annotators in roughly of the annotation labels (rows being swapped with columns, degenerate rectangles, forgotten annotations, etc.).

Classes other than line-items have been annotated by drawing a rectangle over the area containing the target text, possibly composed of more words. Manual inspection of these classes has revealed that the annotations do overlap neighbouring words, so we have decided to select only the wordboxes which do overlap by more than with the annotation box as gold standard labels.
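The selection of gold standard wordboxes by overlap can be sketched as follows. This is only an illustration under assumptions: the overlap is measured here as the share of the wordbox area covered by the annotation, and `min_overlap` is a hypothetical parameter whose value is not given in the text above.

```python
def intersection_area(a, b):
    """Area shared by two boxes given as (left, top, right, bottom)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def select_wordboxes(annotation, wordboxes, min_overlap):
    """Keep wordboxes whose covered share of their own area exceeds min_overlap."""
    selected = []
    for box in wordboxes:
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area > 0 and intersection_area(annotation, box) / area > min_overlap:
            selected.append(box)
    return selected


# Hypothetical annotation rectangle and three wordboxes on the same text line.
annotation = (0, 0, 100, 20)
words = [(5, 2, 40, 18), (90, 2, 130, 18), (200, 2, 240, 18)]
kept = select_wordboxes(annotation, words, min_overlap=0.5)  # placeholder value
```

In this toy case only the first word is kept: the second one is touched by the annotation but mostly lies outside it, which is the situation the threshold is meant to filter out.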


We have a dataset of 3554 PDF files at our disposal with line-item table header and table body annotations and also with other structural information (noted as ’small’ in the results), and a bigger dataset of 25071 PDF files with just structural information (noted as ’big’ in the results). Each dataset is randomly split into 3/4 training and 1/4 validation parts.

The validation set, even though it is composed of documents unseen during training, only measures adaptation and not true generalization, because it can contain similar document layouts from similar vendors. So in addition, we have created another set of 83 documents that have layouts different from all others in the small dataset, so that we can measure generalization and not only adaptation to the validation dataset.

We have decided to also publish an anonymized version of the small dataset, that would contain only the positions and sizes of wordboxes and annotations, no picture information and no readable text information – only a subset of some textual features. The dataset is to be found at

III-C Our approach

We want to operate on the principle of reflecting the structure of the data in the model’s architecture, since machine learning algorithms tend to perform better that way.

What will be the structured information at the input? We have decided to present wordboxes ordered (see below) as temporal data with features, as it is a native format for many neural net architectures.

In addition, we will teach the network not only to detect the line-item table in general, but also to detect the header in the table, which could provide meaningful information, because the headers are always different from the contents.

The features of each wordbox are:

  • Geometrical:

    • By geometrical algorithms we can construct a neighbourhood graph over the boxes, which can be then used by a graph CNN if we bound the number of neighbours on each side of the box by a constant. This will be presented to the network by a special integer input that defines neighbour indexes (neighbours are ordered by direction and then by distance).
      Neighbours are generated for each wordbox () as follows: every other box is assigned to a side of , that has it in its field of view (being fair ), then the closest are chosen for that side. For example with see Figure 1. Note that the relation does not need to be symmetrical, but when a higher number of closest neighbours is used, the sets have a bigger overlap.

    • We can define a ’reading order of words/boxes’. In particular, based on the idea that if two boxes do overlap in a projection to axis by more than a given threshold, set to in our experiments, they should be regarded to be in the same line for a human reader. This not only defines an order of the boxes in which they will be given as temporal data to the network, but also assigns a line number and order-in-line number to each box (and by rotating the document and running the same algorithm, we get a ’column number’). These integers are then subject to a positional embedding. Note, that the exact ordering/reading direction (left to right and top to bottom or vice versa) should not matter in the neural network design, thus giving us the freedom to process any language.

    • Each box has 4 coordinates (left, top, right, bottom) that should be presented to the network also by positional embedding.

  • Textual:

    • Each word can be presented using any embedding; in our case we use tailored features that take the frequency of characters into account, namely the frequency of all characters, the frequency of the first and last two characters, the length of the word, the number of uppercase and lowercase letters, the number of text characters and the number of digits (these numbers are defined as integers and then divided by just to safely fall into the interval). And finally, if the word is in fact a number, then the number scaled and cropped against three different scales into interval. The reason behind these features is that an invoice contains a large number of named entities, ids and numbers, which are not easily embedded.

    • Trainable word features are employed as well, using a convolutional architecture over one-hot encoded characters. Throughout this work, the set of available characters is “abcdefghijklmnopqrstuvwxyz0123456789 ,.-+:/%?$£€#()&”’ after deaccenting and lowering each character; all others are discarded. We do not use trainable word features for the baseline in our experiments.

  • Image features:

    • Each wordbox has its corresponding crop in the original PDF file, where it is rendered using some font settings and also a background picture, which could be crucial to line-item table (or header) detection if it contains lines, for example, or a different background color or gradient. So the network receives a crop from the original image, adjusted to be bigger than the text size to also see the surroundings. We do not use image features for the baseline in our experiments.
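The reading-order idea from the geometrical features above can be sketched in a few lines. This is a simplified illustration, not the exact procedure used in the experiments: it assumes that the overlap is measured on the vertical axis relative to the shorter of the two boxes, and `threshold` is a hypothetical parameter.

```python
def vertical_overlap_ratio(a, b):
    """Shared extent of two boxes on the y axis, relative to the shorter box."""
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(shared, 0.0) / shorter if shorter > 0 else 0.0


def reading_order(boxes, threshold):
    """Return (line number, order in line) for each box, in input order."""
    lines = []  # each line is a list of box indices, lines appear top to bottom
    for i in sorted(range(len(boxes)), key=lambda n: boxes[n][1]):
        for line in lines:
            # Compare against the first box of the line (a simplification).
            if vertical_overlap_ratio(boxes[line[0]], boxes[i]) > threshold:
                line.append(i)
                break
        else:
            lines.append([i])
    positions = [None] * len(boxes)
    for line_number, line in enumerate(lines):
        for order, i in enumerate(sorted(line, key=lambda n: boxes[n][0])):
            positions[i] = (line_number, order)
    return positions


# Boxes are (left, top, right, bottom); the first two share a text line.
boxes = [(50, 0, 90, 10), (0, 1, 40, 11), (0, 20, 40, 30)]
positions = reading_order(boxes, threshold=0.5)  # placeholder threshold
```

Running the same function on boxes with swapped x and y coordinates would yield the ’column number’ mentioned above; the resulting integers are what gets passed to the positional embedding.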

Each presented feature can be augmented; we have decided to apply a random percent wiggle to the coordinates and to the textual feature representation.

Fig. 1: Sample invoice with edges defining neighbourhood wordboxes. Only the closest neighbour is connected for each wordbox.
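The neighbourhood relation drawn in Fig. 1 can be approximated by the sketch below. It rests on assumptions that the text does not spell out: each side is given a 90-degree field of view measured from the box centre, distances are Euclidean distances between centres, and `k` is a hypothetical parameter for the number of neighbours per side.

```python
import math


def centre(box):
    """Centre point of a box given as (left, top, right, bottom)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def side_of(origin, other):
    """Side of `origin` whose 90-degree field of view contains `other`'s centre."""
    ox, oy = centre(origin)
    px, py = centre(other)
    dx, dy = px - ox, py - oy
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "bottom" if dy > 0 else "top"  # y grows downwards on a page


def neighbours(boxes, k=1):
    """For each box, indices of the k closest boxes on each of its four sides."""
    result = []
    for i, box in enumerate(boxes):
        sides = {"left": [], "top": [], "right": [], "bottom": []}
        for j, other in enumerate(boxes):
            if i == j:
                continue
            distance = math.dist(centre(box), centre(other))
            sides[side_of(box, other)].append((distance, j))
        result.append({s: [j for _, j in sorted(c)[:k]] for s, c in sides.items()})
    return result


# Three boxes in a row and one box below the first.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10), (0, 20, 10, 30)]
nb = neighbours(boxes, k=1)
```

In this toy layout box 1 is the right neighbour of box 3, but the left neighbour of box 1 is box 0, which shows why the relation is not symmetrical when only the closest neighbour is kept.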

III-D The architecture

As can be seen in Figure 2 and as we have stated before, the model uses 5 inputs: a grayscaled picture of the whole document; features of the wordboxes, including their bounding box coordinates; one-hot characters, with 40 one-hot encoded characters per word; neighbour ids, integers that define the neighbouring wordboxes on each side of the wordbox; and finally integer positions of each field defined by the geometrical ordering.

The positions are embedded by positional embeddings (defined and used in [25, 26], we use embedding size equal to 4 dimensions for and 4 for , with divisor constant being ) and then concatenated with other field features.
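As an illustration of how an integer position can be turned into a 4-dimensional vector, the sketch below uses the sinusoidal form commonly used with transformers. Both the sinusoidal form and the divisor value of 10000 are assumptions made for this example; the function name is hypothetical.

```python
import math


def positional_embedding(position, size=4, divisor=10000.0):
    """Sinusoidal embedding of an integer position into `size` floats."""
    values = []
    for k in range(0, size, 2):
        # Each sine/cosine pair uses a progressively lower frequency.
        angle = position / divisor ** (k / size)
        values.append(math.sin(angle))
        values.append(math.cos(angle))
    return values[:size]


# Embedding of a line number, to be concatenated with the other field features.
line_embedding = positional_embedding(7)
origin_embedding = positional_embedding(0)
```

All outputs fall into the interval from -1 to 1 regardless of how large the position is, which makes the result easy to concatenate with the other normalized features.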

The picture is reduced by classical stacked convolution and maxpooling approach and then from its inner representation, field coordinates (left, top, right, bottom) are used to get a crop of a slightly bigger area (using morphological dilation) which is then appended to the field. Finally we have decided to give the model an ability to look at the whole image, which is flattened and then processed to 32 float features, which are also appended to each field’s features.

Before attention, dense, or graph convolution layers are used, all the input features are concatenated.

Our implementation of the graph convolution mechanism concatenates features from the sequence by defined indices and then uses just a Dense layer that sees all the neighbours of each node. To note, our graph has a regularity that allows us to simplify the graph convolution - there does exist an upper bound on the number of edges for each node, so we do not need to use any general form graph convolutions as in [27, 28].
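The simplified graph convolution described above amounts to a gather step followed by a shared dense layer. The plain-Python sketch below shows the idea under assumptions: missing neighbours are marked by a negative index and replaced with zeros, and a ReLU activation is used; neither detail is taken from the text, and the function names are hypothetical.

```python
def gather_neighbours(features, neighbour_ids):
    """Concatenate each node's features with those of its neighbours.

    A negative id marks a missing neighbour and is replaced by zeros.
    """
    width = len(features[0])
    gathered = []
    for node, ids in zip(features, neighbour_ids):
        row = list(node)
        for j in ids:
            row.extend(features[j] if j >= 0 else [0.0] * width)
        gathered.append(row)
    return gathered


def dense(rows, weights, bias):
    """Apply the same linear layer followed by ReLU to every row."""
    outputs = []
    for row in rows:
        out = [sum(x * w for x, w in zip(row, unit)) + b
               for unit, b in zip(weights, bias)]
        outputs.append([max(v, 0.0) for v in out])
    return outputs


# Three wordboxes in a row, each with a (left, right) neighbour slot.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
neighbour_ids = [[-1, 1], [0, 2], [1, -1]]
gathered = gather_neighbours(features, neighbour_ids)
out = dense(gathered, weights=[[1.0] * 6], bias=[0.0])  # one output unit
```

Because the number of neighbour slots is fixed, every gathered row has the same width and a single dense layer can process all nodes, which is what removes the need for a general-form graph convolution.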

We have also employed a convolution layer that can see the ordering dimension (called ’convolution over sequence’ later in this text).

The rest of the network handles images and crops. The final output branch has an attention transformer module (from [25]) to be able to look at all the fields in hope that denser and regular areas (of texts in a table grid) can be detected better. Our attention transformer unit does not use causality, nor query masking and has 64 units and 8 heads.

Finally, the output is a multilabel problem, so sigmoidal layers are deployed together with binary crossentropy as the loss function. A possible variation was focal loss, which produced a slightly worse score and is shown in the results below. Other activation functions apart from sigmoidal were also examined, but tended to underperform.
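The difference between the two loss functions can be seen on a single sigmoid output. The sketch below uses the usual focal loss formulation without class balancing; the value `gamma=2.0` is an assumption for illustration and is not stated in the text.

```python
import math


def binary_crossentropy(p, y):
    """Cross-entropy for one sigmoid output p and a binary target y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0):
    """Focal loss: cross-entropy down-weighted for well-classified samples."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy positive (p = 0.9) and a hard positive (p = 0.1).
easy_bce, easy_focal = binary_crossentropy(0.9, 1), focal_loss(0.9, 1)
hard_bce, hard_focal = binary_crossentropy(0.1, 1), focal_loss(0.1, 1)
```

The easy sample contributes about a hundred times less under focal loss than under crossentropy, while the hard sample keeps most of its weight, so rare and hard classes gain relative importance during training.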

The optimizer was chosen to be Adam. The model selected in each experimental run was always the one that performed best on the validation set in terms of loss, while the ’patience’ constant was 10 epochs. Batched data were padded by zeros (per batch, not using global padding), accompanied by zero sample weights for the padding. Class weights were chosen based on positive class occurrences. The network has trainable parameters in total.
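Per-batch padding with zero sample weights, as described above, can be sketched as follows. The function name and the toy data are hypothetical; the point is only that each batch is padded to its own longest document and that padded positions get weight zero so they do not contribute to the loss.

```python
def pad_batch(sequences, feature_size):
    """Zero-pad sequences to the longest one in this batch; weights mask padding."""
    longest = max(len(s) for s in sequences)
    padded, weights = [], []
    for s in sequences:
        missing = longest - len(s)
        padded.append(list(s) + [[0.0] * feature_size for _ in range(missing)])
        weights.append([1.0] * len(s) + [0.0] * missing)
    return padded, weights


# Two documents with one and two wordboxes, two features per wordbox.
documents = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
padded, weights = pad_batch(documents, feature_size=2)
```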

Fig. 2: The model architecture. All features are concatenated together before the self attention mechanism and final layers. The cropping of the picture, embeddings and graph convolution all happen inside the network. Note that can be also called a time distributed dense layer.

IV Experiments

The approach was tested on different data settings and different architectures. There are 4 groups of experiments:

  1. Comparing logistic regression baseline against the neural network.
    To note, logistic regression baselines use all the inputs except the picture and trainable word embeddings. To inspect the importance of neighbouring boxes, we have compared baseline without neighbours and baseline with included information about one or more neighbours at each side (if present).

  2. The importance and effect of each block of layers and each input and other parameters.
    The modules chosen for testing were ’convolution with dropout after attention’ to test the dropout layer, and ’convolution over sequence’ and attention to test the importance of input ordering. Experiments dropping the graph convolution were done by varying the number of neighbours. Experiments on the anonymized dataset also fall into this category. We have also tested the focal loss function [29]. Note that we do not vary the final activations in these experiments and use only sigmoidal ones, because other activations performed worse earlier in the development process.
    The multi-task (line-items and other fields) classification problem is selected for all these experiments.

  3. Specialization on a task where only line-items were classified and specialization on a task with all but line-items.

  4. Evaluating the model’s adaptation performance on the big dataset (without line-items).

We will not be optimizing the number of neurons in the layers.

IV-A Results

Table I summarizes experiments comparing the model against the logistic regression baseline, both with a varying number of neighbours. The logistic regression baselines did improve with more neighbours, but failed to generalize. We can notice the big difference between line-item table detection and other classes, which comes from the heuristic observation that line-item table detection can often be simplified to the task of finding the biggest table. The results also reflect the nature of a specialized structured document, which invoices indeed are: a human could say what the biggest table in such a document is, but classifying all the structured information is not easy for a person not working with invoices.

On the other hand, the optimal number of neighbours for the final architecture was 1, but we can notice that 2 neighbours do help line-item table body detection. We designed the algorithm with more than one neighbour in mind (with a single neighbour, the relation is not symmetrical), so other positional features are possibly being exploited more efficiently.

Table II shows that the multihead attention module helps with generalization to unseen layouts; omitting the module makes the network prioritize adaptation to already seen layouts. Also, without attention the number of training epochs was roughly twice as high (27) as with attention (13). We can also see that focal loss helps line-item header detection, which comes from the fact that focal loss should prioritize smaller classes. The decrease of the non-line-item score for focal loss has the same cause: prioritizing rarer classes affects the micro metric.

The importance of the convolutional layer over the sequence is in line with our initial guess that it gives more importance to the beginnings and ends of lines of words.

Table III compares different inputs and dataset choices. Although the architecture was optimized on the small dataset, the results imply that the model has the capacity to adapt and generalize also on bigger datasets. Looking at the anonymized version of the dataset, where some inputs are missing, it can be concluded that the network can learn to detect tight areas of evenly spaced words, which is what the line-item table is. Even the base text features help the model generalize well. Overall, the score on the anonymized dataset means that the positional information is passed correctly and embedded in the right way for the network.

Table IV shows that the tasks of finding line-items and other structural information do boost each other, with one exception being the header detection: it does help adaptation, but when it is omitted, the generalization score is higher.

The architecture shown in Figure 2 is the ’complete model’, which uses binary crossentropy, all inputs, all modules and a single neighbour at each side of each box. Its generalization performance was on detecting line-items and for other classes ( on similar layouts). To verify what the line-item detection scores mean in practice, we have run the prediction on sample invoices (see Figure 3), where the table is properly marked and the 2 false positives could be filtered out as outliers to the rectangular structure (which the table has to have).

Table I: Experiments against the baseline.

| Experiment | Adaptation: line-items body | Adaptation: line-items header | Adaptation: others micro | Generalization: line-items body | Generalization: line-items header | Generalization: others micro |
|---|---|---|---|---|---|---|
| complete model (without neighbours) | 0.9666 | 0.9969 | 0.8687 | 0.9242 | 0.9876 | 0.6609 |
| complete model (1 neighbour) | 0.9738 | 0.9967 | 0.8790 | 0.9389 | 0.9864 | 0.6650 |
| complete model (with 2 neighbours) | 0.9762 | 0.9963 | 0.8749 | 0.9408 | 0.9860 | 0.6629 |
| logistic regression without neighbours | 0.7594 | 0.9477 | 0.0004 | 0.7560 | 0.9362 | 0.0000 |
| logistic regression with 1 neighbour | 0.8664 | 0.9663 | 0.1482 | 0.8071 | 0.9461 | 0.0327 |
| logistic regression with 2 neighbours | 0.8939 | 0.9724 | 0.2276 | 0.8284 | 0.9493 | 0.0525 |

Table II: Experiments with ablation.

| Experiment | Adaptation: line-items body | Adaptation: line-items header | Adaptation: others micro | Generalization: line-items body | Generalization: line-items header | Generalization: others micro |
|---|---|---|---|---|---|---|
| complete model | 0.9738 | 0.9967 | 0.8790 | 0.9389 | 0.9864 | 0.6650 |
| focal loss | 0.9735 | 0.9969 | 0.8557 | 0.9383 | 0.9878 | 0.6398 |
| no convolution over sequence | 0.9670 | 0.9945 | 0.8638 | 0.9101 | 0.9800 | 0.6237 |
| no attention | 0.9780 | 0.9967 | 0.8806 | 0.9348 | 0.9864 | 0.6487 |
| no convolution with dropout after attention | 0.9646 | 0.9950 | 0.8435 | 0.9168 | 0.9807 | 0.6050 |

Table III: Experiments with inputs variations.

| Experiment | Dataset | Adaptation: line-items body | Adaptation: line-items header | Adaptation: others micro | Generalization: line-items body | Generalization: line-items header | Generalization: others micro |
|---|---|---|---|---|---|---|---|
| complete model (all inputs) | small | 0.9738 | 0.9967 | 0.8790 | 0.9389 | 0.9864 | 0.6650 |
| no text embeddings | small | 0.9702 | 0.9921 | 0.7772 | 0.9108 | 0.9771 | 0.5118 |
| no picture, only some text features | anonym | 0.9694 | 0.9943 | 0.4518 | 0.9185 | 0.9805 | 0.4745 |
| no picture, no text features | anonym | 0.9588 | 0.9848 | 0.6836 | 0.8919 | 0.9549 | 0.2152 |
| complete model (all inputs) | big | N/A | N/A | 0.8487 | N/A | N/A | N/A |

Table IV: Experiments with training target variations.

| Experiment | Dataset | Adaptation: line-items body | Adaptation: line-items header | Adaptation: others micro | Generalization: line-items body | Generalization: line-items header | Generalization: others micro |
|---|---|---|---|---|---|---|---|
| complete model (all outputs) | small | 0.9738 | 0.9967 | 0.8790 | 0.9389 | 0.9864 | 0.6650 |
| only line-items | small | 0.9027 | 0.9950 | N/A | 0.8762 | 0.9766 | N/A |
| no line-item header | small | 0.9736 | N/A | 0.8777 | 0.9394 | N/A | 0.6731 |
| all but line-items | small | N/A | N/A | 0.8632 | N/A | N/A | 0.6247 |
| complete model (other than line-items targets) | big | N/A | N/A | 0.8487 | N/A | N/A | N/A |

Fig. 3: Sample invoice with result of the trained algorithm displayed. Boxes that are predicted to be in a line-item table are shown with underline and rectangle around. Notice other tabular structures being properly unmarked (up to 2 false positive boxes).

V Conclusions

We have found a fully trainable method for table detection and content understanding in structured documents that is able to detect only a specific table and extract only some information from others, even in the presence of imbalanced classes.

Trying to detect line-item headers in a single model did lead the model to underperform, with a hint to use focal loss for such a task. We have also discovered that the attention module was important for generalization, while using only close neighbours led to better adaptation to already seen layouts.

We hope that our contribution lies not only in the exploration of the method and methodology, the importance of graph convolutions and self-attention for this problem, and the relations of the targets in the data, but also in the published annotated anonymized dataset, for which no similar dataset was publicly available before.

Future work

Apart from architecture and hyperparameter tuning for bigger datasets, experiments with different text features or embeddings, image augmentations or comparison with other methods, we would like to mention a possible usage for table extraction.

Given an easy class of tables and the output of our algorithm (and the bounding boxes of words), we could proceed with a table extraction step. It should be sufficient to use a geometrical algorithm (for example one employing sweep lines) to find clear gaps between wordboxes. The gaps should then clearly divide rows and columns.

A more interesting case arises in the presence of rows with multiline texts or wrapped columns. The model could still be used to perform joint table extraction with table detection if we added a concept of ’flags’ and/or edge classes (with neighbouring words). Flags could guide a simple table extraction algorithm by defining the meaning of a ’left/top/right/bottom flag’ as an indicator for events such as ’on the left/top/right/bottom side of the box, there ends/begins another column/row’. The simple algorithm should then sum the flags up and produce a column or row separator based on the result. Another kind of flag could add error awareness by detecting words that would disrupt the flow of the table. Edge classes, on the other hand, could define relations such as ’these two words are in the same row/column’, from which a graph of relations should emerge.
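A sweep over the horizontal extents of the predicted line-item wordboxes is enough to find such gaps for easy tables. The sketch below is only one possible form of the geometrical step mentioned above; the function name and the `min_gap` parameter are hypothetical.

```python
def column_gaps(boxes, min_gap=0.0):
    """Find x intervals that no wordbox covers, sweeping from left to right."""
    if not boxes:
        return []
    intervals = sorted((b[0], b[2]) for b in boxes)
    gaps = []
    covered_until = intervals[0][1]
    for left, right in intervals[1:]:
        if left - covered_until > min_gap:
            gaps.append((covered_until, left))
        covered_until = max(covered_until, right)
    return gaps


# Wordboxes (left, top, right, bottom) of a two-column, two-row table body.
table_boxes = [(0, 0, 30, 10), (50, 0, 80, 10), (0, 12, 35, 22), (52, 12, 78, 22)]
gaps = column_gaps(table_boxes)
```

For this toy table a single gap between x = 35 and x = 50 is reported, which is where a column separator would be placed. Applying the same sweep to the top and bottom coordinates gives row separators.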


The work was supported by the grant SVV-2017-260455. We would also like to thank the annotation team and the rest of the research team.