Identifying Table Structure in Documents using Conditional Generative Adversarial Networks

by   Nataliya Le Vine, et al.

In many industries, as well as in academic research, information is primarily transmitted in the form of unstructured documents (this article, for example). Hierarchically-related data is rendered as tables, and extracting information from tables in such documents presents a significant challenge. Many existing methods take a bottom-up approach, first integrating lines into cells, then cells into rows or columns, and finally inferring a structure from the resulting 2-D layout. But such approaches neglect the available prior information relating to table structure, namely that the table is merely an arbitrary representation of a latent logical structure. We propose a top-down approach, first using a conditional generative adversarial network to map a table image into a standardised `skeleton' table form denoting approximate row and column borders without table content, then deriving latent table structure using xy-cut projection and Genetic Algorithm optimisation. The approach is easily adaptable to different table configurations and requires small data set sizes for training.



There are no comments yet.


page 1

page 2

page 3

page 4


Extracting Tables from Documents using Conditional Generative Adversarial Networks and Genetic Algorithms

Extracting information from tables in documents presents a significant c...

TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition

A table arranging data in rows and columns is a very effective data stru...

Visual Understanding of Complex Table Structures from Document Images

Table structure recognition is necessary for a comprehensive understandi...

Neural Networks for Latent Budget Analysis of Compositional Data

Compositional data are non-negative data collected in a rectangular matr...

Line-items and table understanding in structured documents

Table detection and extraction has been studied in the context of docume...

Parsing Table Structures in the Wild

This paper tackles the problem of table structure parsing (TSP) from ima...

Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval

Tables contain valuable knowledge in a structured form. We employ neural...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

1.1 Problem outline

In the last decades, humankind made many huge leaps forward in communications technology. We have witnessed the birth of the Internet; networks of computers allowing instantaneous digital transmission of knowledge. Yet, paradoxically, the primary method of information transmission in many industries and in academia, even in today’s highly digital world, typically still takes the form of physical documents which are easy for humans to read but hard for computers to process. The text in these documents may now be digital characters in a PDF instead of ink printed on paper (although this is far from guaranteed, particularly in industries like insurance or in local government) but whether digital or printed, valuable hierarchical structure is typically discarded during the process of rendering a document.

Consider the common table: when hierarchically structured relational data is rendered as a table in a document, the data is converted into perhaps the most unstructured form imaginable (including in this article!) – rendered as pixels of an image or characters on a page, where the only structure is to be found in the spatial location of pixels in relation to one another. The data is thus stripped of meta-information relating to its hierarchical structure, and information about relationships of columns and rows is lost.

Perhaps this comes as no surprise when one considers the Second Law of Thermodynamics: the entropy of the Universe always increases; order flows into chaos and structure is lost. It turns out that documents are another good example of this.

Common digital document exchange formats such as PDF do not explicitly represent any information about underlying data structures in tables. Instead, only individual characters and their locations on the page are encoded. Extracting and reimposing structure on data from tables in documents without reference to the original data structure from which they were created is a challenging problem in information extraction.

The idea to use a generative adversarial network (GAN) to extract table structures from scanned documents has been described in Le Vine et al. (2019)

. To make the approach applicable to real world applications a number of additional issues have to be addressed: 1) input images are usually skewed when manually scanned from documents, 2) a set of tables might have particular characteristics significantly different from other tables, 3) training might require a large training set, and 4) achieving stable training of GANs might be challenging. To alleviate this, the study explores the following areas: training with blurry table structures, pre-processing input images to deskew them, model dependence on train set and its size, and, more generally, stable GAN training.

A major concern for applications of GANs is the stability of their training. Since the theoretical understanding of GAN behaviour is still limited, no automatic tuning procedure is available that would guarantee convergence to a stable equilibrium. Instead, we have to rely on a set of heuristics and best practices (see

1.5 for details).

1.2 Mapping Manhattan: bottom-up vs top-down approaches

Existing approaches to the problem of extracting structured tabular information tend to be bottom-up. These approaches first identify pixel-level bounding boxes for table cells, exploiting regularities such as bounding lines and whitespace Kasar et al. (2013); Pinto et al. (2003); Gupta et al. (2019). The bounding boxes are subsequently aggregated into a table.

As a consequence, bottom-up approaches have a tendency to be brittle and sensitive to specific layout quirks. In some cases, they can also depend heavily on correctly setting tuning parameters, such as the number of white pixels required to infer a cell boundary.

The approach by Klampfl et al. Klampfl et al. (2014) uses hierarchical agglomerative clustering to group information at the word layout level, and subsequently builds up a table from sets of words which are assumed to belong to a single column. As with the approaches described previously, this is a bottom-up approach which places the graphical representation of the layout as the highest priority.

Focusing on the graphical structure in this way disregards available information relating to the higher-level structure of the table – it is analogous to creating a street map of Manhattan by walking around at street level and trying to integrate the various turns into a larger map.

A top-down approach, by contrast, assumes that a table in a document is merely an arbitrary graphical representation of an unseen latent data structure. Wang Wang (1996) specifies a model in which the logical and presentational forms of a table are completely decoupled: given an underlying set of data and its hierarchical inter-relations (the logical table), the presentation of such a table in a document is totally arbitrary with regard to its formatting, style, rotation, column widths and row heights, etc.

For a specific graphical representation of a logical table in a document which we want to extract, we can attempt to fit arbitrary representations of candidate logical table structures onto the graphical representation seen in the document. By introducing a measure of distance between the representation of a candidate table structure and the representation seen on the page, a gradient can be followed towards an optimal solution.

In this way, such an approach starts from the assumption that there is an underlying logical data structure which must be recovered, and decouples the search for the correct logical structure from the specifics of the graphical presentation, whilst using the graphical presentation to guide the search. In our street-map analogy, this makes use of the prior knowledge that Manhattan largely takes the form of a grid. Armed with this prior knowledge, we can overlay various different shapes and sizes of grids on top of an aerial photograph of Manhattan, adjusting the grid until it provides a good match to the visual information in the photograph.

1.3 Overview of proposed top-down approach

We propose a two-step generalisable top-down approach to extracting data from tables in documents (shown schematically in figure 1):

Step 1

Translate input images containing tables into an abstracted standard ‘skeleton’ form showing pixel outlines for approximate locations of table cells and boundaries and disregarding actual table content.

Step 2

Produce a first table structure estimate using a simple projection method with the table skeleton from Step 1.

Step 3, optional

Optimise the fit of candidate latent data structures to the generated skeleton image using a measure of the distance between each candidate and the skeleton.


Figure 1: General schematic of the approach

Once a good fit has been found, the data can be extracted from the table image and stored within the discovered structure using standard optical character recognition (OCR) techniques.

1.4 Conditional GAN, xy-cut and Genetic Algorithm

A conditional generative adversarial network (cGAN) adversarially trains two neural networks: a

generator learns to generate realistic “fake” candidates to fool a discriminator which learns to detect the artificially generated input out of a set of real input examples. cGAN has shown remarkable success at generating realistic looking images from input data, for example transforming from aerial photographs to street maps, from greyscale to colour, or filling in outlines with realistic detail Isola et al. (2017). We treat Step 1 of our table extraction method as another form of image translation, by training a cGAN to translate images containing tables into a standardised skeletons denoting table cell borders. Our method for training the cGAN is outlined in section 2.1.1.

An xy-cut method is a projection method that projects an image on x- and y-axis and analyses statistics of the derived projection distributions to approximate locations of horizontal and vertical lines on the original image Nagy and Seth (1984). More details about the method can be found in 2.3.1.

Genetic algorithms (GA) use a population-based approach to sample the search space of possible solutions and to climb a gradient towards an optimum. The underlying representation of a candidate solution (dimensions and offsets of rows and columns of a table) is encoded as a vector of numbers denoted as the

genotype, which is decoded into a phenotype (a rendered image of a table). Individuals within the population are then evaluated by a fitness function acting on the phenotypic representation. The GA optimisation is outlined in section 2.3.2.

1.5 Improved Robustness of GAN training

Several papers have explored different possibilities to improve the stability of GAN training Tim et al. (2016); Lars et al. (2018).

We found that the following combination of approaches is helpful:

  1. Usage of Gaussian weight initialization

  2. Usage of Adam for Stochastic Gradient Descent

  3. Usage of down-sampling (using strided convolutions) instead of pooling in the discriminator and up-sampling in the generator

Moreover, working with gray-scale rather than black-and-white output images for training improves continuity and differentiability of the cGAN objective function. Further ideas, which could be added in the future, include adding Gaussian noise to discriminator input and use of additional training examples for the generator at each cycle.

1.6 Comparison to other approaches

A comprehensive comparison versus state-of-the-art systems is out of the scope of this paper, but nevertheless, to bring context to our approach we would like to comment on two representative commercial approaches: FlexiCapture from Abbyy111 and CharGrid from SAP Katti et al. (2018). FlexiCapture is similar to other approaches we have seen, in that the user is required to manually define a hierarchical template for different “classes” of document (e.g. certain types of forms that the user regularly has to deal with), but the user is helped by some automatic estimation of table separators. This approach does not generalise well to previously unseen table layouts in new documents, unlike our approach which attempts to infer the latent data structure directly from the table image with no manual intervention. CharGrid Katti et al. (2018)

uses encoder-decoder convolutional neural networks to classify elements of a page and discover bounding boxes, including components of tables. This is in some ways similar to our approach for identifying table components visually in 2D space. In the case of tables, however, these elements must still be integrated together according to a bottom-up approach (see section

1.2) based on character locations on the page. Whilst the CharGrid representation for 2D text is powerful, it nevertheless does not make specific use of the prior assumptions relating to latent structure in table data which a top-down approach, such as we are proposing, can benefit from.

2 Methods

We formulate table structure as an image-to-genotype translation problem, whereby a table image is translated into a set of numbers (genotype) that fully characterise the table structure, i.e. column, row, and table positions and sizes. This is solved in the following stages: 1) translating an input table image (scan) into a corresponding table cell border outline image - table ‘skeleton’, and 2) transforming the table skeleton into a latent table data structure - table genotype, and optionally 3) optimising the table genotype if needed. (figure 1).

The first stage masks out any text present in a table image, keeps any table cell borders present on the image, and adds any missing cell separators as implied by whitespace dividers in the input image. The tranformations are achieved using a conditional Generative Adversarial Network (cGAN) Isola et al. (2017). Then, the second and third stages estimate and optimise (if needed) a table structure parameterisation using the table skeleton image derived at the first stage. The xy-cut projection method Nagy and Seth (1984) estimates an initial table genotype. Meanwhile, Genetic Algorithm optimises table genotype in a way that the corresponding table phenotype (table rendered as an image) is close to the table skeleton produced by the cGAN (figure 1). The optimisation is performed within an a-priori selected class of tables, e.g. whether cell merging/splitting is allowed across rows/columns, or not. Note that, while it can be relaxed, an underlying assumption is that there is no text surrounding a table on an input image, and that all elements on an input image belong to one table.

2.1 Table image to table skeleton transformation

A table image is mapped into a corresponding table skeleton (all row and column dividers, even if not present on the table image; and text masked out) using cGAN.

2.1.1 cGAN architecture

First, to reduce cGAN model complexity (number of parameters) and to optimise image-processing time, a table scan is resized to a smaller image of 256x256. Then the resized table image is transformed into a table skeleton using the cGAN architecture from Isola et al. (2017) as follows.

A conditional GAN approximates a mapping from input image and random noise vector to output image , so that . Generator

is trained to produce outputs that cannot be distinguished from the target output images by a discriminator

that is adversarially trained to detect ‘fake’ images. Generator is an encoder-decoder network, so that an input image is passed through a series of progressively down-sampling layers until a bottleneck layer, where the process is reversed. To pass sufficient information to the decoding layers, a U-Net architecture with skip connections is used Ronneberger et al. (2015); Isola et al. (2017), and a skip connection is added between each layer and via concatenation, where is the total number of layers, and is a layer number in the encoder Isola et al. (2017). Further, following Isola et al. (2017), random noise is provided only in the form of dropout applied on several layers of the generator at both train and test times.

The discriminator

takes two images – an input image and either a target (ground truth) or output image from the generator – and assigns a probability that the second image is generated by a generator or not, i.e. the image is real. A convolutional PatchGAN architecture is used for the discriminator D

Li and Wand (2016); Isola et al. (2017) that penalises output image structure at the scale of patches. In particular, the discriminator classifies whether each x patch in an image is real or fake, so that equals x matrix containing probabilities for the patches in to be real, i.e. representative of a target image given input image . We evaluated four discriminators of varying complexity, so that each configuration has a different number of convolutional layers used ranging from three to six (for example, Isola et al. (2017) uses five layers for their image-to-image transformation).

2.1.2 cGAN optimisation and inference

The cGAN parameters are selected by optimising the following objective function with respect to generator and discriminator :


Where is - distance between an image produced by a generator and an output image averaged over the training set, is a constant weight (), and expresses a cGAN objective to generate images that cannot be discriminated from real images:


Mixing of and is found to be beneficial in previous studies, with encouraging less blurring and suppressing artefacts Pathak et al. (2016); Isola et al. (2017).

The cGAN is trained by alternating between optimising with respect to the discriminator , and then with respect to the generator Goodfellow et al. (2014). Optimisation is performed on a set of 4,000 table scan/skeleton pairs (see section 2.2 for details of training set generation) using stochastic gradient descent applying the Adam solver Kingma and Ba (2014), with learning rate 0.001, and momentum parameters =0.9, =0.999. Following Ulyanov et al. (2016) and Isola et al. (2017), the generator is used with dropout and ‘instance normalisation’ (batch normalisation with a batch of 1) at inference time.

2.2 Random table generator

Due to the lack of empirical table image-to-skeleton and table image-to-genotype data available for training and testing, we synthetically generated the required sets using random table generator for four distinct table configurations.

2.2.1 Table genotype

A table structure is described as a set of the following numbers, defined as table genotype:

  • table cardinality – number of rows , and number of columns ;

  • coordinates of the upper left corner of the table ;

  • vector of row heights ;

  • vector of column widths 0.

The number or rows and number of columns given a-priori express an expected maximum table cardinality. For a specific table, when a number of rows and columns is smaller than and , respectively, the corresponding row heights and column widths are set to zero in the table genotype. The genotype represents only a simple 2-D table class in the case when merging of table cells is disabled. The table class can be (but not pursued in this study) further extended by introducing a cell merge indicator array into the table genotype, where is set to a unique number indicating what type of cell merging (if any) is needed for cell in th row and th column.

2.2.2 Table phenotype

Each table genotype can be used to parameterise an XHTML representation of the table which is rendered into an image - table phenotype - using the imgkit Python library. Two types of tables borders are examined for the corresponding table skeletons - solid and blurry. The solid borders are parameterised in XHTML as:

 border: 3px solid black;

and the blurry borders are given by:

 border: 3px solid black
 box-shadow: inset 0px 0px 7px 3px black;

so that the table border blur is 7 pixels, the spread is 3 pixels, and its offset is set to zero. The blurry borders allow accounting for the potential non-uniqueness of an optimal table skeleton corresponding to a table image, as reported in Le Vine et al. (2019), where the authors note that moving a column or row border by a small number of pixels still results in a correct interpretation of a table structure. Furthermore, working with gray-scale rather than black-and-white images might help improving continuity of the cGAN objective function (eq. 1), especially, continuity of its component that accounts for distance between generated and output image.

Training and test sets for cGAN consist of table scan/skeleton image pairs, and a table scan image requires adding random text into the table cells as well as choosing, at random, row and column separators to be visualised in the rendered image. The corresponding table skeleton ground truth excludes the text and shows all column and row separators. Two sets of such table skeletons are generated - with non-blurry and blurry borders - figure 2 panels A and B show an example of a blurry pair, and figure 2 panels A and C show an example of a non-blurry pair.

Figure 2: Examples of ’blurry and ’non-blurry’ table pairs: A - table image, B - corresponding blurry table skeleton, and C - non-blurry table skeleton.

Following Le Vine et al. (2019), where it is noted that a cGAN model is sensitive to text font size, inter-column / inter-row spacing, and cell width and height, the model is trained on tables with four different configurations defined in table 1. The training set and the test set each contain 4,000 tables, so that there are 1,000 tables of each of the four configurations.

Configuration Rows Cols x-offset, ptx y-offset, ptx Row height, px Column width, px Word length, chars Words per cell Font size, px
Base 2–6 2–6 0–70 0–70 40–90 70–100 5–9 2–4 10
Larger font 18
Smaller font 6
Short cells 4–10 4–10 20 40–60 1–4 1
Table 1: Four table configurations. “” refers to values the same as the ‘Base’ configuration.

2.3 Estimating table structure for a table skeleton

The derived table skeleton is projected on horizontal and vertical axis to estimate the corresponding table structure; and, optionally, can be used to parameterise an objective function that defines table structure fitness for optimisation by Genetic Algorithm.

2.3.1 Projection-based estimation of table structure

Table structure is estimated by projecting a colour-modified table skeleton onto horizontal and vertical axis (table borders are assumed nearly parallel to image borders) using the xy-cut method Nagy and Seth (1984). The colour modification classifies each image pixel as ‘black’ or ‘white’ using a threshold of 125 on pixel luminance value, so that pixels with luminance above the threshold are classified as ‘white’, or ‘black’ otherwise. The following RGB-to-luminance (“black-and-white” portion of an image) transformation is used ITU-R (2011):


where is red intensity, is green intensity, and is blue intensity for a pixel; each intensity value ranges from 0 to 255.

It is noted that cGAN image may be somewhat noisy, so that an image may have a small number of short black line artifacts that do not correspond to any column/row separator. To minimise the effect, only those vertical/horizontal lines from the xy-cut are included that are at least 0.25 times as long as the longest identified vertical/horizontal line for the image. Both parameters – the white-colour threshold (125) and the length parameter (0.25) – are manually selected using an independent (validation) set of table images that is neither used for training nor testing.

2.3.2 Optimisation with genetic algorithm

The initial table structure guess can be further optimised using a Genetic Algorithm (GA). GA uses reproduction, crossover, and mutation to evolve a population of tables from epoch

to epoch , so that table structure fitness is improved following the gradient of an objective function. Each table has its own genotype and phenotype as defined in section 2.2, and the algorithm selects candidate tables based on the table fitness defined below. Parameters for GA, e.g. mutation rate, survival rate, are set to the values used in Le Vine et al. (2019) and are given below. The algorithm is specified to have converged when table fitness does not improve more than 1% over three consecutive epochs.

In the algorithm, reproduction carries the best table structure over to the next epoch with no mutations (elitism), as well as 70% of other table structures with mutation. Further, the offspring mutation modifies table upper left corner coordinates, individual row heights, and column widths with a probability 0.1 for each entry in the genotype. The offspring mutation also modifies table structure (adding, merging, removing column/row) with probability of 0.1 per table dimension (columns or rows), so that the three structural operations are equally likely (probability of 0.03). Lastly, crossover is based on two parents, so that the upper left table x-coordinate and columns are inherited from the first parent, while the upper left table y-coordinate and rows are inherited from the second parent.

Following Le Vine et al. (2019), the objective function that we minimise is the fraction of non-overlapping non-white pixels between cGAN and the candidate image, calculated as:


where pixels of the output image from cGAN and a candidate table phenotype are scaled to values between 0 and 1, with 0 corresponding to a black pixel, and 1 corresponding to a white pixel. The objective function 4 varies between 0 and 1, so that the best match corresponds to 0 and the worst guess corresponds to 1.

As shown in Le Vine et al. (2019), while being computationally intensive, a further table optimisation that follows an initial table estimation, brings limited improvements. The reason to include this procedure here is two-fold: 1) to assess GA benefits for tables with blurry borders (not considered in Le Vine et al. (2019)), and 2) to introduce an optimisation strategy for tables with more complex structures than the ones used in the work, e.g. tables with merged/split cells.

2.4 Performance metrics

The proposed method is applied to images with 595x842 pixel resolution (corresponding to A4 paper size at 72 ppi) that are generated by the random table generator described in section 2.2 with parameters given in table 1. Conditional GAN is trained using 4,000 such table images, and a further 4,000 images are generated for performance evaluation, so that each set has 1,000 images for each of the four table configurations described in table table 1. The performance is assessed separately for each table configuration (table 1) by comparing row and column numbers, upper left corner positions, row heights and column widths and calculating the following metrics:

  1. Percentage of tables with correctly identified

    • row number,

    • column number;

  2. Average errors in incorrectly identified

    • row number,

    • column number;

  3. Average errors in table upper left corner

    • x-coordinate,

    • y-coordinate;

  4. average percentage errors in table

    • row heights,

    • column widths.

Errors are calculated as true value less corresponding predicted value, and relative errors are errors divided by corresponding true values. Average errors in column and row numbers are calculated only for tables where those are incorrectly identified. Meanwhile, error averages for column widths and row heights as well as average errors for upper left corner table coordinates are calculated only for tables with correctly identified numbers of columns and rows. This avoids double penalisation for identifying rows/columns incorrectly, e.g. when a top row is erroneously added or excluded, the y-coordinate is under-estimated or over-estimated, respectively.

2.5 Processing skewed table images

In order to asses the robustness with respect to skewed input images, additional test data sets were derived from the original test images by rotating each image by a fixed angle using the ImageMagicLLC (2019) rotate command. Integer valued angles between -30 and +30 degrees were used in steps of 5 degrees resulting in 13 test sets.

The deskewing process is shown for an example input table in figure 3. First the input images are assured to have skew angles compatible with 0, and then the different test sets are derived as described (the example of a 15-degree skew is shown in figure 3.) Images were rotated with the following command using the imageMagic package:

 convert -background white -bordercolor white
         -rotate D  A.png B.png

Where D is the positive or negative integer number of degrees. We changed the default from black to white background and border colour since black edges and triangles introduced by the rotation will otherwise confuse the GAN and reduce performance.

It is important to note, that both the rotation and deskewing (back rotation) operations translate images towards the bottom right as well as extend overall image dimensions (in number of pixeles) to accommodate the rotated image while keeping the original image content size constant within a larger image (Panel C of Fig 3).


Figure 3: Illustration of the four-step skewing/deskewing process. An example input table without skew (A). The same table rotated by 15 degrees (B). The result of applying the deskew algorithm with B as input (C). Cropping applied to C (D). The image sizes in pixels are: 595x842 (A), 795x970 (B), 1015x1140 (C), 595x842 (D).

To satisfy the GAN input dimensions, the processed images were cropped as follows:

 convert C.png -crop DXxDY+X0+Y0 D.png

Here DX and DY are dimensions of the output image (in pixels), and X0 and Y0 define the position of its upper left corner relative to the input (larger) image. The values were calculated by comparing the dimensions of images C and A (denoted and for the width and and for the height.):


Their accuracy is thus directly related to the accuracy of the skew angle estimation.

The deskewing is based on a Hough transformation using the deskew packageMauder (2019) as follows:

 deskew -a 35 -b FFFFFF -o C.png B.png

The maximal range of skew angles considered (using parameter in the above command) is an important input, as performance decreases rapidly for angles larger than the default of degrees. Small remaining skew angles are found after the deskew transformation in many cases, especially for highly skewed input images. To improve this, the deskew command is applied multiple times in sequence (five times in this study), so that the remaining average skew angles are below 2 degrees for all input images.

3 Experiments

We conducted the following types of experiments are conducted for table structure estimation to compare models trained using: 1) blurry vs. non-blurry table skeletons, 2) different complexities of the cGAN discriminator, 3) different sizes of train data sets. The xy-cut method is utilised to assess table structures, while the GA optimisation is only used to show improvements achievable when more time is available for table processing. Lastly, the algorithm is applied to example tables from ICDAR2013 competition Orsi. (2013) to illustrate a need for a training set with more realistic table configurations.

First, a cGAN model with 5 layers in its discriminator is trained on 4,000 table pairs with blurry table skeletons (see figure 1 A, B), and table structure is estimated using xy-cut method. The performance is compared with the model performance from Le Vine et al. (2019), where an identical model configuration is used (cGAN with 5 layers in discriminator and xy-cut) that is also trained on 4,000 table pairs, but with non-blurry table skeletons (Table 2).

The results reported in Table 2 for the ‘non-blurry’ model are somewhat different from the results reported in Le Vine et al. (2019), due to slightly different metrics used as well as due to a different test set. When the ‘non-blurry’ model is trained on the new data set generated for this paper with the same table configurations as in Le Vine et al. (2019), the corresponding performance metrics are significantly different from the ones produced by the pre-trained model (based on three re-starts) – the ‘base’ and ‘larger font’ configurations have significantly worse scores, and the ‘smaller font’, and ‘short cells’ configurations have better scores.

Overall, table 2 shows that the ‘blurry’ model trained on blurry skeletons has a superior performance than the ‘non-blurry’ model based on its high column and row number recall rates (97.9–100.0 for each table configuration), while the non-blurry recall rates are 64.4–96.6. The improvement might be due to the improved continuity and differentiability of the cGAN objective function (eq. 1), which operates with gray-scale, rather than black-and-white, output images when blurry model is trained. Model error in column and row number is somewhat better for the non-blurry model, so that when the blurry model is wrong it makes a slightly larger error (misses/adds more rows/columns) than the non-blurry model. Note, the blurry table set with errors in column /row number is much smaller, so that the statistics are not directly comparable between the models. Lastly, all errors in the table upper left corner coordinates, column widths, row heights are small for both configurations, ranging between 0.2 and 5.0% for the non-blurry model, and between 0.0 and 3.3. for the blurry model.

-2.5cm Non-blurry, , Blurry, , Blurry, , Metric Base Small Large Short Base Small Large Short Base Small Large Short Correct row number, % 88.7 90.8 64.4 81.1 98.6 98.1 97.9 98.7 99.8 98.9 96.4 99.4 Correct column number, % 92.6 92.5 91.6 96.6 99.6 98.1 93.3 100 99.7 99.3 99.3 100.0 Error in row number 0.6 -0.3 -1.0 -0.6 -1.0 -1.2 1.0 -1.0 0.0 0.3 1.1 -0.7 Error in col. number -0.2 0.5 0.2 0.9 -1.0 -1.7 0.9 - -1.0 -1.7 1.2 - Error in x0, px 1.0 1.7 -0.2 1.2 1.5 3.3 1.3 2.8 0.3 0.7 -0.9 0.4 Error in y0, px -1.5 -0.1 -5.0 -0.7 -0.9 -1.8 -0.4 -1.7 -2.1 -1.9 -1.5 -1.8 Error in column width, % -0.3 -1.5 0.0 -0.9 0.5 -0.1 0.8 0.1 1.1 1.5 1.7 1.1 Error in row height, % -0.2 -1.5 1.0 -2.1 0.2 0.9 0.0 2.2 0.8 0.9 0.1 2.2

Table 2: Metrics for models trained on blurry and non-blurry table skeletons with or convolutional layers in discriminator denoted as , or evaluated on four table configurations - base (‘Base’ as shown in the table), smaller font (‘Small’), larger font (‘Large’), and short cells (‘Short’) configurations, all trained on table pairs. Best performances for correct number of rows and columns are given in bold font.

Second, four other models are trained on 4,000 table pairs with blurry table skeletons, so that they differ by the number of convolutional layers in the cGAN discriminator – six, four, and three layers (the original configuration adopted from Isola et al. (2017) has five layers). The six-layer configuration has much lower row/column number recall rates than the five-layer model for two of the table configurations – ‘base’ and ‘larger font’, and similar performances for the other two table configurations (not shown). Meanwhile, the four- and three-layer models have, on average, slightly better recall rates, and the three-layer model has the best recall rates, except for its row number recall for the ‘larger font’ configuration (Table 2).

All performance metrics above are based on table structure estimated from a cGAN skeleton using the xy-cut method, and figure 4 illustrates estimation procedure for a ‘short cells’ table configuration that has the row number incorrectly identified (one row missed) as well misaligned row dividers (Figure 4 C). To refine the estimate, table structure is further optimised using GA algorithm described in 2.3.2, so that the row dividers align better, but the missed row is not added, due to the GA objective function specifics in equation 4.

Figure 4: Procedure illustration example for a table image A: B - cGAN generated skeleton, C - xy-cut-based estimate (blue) overlaid with the table image, and D - GA-optimised estimate (blue) overlaid with table image.

Third, the best performing ‘blurry’ model with three layer discriminator trained on 4,000 samples is additionally trained on 10% and 1% of the data set size (400 and 40 tables, respectively) with equal number of tables from each table configurations (100 and 10 tables, respectively). Table 3 shows that the model trained only on 400 table pairs (10% of the data) performs comparably to a model trained on 10 times the data (4,000 samples); and that the model trained on only 1% of the original data size has high recall rates and low errors in table positioning and column/row sizes. Note, all models are assessed on the same test data set with 4,000 tables, so that each table configuration is represented by 1,000 tables.

-2.5cm Blurry, , Blurry, , Blurry, , Metric Base Small Large Short Base Small Large Short Base Small Large Short Correct row number, % 99.8 98.9 96.4 99.4 99.4 96.8 98.1 98.6 93.4 84.7 93.3 73.1 Correct column number, % 99.7 99.3 99.3 100.0 100.0 98.4 100.0 100.0 99.1 94.6 97.9 99.7 Error in row number 0.0 0.3 1.1 -0.7 -0.3 -0.6 -0.2 -1.0 0.0 -0.8 -0.4 -0.7 Error in column number -1.0 -1.7 1.3 - - -1.4 - - -1.0 -1.3 -0.9 -0.3 Error in x0, px 0.3 0.7 -2.0 0.4 0.5 0.6 -0.4 0.3 2.0 1.7 1.9 0.5 Error in y0, px -2.1 -1.9 -1.9 -1.8 -2.1 -1.5 -2.2 -1.5 -0.3 0.4 -0.6 -0.7 Error in column width, % 1.1 1.5 1.7 1.1 1.3 1.9 1.0 1.2 0.0 1.0 -0.2 0.8 Error in row height, % 0.8 0.9 0.2 2.2 0.8 1.1 0.1 2.1 -0.1 0.5 -0.7 0.3

Table 3: Metrics for models trained on different train set sizes - , , and table pairs with blurry skeletons, with convolutional layers in discriminator denoted as , and evaluated on four table configurations – base (‘Base’ as shown in the table), smaller font (‘Small’), larger font (‘Large’), and short cells (‘Short’) configurations. Best performances for correct number of rows and columns are given in bold font.

Lastly, the best model – blurry model with three-layer discriminator, trained on 4,000 samples – is applied to selected tables in the ICDAR2013 table competition data set Orsi. (2013). As the model is not trained to distinguish a table from document surrounding, e.g. other text and figures, test tables are cropped out (with some white buffers) from the ICDAR2013 sample pages and used as input images to the model. Figure 5 A illustrates a cropped table from the ICDAR2013 set, and shows that the model splits merged cells (as it is not designed to process such table configurations), and is unable to separate table rows due to the low inter-row spacing (Figure 5 B). Removing the top merged cells as well as adding additional white space buffers between rows helps to improve the table structure estimate provided by the model (Figure 5 C).

Many tables in the ICDAR2013 table competition set have a much more complex structure than that shown in figure 5 A and significantly deviate from the four considered table configurations, hence a full performance assessment is not attempted here, and only some identified model sensitivities are reported below:

  • table location on the image,

  • low or high intra-row and intra-column spacing,

  • sensitivity to presence of empty cells.

Processing the ICDAR2013 tables will require training on more realistic generated tables or empirical tables (e.g. from ICDAR2013 train set) with corresponding paired table skeletons (which ICDAR2013 does not have)l . Additionally, it will require accounting for merged or split cells in the projection or/and optimisation algorithm steps.

Figure 5: Table structure estimation example for a cropped out table A from ICDAR2013 table competition test set Orsi. (2013): B - estimate based on A as input table, C - estimate based on a modified input table A. The estimated structures are shown in blue.

3.1 Robustness with respect to skewed input images

Robustness with respect to skewed input images is an important aspect for the practical applicability of the table extraction technique. Figure 6 shows that the performance measured as percentage of error-free table structures quickly drops and reaches practically zero for absolute angle values greater than 5 degrees. Here, error-free table structures means that both the number of rows and number of columns have been identified correctly. Furthermore, the pixel error defined as a sum of average error in row heights (in pixels) and average error in column widths(in pixels) increases to a maximum value for rotation angles different from 0, this maximum value corresponds to average cell width and height.


Figure 6: Performance of the table extraction algorithm as a function of rotation angle when rotated image is not corrected (left) and when the image is corrected with deskew pre-processing (right). The performance is measured by percentage of error-free table structures (top row), and by ‘pixel error’ for row heights and column widths (bottom row).

Performing the deskew pre-processing as described in section 2.5 allows recovery or nearly full performance for the range of tested input angles (see two right-hand-side panels of figure 6). The decreasing precision of the deskew algorithm is reflected by small drops in performance for skewing angles above 20 degrees (and below -20 degrees) as observed in Fig. 6.

Lastly, a close inspection of the back-rotated image reveals that the lines loose sharpness, and hence might have slightly different characteristics than the tables used to train the model.

4 Conclusions

This study addresses the problem of estimating table structure from scanned table images. The problem arises in many applications working with paper documents or document scans, when structured layout has to be identified and converted into a database entry, or used for different information extraction tasks. The problem can have several valid solutions, as table row and column divider positions can be slightly perturbed and still accurately separate cell contents, due to padding around the dividers between rows and columns. For simplicity, the study focuses on a class of tables that does not contain merged or split cells; a further extension to a wider class of tables is theoretically possible, but left as an item for future research.

The problem is re-formulated as an image-to-genotype translation, and is solved in two stages: 1) translating an input table image into a corresponding table ‘skeleton’ using cGAN, and 2) fitting a latent table structure (table genotype) to the derived table ‘skeleton’ using the xy-cut method, and, optionally, Genetic Algorithm optimisation. When opted for, GA utilises a cGAN generated table structure skeleton for parameterising GA objective function that makes the function specific for each table image.

The conditional GAN model is trained on pairs of table images and corresponding table structure ‘skeletons’ where text is masked out and any missing row/column dividers are shown as blurry lines. This is different from the setup used in Le Vine et al. (2019), when table structure skeletons are thin black lines of the same colour (‘non-blurry’ lines). The training set is generated by a random table generator and includes four distinct table configurations that differ by font size, column/row spacing, number of columns and rows, and cell content length. The derived solution has the following properties:

  • Using blurry column/row separators leads to more robust and stable model behaviour, as compared to the model trained on non-blurry table structure skeletons.

  • Simpler cGAN structure outperforms more complex cGAN structures with a deeper discriminator network.

  • The solution is adaptable to different table types, and table structure estimation quality is consistently high for all four considered table configurations.

  • Reducing the size of the 4,000-table training set to 10% or 1% of the original training set size (i.e. hundreds or dozens of tables) had no or limited effect on the prediction quality.

  • The cGAN model is sensitive to table and cell specifics, in particular large variations in row/column padding, word spacing, and table location on a page.

  • Given the limitations of the xy-cut method as a projection-based method, tables with split/merged cells would require an optimisation method (such as GA) to further refine column and row dividers.

  • A pre-processing in form of deskewing allows high quality estimates when working with rotated table images, e.g. that are skewed due to document scanning.

The approach appears promising, based on the high model adaptability to multiple table configurations that we demonstrate, and the low demands on training set size. To be used in practice, the method will need a training set with pairs of realistic tables and corresponding table structure skeletons. Furthermore, working with more complex table structure, e.g. with split/merge cells, would require additions to table genomes as well as a further optimisation of the xy-cut produced table estimates, for example by Genetic Algorithm. Lastly, the solution could be combined with an algorithm to detect the table area for tables in documents surrounded by text, as well as combined with an OCR algorithm to extract text into a data structure.

5 References

  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.1.2.
  • A. Gupta, D. Tiwari, T. Khurana, and S. Das (2019) Table detection and metadata extraction in document images. In Smart Innovations in Communication and Computational Sciences, pp. 361–372. Cited by: §1.2.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In

    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    pp. 5967–5976. Cited by: §1.4, §2.1.1, §2.1.1, §2.1.1, §2.1.2, §2.1.2, §2, §3.
  • ITU-R (2011) Recommendation itu-r bt.601-7: studio encoding parameters of digital television for standard 4:3 and wide screen 16:9 aspect ratios. International Communications Union. Cited by: §2.3.1.
  • T. Kasar, P. Barlas, S. Adam, C. Chatelain, and T. Paquet (2013) Learning to detect tables in scanned document images using line information. In ICDAR, pp. 1185–1189. Cited by: §1.2.
  • A. R. Katti, C. Reisswig, C. Guder, S. Brarda, S. Bickel, J. Höhne, and J. B. Faddoul (2018) Chargrid: towards understanding 2d documents. In

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    pp. 4459–4469. Cited by: §1.6.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.1.2.
  • S. Klampfl, K. Jack, and R. Kern (2014) A comparison of two unsupervised table recognition methods from digital scientific articles. D-Lib Magazine. Cited by: §1.2.
  • M. Lars, G. Andreas, and N. Sebastian (2018) Which training methods for gans do actually converge?.

    International Conference on Machine Learning, arXiv preprint arXiv: arXiv:1801.04406v4

    Cited by: §1.5.
  • N. Le Vine, M. Zeigenfuse, and M. Rowan (2019) Extracting tables from documents using conditional generative adversarial networks and genetic algorithms. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1.1, §2.2.2, §2.2.2, §2.3.2, §2.3.2, §2.3.2, §3, §3, §4.
  • C. Li and M. Wand (2016) Precomputed real-time texture synthesis with markovian generative adversarial networks. In European Conference on Computer Vision, pp. 702–716. Cited by: §2.1.1.
  • I. S. LLC (2019) ImageMagic. Note: https://imagemagick.orgAccessed: 2019-09-20 Cited by: §2.5.
  • M. Mauder (2019) Deskew. Note: 2019-09-20 Cited by: §2.5.
  • G. Nagy and S. Seth (1984) Hierarchical representation of optically scanned documents. In Proc. 7th Int. Conf. Pattern Recognition, pp. 347–349. Cited by: §1.4, §2.3.1, §2.
  • G. Orsi. (2013) ICDAR 2013 table competition. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1449–1453. Cited by: Figure 5, §3, §3.
  • D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544. Cited by: §2.1.2.
  • D. Pinto, A. McCallum, X. Wei, and W. B. Croft (2003) Table extraction using conditional random fields. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pp. 235–242. Cited by: §1.2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.1.1.
  • S. Tim, G. Ian, Z. Wojciech, C. Vicki, R. Alec, and C. Xi (2016) Improved techniques for training gans. arXiv preprint arXiv:1606.03498. Cited by: §1.5.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §2.1.2.
  • X. Wang (1996) Tabular abstraction, editing, and formatting. Ph.D. Thesis, University of Waterloo. Cited by: §1.2.