Optimized Table Tokenization for Table Structure Recognition

05/05/2023
by Maksym Lysak, et al.
Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML or LaTeX) that represents the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimized. We propose a new, optimized table-structure language (OTSL) with a minimized vocabulary and specific rules. OTSL reduces the vocabulary to 5 tokens (HTML needs 28+) and shortens the sequence length to half of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.
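To make the idea of a five-token structure language concrete, here is a minimal sketch that expands such a token sequence into HTML table markup. The abstract does not list the tokens, so the names used here (`C` for a cell, `L` for "merge with the cell to the left", `U` for "merge with the cell above", `X` for the interior of a two-dimensional span, `NL` for end of row) and their simplified semantics are assumptions for illustration, as are the helper names `split_rows` and `otsl_to_html`. It is not the authors' implementation.

```python
from typing import List

# Assumed five-token vocabulary (names are illustrative, not taken from the abstract).
VOCAB = {"C", "L", "U", "X", "NL"}


def split_rows(tokens: List[str]) -> List[List[str]]:
    """Split a flat token sequence into grid rows at each NL token."""
    rows, current = [], []
    for tok in tokens:
        if tok not in VOCAB:
            raise ValueError(f"unknown token: {tok!r}")
        if tok == "NL":
            rows.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # tolerate a missing final NL
        rows.append(current)
    return rows


def otsl_to_html(tokens: List[str]) -> str:
    """Expand the token grid into HTML, deriving spans from L and U tokens."""
    grid = split_rows(tokens)
    if len({len(row) for row in grid}) > 1:
        raise ValueError("all rows must have the same number of tokens")
    parts = ["<table>"]
    for r, row in enumerate(grid):
        parts.append("<tr>")
        for c, tok in enumerate(row):
            if tok != "C":
                continue  # L, U and X positions are covered by a spanning cell
            # Count L tokens to the right to get the column span.
            colspan = 1
            while c + colspan < len(row) and row[c + colspan] == "L":
                colspan += 1
            # Count U tokens below to get the row span.
            rowspan = 1
            while r + rowspan < len(grid) and grid[r + rowspan][c] == "U":
                rowspan += 1
            attrs = ""
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            parts.append(f"<td{attrs}></td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)
```

For example, `otsl_to_html(["C", "L", "C", "NL", "C", "C", "U", "NL"])` describes a 2x3 grid in which the first cell spans two columns and the last column's cell spans two rows. Because every grid position holds exactly one token and rows share a common length, a rectangular layout is enforced by construction, which is one plausible reading of why the abstract describes predicted structures as always syntactically correct.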

