Hierarchical multimodal transformers for Multi-Page DocVQA

12/07/2022
by Ruben Tito, et al.
Document Visual Question Answering (DocVQA) is the task of answering questions about document images. Existing work on DocVQA considers only single-page documents. In real scenarios, however, documents usually consist of multiple pages that should be processed together. In this work we extend DocVQA to the multi-page scenario. To this end, we first create a new dataset, MP-DocVQA, in which questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods in processing long multi-page documents. The proposed method uses a hierarchical transformer architecture in which the encoder summarizes the most relevant information of every page, and the decoder then uses this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method can, in a single stage, answer the questions and identify the page that contains the information needed to find the answer, which can serve as a form of explainability.
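The hierarchical data flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `[SUMMARY]` placeholder slots, the function names, the `num_summary` parameter, and the toy encoder are all assumptions made for illustration. The sketch only shows why the decoder input grows with the number of pages rather than with the total number of tokens.

```python
from typing import Callable, List

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]


def summarize_page(question: List[str], page: List[str],
                   encoder: Encoder, num_summary: int) -> List[Vector]:
    """Encode one page together with the question and keep only the summary states."""
    # Assumption: summary slots are placed at the start of the sequence.
    sequence = ["[SUMMARY]"] * num_summary + question + page
    hidden = encoder(sequence)
    # Only the states at the summary positions are passed on.
    return hidden[:num_summary]


def encode_document(question: List[str], pages: List[List[str]],
                    encoder: Encoder, num_summary: int) -> List[Vector]:
    """Encode every page independently and concatenate the per-page summaries."""
    memory: List[Vector] = []
    for page in pages:
        memory.extend(summarize_page(question, page, encoder, num_summary))
    # A decoder would attend over `memory` to generate the answer.
    return memory


def toy_encoder(tokens: List[str]) -> List[Vector]:
    """Stand-in for a transformer encoder: one small vector per input token."""
    return [[float(len(tok)), float(i)] for i, tok in enumerate(tokens)]


question = ["what", "is", "the", "total"]
pages = [["invoice", "total", "42"], ["terms", "and", "conditions", "apply"]]
memory = encode_document(question, pages, toy_encoder, num_summary=2)
print(len(memory))  # number of vectors the decoder would see
```

With two pages and two summary slots per page, the decoder would receive four vectors regardless of how long each page is. A real encoder would mix information across positions, which the toy encoder here does not do.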


