DUBLIN – Document Understanding By Language-Image Network

05/23/2023
by   Kriti Aggarwal, et al.
0

Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images. Existing models often rely on manual feature engineering or domain-specific pipelines, which limit their generalization ability across different document types and languages. In this paper, we propose DUBLIN, which is pretrained on web pages using three novel objectives: Masked Document Content Generation Task, Bounding Box Task, and Rendered Question Answering Task, that leverage both the spatial and semantic information in the document images. Our model achieves competitive or state-of-the-art results on several benchmarks, such as Web-Based Structural Reading Comprehension, Document Visual Question Answering, Key Information Extraction, Diagram Understanding, and Table Question Answering. In particular, we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms the current pixel-based SoTA models on DocVQA and AI2D datasets by 2 respectively. Also, DUBLIN is the first ever pixel-based model which achieves comparable performance to text-based SoTA methods on XFUND dataset for Semantic Entity Recognition showcasing its multilingual capability. Moreover, we create new baselines for text-based datasets by rendering them as document images to promote research in this direction.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/13/2023

PDFVQA: A New Dataset for Real-World VQA on PDF Documents

Document-based Visual Question Answering examines the document understan...
research
07/01/2020

DocVQA: A Dataset for VQA on Document Images

We present a new dataset for Visual Question Answering on document image...
research
08/20/2020

Document Visual Question Answering Challenge 2020

This paper presents results of Document Visual Question Answering Challe...
research
05/24/2023

Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Advances in Large Language Models (LLMs) have inspired a surge of resear...
research
04/21/2023

Information Extraction from Documents: Question Answering vs Token Classification in real-world setups

Research in Document Intelligence and especially in Document Key Informa...
research
07/09/2021

Form2Seq : A Framework for Higher-Order Form Structure Extraction

Document structure extraction has been a widely researched area for deca...
research
04/28/2023

Information Redundancy and Biases in Public Document Information Extraction Benchmarks

Advances in the Visually-rich Document Understanding (VrDU) field and pa...

Please sign up or login with your details

Forgot password? Click here to reset