Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

06/27/2022
by   Chuwei Luo, et al.

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Although existing document pre-trained models achieve excellent performance on standard VrDU benchmarks, the way they model and exploit the interactions between vision and language on documents limits their generalization ability and accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU, mainly from the perspective of supervisory signals. Specifically, a pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between the two modalities, learning stronger cross-modal document representations with richer semantics. Benefiting from these informative cross-modal document representations, Bi-VLDoc significantly advances the state-of-the-art performance on three widely-used document understanding benchmarks, including Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%). On Document Visual QA, Bi-VLDoc achieves state-of-the-art performance compared to previous single-model methods.
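
The abstract names a "vision-language hybrid-attention mechanism" that lets the text and vision streams condition on each other. The sketch below is a minimal, hypothetical PyTorch illustration of one way such bidirectional cross-modal attention can be wired: it is not the authors' implementation, and the class name, fusion scheme, and dimensions are assumptions made purely for illustration.

    # Hypothetical sketch of a bidirectional vision-language hybrid-attention block.
    # Not the Bi-VLDoc implementation; names, dimensions, and the additive fusion
    # are illustrative assumptions only.
    import torch
    import torch.nn as nn


    class BiVLHybridAttention(nn.Module):
        """Text tokens attend to visual tokens and vice versa, then each stream
        is fused with its self-attended representation via a residual sum."""

        def __init__(self, dim: int = 768, num_heads: int = 12):
            super().__init__()
            self.text_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.vis_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.text_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.vis_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.text_norm = nn.LayerNorm(dim)
            self.vis_norm = nn.LayerNorm(dim)

        def forward(self, text: torch.Tensor, vision: torch.Tensor):
            # Intra-modal context for each stream.
            t_self, _ = self.text_self(text, text, text)
            v_self, _ = self.vis_self(vision, vision, vision)
            # Cross-modal context: queries from one modality, keys/values from the other.
            t_cross, _ = self.text_to_vis(text, vision, vision)
            v_cross, _ = self.vis_to_text(vision, text, text)
            # Simple additive fusion with residual connections (an assumption).
            text_out = self.text_norm(text + t_self + t_cross)
            vision_out = self.vis_norm(vision + v_self + v_cross)
            return text_out, vision_out


    if __name__ == "__main__":
        text_tokens = torch.randn(2, 128, 768)   # (batch, text length, hidden size)
        visual_tokens = torch.randn(2, 49, 768)  # (batch, image patches, hidden size)
        block = BiVLHybridAttention()
        t, v = block(text_tokens, visual_tokens)
        print(t.shape, v.shape)  # torch.Size([2, 128, 768]) torch.Size([2, 49, 768])

The point of the bidirectional design is that both modalities are updated by the interaction, rather than only injecting visual features into the text stream; how Bi-VLDoc actually combines the two attention paths is specified in the paper itself.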

Related research

Vision-Language Pre-Training for Boosting Scene Text Detectors (04/29/2022)
Recently, vision-language joint representation learning has proven to be...

Cross-Modal Entity Matching for Visually Rich Documents (03/01/2023)
Visually rich documents (VRD) are physical/digital documents that utiliz...

TransferDoc: A Self-Supervised Transferable Document Representation Learning Model Unifying Vision and Language (09/11/2023)
The field of visual document understanding has witnessed a rapid growth ...

Information Redundancy and Biases in Public Document Information Extraction Benchmarks (04/28/2023)
Advances in the Visually-rich Document Understanding (VrDU) field and pa...

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding (04/18/2021)
Multimodal pre-training with text, layout, and image has achieved SOTA p...

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling (09/10/2021)
While large scale pre-training has achieved great achievements in bridgi...

A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval (01/08/2022)
Cross-Modal Retrieval (CMR) is an important research topic across multim...
