ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

09/18/2022
by   Wenjin Wang, et al.

Recent multimodal Transformers have improved Visually Rich Document Understanding (VrDU) by incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, which makes it hard for them to learn from coarse-grained elements, including natural lexical units such as phrases and salient visual regions such as prominent image regions. In this paper, we attach more importance to coarse-grained elements, which contain high-density information and consistent semantics and are valuable for document understanding. First, we propose a document graph to model the complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, based on this graph, we propose a multi-grained multimodal Transformer called mmLayout that incorporates coarse-grained information into existing pre-trained fine-grained multimodal Transformers. In mmLayout, coarse-grained information is aggregated from fine-grained information and, after further processing, fused back into the fine-grained information for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method improves the performance of multimodal Transformers based on fine-grained elements and achieves better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.
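
The aggregate-then-fuse idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `aggregate_and_fuse`, the use of mean pooling for aggregation, and the additive fusion step are all assumptions chosen for illustration, and the "further processing" of coarse-grained features is left as an identity step.

```python
import numpy as np

def aggregate_and_fuse(fine, group_ids):
    """Hypothetical sketch of coarse-grained aggregation and fusion.

    fine:      (n_tokens, dim) fine-grained embeddings (e.g. words or patches)
    group_ids: (n_tokens,) index of the coarse-grained element (e.g. phrase
               or region) that each fine-grained element belongs to
    """
    fine = np.asarray(fine, dtype=float)
    group_ids = np.asarray(group_ids)
    n_groups = int(group_ids.max()) + 1

    # Aggregate: sum the member embeddings of each group, then divide by the
    # group size (mean pooling is an assumption, not taken from the paper).
    sums = np.zeros((n_groups, fine.shape[1]))
    np.add.at(sums, group_ids, fine)
    counts = np.bincount(group_ids, minlength=n_groups).reshape(-1, 1)
    coarse = sums / np.maximum(counts, 1)

    # "Further processing" of coarse features would happen here; this sketch
    # leaves them unchanged.

    # Fuse back: add each group's coarse embedding to its members
    # (additive fusion is likewise an assumption).
    fused = fine + coarse[group_ids]
    return coarse, fused

# Three tokens: the first two form one phrase, the third stands alone.
coarse, fused = aggregate_and_fuse([[1, 0], [3, 0], [0, 2]], [0, 0, 1])
```

In this toy example the two-token phrase gets the averaged embedding `[2, 0]`, which is then added back to both of its tokens, so each token carries information about the phrase it belongs to.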

research
08/27/2020

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Pre-trained language models such as BERT have exhibited remarkable perfo...
research
10/09/2022

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Multimodal representation learning has shown promising improvements on v...
research
01/14/2020

Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features

Text contained in an image carries high-level semantics that can be expl...
research
05/28/2020

Combining Fine- and Coarse-Grained Classifiers for Diabetic Retinopathy Detection

Visual artefacts of early diabetic retinopathy in retinal fundus images ...
research
09/10/2021

Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering

Video question answering (VideoQA) is challenging given its multimodal c...
research
05/06/2022

Fine-grained Intent Classification in the Legal Domain

A law practitioner has to go through a lot of long legal case proceeding...
research
03/05/2021

Fine-Grained Off-Road Semantic Segmentation and Mapping via Contrastive Learning

Road detection or traversability analysis has been a key technique for a...
