DiT: Self-supervised Pre-training for Document Image Transformer

03/04/2022
by Junlong Li, et al.

Image Transformers have recently achieved significant progress in natural image understanding, using either supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model that uses large-scale unlabeled text images for Document AI tasks. Self-supervision is essential here because no supervised counterparts exist, owing to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, and table detection. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), and table detection (94.23 → 96.55). The code and pre-trained models are publicly available at <https://aka.ms/msdit>.
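
As a small worked example, the absolute gains implied by the three before/after pairs quoted above can be computed directly. This is only an arithmetic sketch: the names `reported` and `absolute_gains` are illustrative and not part of the DiT code base, and the abstract does not name the metric behind each score, so the metrics are left unspecified here.

```python
# Before/after scores exactly as quoted in the abstract (metric names are not given there).
reported = {
    "document image classification": (91.11, 92.69),
    "document layout analysis": (91.0, 94.9),
    "table detection": (94.23, 96.55),
}


def absolute_gains(scores):
    """Return the absolute improvement (after - before) per task, rounded to 2 decimals."""
    return {task: round(after - before, 2) for task, (before, after) in scores.items()}


gains = absolute_gains(reported)
for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

On these figures the largest absolute gain is on document layout analysis and the smallest on document image classification.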

