VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

05/24/2022
by Souhail Bakkali, et al.

Multimodal learning from document data has achieved great success lately, as it allows models to pre-train semantically meaningful features as a prior for learnable downstream tasks. In this paper, we approach the document classification problem by learning cross-modal representations from language and vision cues, considering both intra- and inter-modality relationships. Instead of merging features from different modalities into a joint representation space, the proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities. The learning objective is devised between intra- and inter-modality alignment tasks, where the similarity distribution per task is computed by pulling positive sample pairs together while simultaneously contrasting negative ones in the common feature representation space. Extensive experiments on public document classification datasets demonstrate the effectiveness and generalization capacity of our model on both small-scale and large-scale datasets.
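The alignment tasks described above follow the general pattern of an InfoNCE-style contrastive objective. The following is a minimal sketch, not the authors' implementation: it assumes L2-normalized embeddings, a fixed temperature, and equal weighting of the intra- and inter-modality terms, and all function and variable names are hypothetical.

import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    # Pull each anchor toward its paired positive (the diagonal of the
    # similarity matrix) and push it away from every other sample in the batch.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

def cross_modal_loss(vis_a, vis_b, txt_a, txt_b):
    # Intra-modality alignment: two views of the same document, aligned
    # separately within the vision modality and within the language modality.
    intra = info_nce(vis_a, vis_b) + info_nce(txt_a, txt_b)
    # Inter-modality alignment: vision and language features of the same
    # document are treated as positive pairs (symmetrized over directions).
    inter = info_nce(vis_a, txt_a) + info_nce(txt_a, vis_a)
    return intra + inter

For a batch of B documents, each term treats the other B-1 samples as negatives, which matches the pull-together/push-apart behavior the abstract describes.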


