Replacement as a Self-supervision for Fine-grained Vision-language Pre-training

03/09/2023
by Lisai Zhang, et al.

Fine-grained supervision based on object annotations has been widely used for vision-language pre-training (VLP). In real-world scenarios, however, aligned multi-modal data usually comes in the image-caption format, which provides only coarse-grained supervision, and it is expensive to collect object annotations or to build object-annotation pre-extractors for different scenarios. In this paper, we propose a fine-grained self-supervision signal that requires no object annotations, derived from a replacement perspective. First, we propose a homonym sentence rewriting (HSR) algorithm that provides token-level supervision by replacing a verb, noun, adjective, or quantifier in the caption with one of its homonyms from WordNet. Correspondingly, we propose a replacement vision-language modeling (RVLM) framework to exploit this token-level supervision. Two replacement-based modeling tasks, replaced language contrastive (RLC) and replaced language modeling (RLM), are proposed to learn fine-grained alignment. Extensive experiments on several downstream tasks demonstrate the superior performance of the proposed method.
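To illustrate the flavor of the HSR step, the sketch below replaces one eligible word in a caption and emits token-level labels marking the replaced position. This is not the authors' implementation: the tiny hand-written lexicon stands in for WordNet, and the tokenization and word-selection details are assumptions made purely for illustration.

```python
import random

# Tiny stand-in lexicon mapping a word to replacement candidates; in the
# paper this role is played by homonyms retrieved from WordNet (the words
# and candidate sets here are illustrative assumptions).
LEXICON = {
    "dog":   ["cat", "puppy"],      # noun
    "runs":  ["walks", "jumps"],    # verb
    "brown": ["black", "white"],    # adjective
    "two":   ["three", "four"],     # quantifier
}

def rewrite_caption(caption, rng):
    """Replace one eligible word in the caption.

    Returns the rewritten caption and a token-level label sequence:
    1 at the replaced position, 0 elsewhere. These labels are the kind
    of token-level supervision the RLC/RLM tasks could consume.
    """
    tokens = caption.split()
    candidates = [i for i, t in enumerate(tokens) if t in LEXICON]
    if not candidates:
        # No eligible word: caption is unchanged, all labels are 0.
        return caption, [0] * len(tokens)
    i = rng.choice(candidates)
    tokens[i] = rng.choice(LEXICON[tokens[i]])
    labels = [int(j == i) for j in range(len(tokens))]
    return " ".join(tokens), labels

rng = random.Random(0)
new_caption, labels = rewrite_caption("two brown dog runs fast", rng)
```

Because the replacement candidates never include the original word, exactly one token differs between the original and rewritten caption, giving an unambiguous per-token target for the replaced-language tasks.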


