ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

08/31/2023
by Weihan Wang, et al.

Vision-language pre-training (VLP) methods have blossomed recently; their crucial goal is to jointly learn visual and textual features via a transformer-based architecture, and they have demonstrated promising improvements on a variety of vision-language tasks. Prior arts usually focus on how to align visual and textual features, but strategies for improving the robustness of the model and speeding up model convergence remain insufficiently explored. In this paper, we propose a novel method, ViLTA, comprising two components that further facilitate the model in learning fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method that generates soft labels to enhance the robustness of the model, alleviating the problem of one-hot labels treating synonyms of masked words as negative samples. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of the language input, encouraging the model to learn high-quality representations by increasing the difficulty of the ITM task. By leveraging the above techniques, ViLTA achieves better performance on various vision-language tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of ViLTA and its promising potential for vision-language pre-training.
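The abstract does not give implementation details, but the idea of replacing one-hot MLM targets with distillation-derived soft labels can be illustrated with a minimal NumPy sketch. Everything below is an assumption for illustration: the blending weight `alpha`, the softmax `temperature`, and the function name `soft_label_mlm_loss` are hypothetical, and the "teacher" logits stand in for whatever model produces the soft labels in the cross-distillation scheme.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_label_mlm_loss(student_logits, teacher_logits, hard_targets,
                        alpha=0.5, temperature=2.0):
    """Sketch of an MLM loss with soft labels (hypothetical parameters).

    student_logits: (num_masked, vocab) logits for each masked position
    teacher_logits: (num_masked, vocab) logits used to build soft labels
    hard_targets:   (num_masked,) ground-truth token ids
    alpha:          weight on the one-hot target vs. the soft label
    """
    vocab = student_logits.shape[-1]
    one_hot = np.eye(vocab)[hard_targets]
    # Temperature-scaled teacher distribution: synonyms of the masked
    # word receive probability mass instead of being pure negatives.
    soft = softmax(teacher_logits / temperature)
    targets = alpha * one_hot + (1.0 - alpha) * soft
    log_probs = np.log(softmax(student_logits) + 1e-12)
    # Cross-entropy against the blended target distribution.
    return -(targets * log_probs).sum(axis=-1).mean()
```

With `alpha=1.0` this reduces to the standard one-hot cross-entropy, so the soft-label term can be seen as a smoothing of the usual MLM objective rather than a replacement for it.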


Related research

- 03/14/2021 · SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels. "Vision-language pre-training (VLP) on large-scale image-text pairs has r..."
- 03/08/2023 · Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search. "Text-based Person Search (TPS), is targeted on retrieving pedestrians to..."
- 03/08/2022 · Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting. "Recently, Vision-Language Pre-training (VLP) techniques have greatly ben..."
- 02/09/2022 · Image Difference Captioning with Pre-training and Contrastive Learning. "The Image Difference Captioning (IDC) task aims to describe the visual d..."
- 10/14/2022 · Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training. "Large-scale vision-language pre-trained (VLP) models are prone to halluc..."
- 02/14/2023 · UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training. "This work presents a unified knowledge protocol, called UKnow, which fac..."
- 04/17/2023 · CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction. "Lung nodule malignancy prediction has been enhanced by advanced deep-lea..."
