KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

09/22/2021
by Yongfei Liu, et al.

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy that relies on external object detectors to encode images in a multi-modal Transformer framework, which suffers from a restrictive object concept space, limited image context, and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework that directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve this, we design two novel pretext tasks that take object features and their semantic labels from external detectors as supervision: 1) an object-guided masked vision modeling task that enforces object-aware representation learning in the multi-modal Transformer; 2) a phrase-region alignment task that improves cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of the proposed framework, which achieves competitive or superior performance compared with existing pretraining strategies. The code is available in the supplementary materials.
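The abstract only outlines the two distillation-driven pretext tasks, so the following PyTorch snippet is a minimal sketch of how such losses could be formed, not the paper's actual implementation. The function names, tensor shapes, the smooth-L1 regression toward detector features, and the KL divergence against label-space similarities are all assumptions made for illustration.

```python
# Illustrative sketch of the two pretext losses described in the abstract.
# All names, shapes, and loss forms are assumptions; the paper may differ.
import torch
import torch.nn.functional as F

def object_guided_masked_vision_loss(grid_feats, object_feats, mask):
    """Regress masked grid features toward external-detector object features.

    grid_feats:   (B, N, D) grid features from the multi-modal Transformer
    object_feats: (B, N, D) detector RoI features pooled onto the same grid cells
    mask:         (B, N)    1 where the cell overlaps a detected object and was masked
    """
    diff = F.smooth_l1_loss(grid_feats, object_feats, reduction="none").mean(-1)
    return (diff * mask).sum() / mask.sum().clamp(min=1)

def phrase_region_alignment_loss(phrase_emb, region_emb, phrase_label_sim, temperature=0.07):
    """Align noun phrases with image regions using label-space similarity as soft targets.

    phrase_emb:       (P, D) noun-phrase representations from the Transformer
    region_emb:       (R, D) region representations from the Transformer
    phrase_label_sim: (P, R) phrase-to-object-label similarity computed in the
                      linguistic space (e.g., word-embedding cosine similarity)
    """
    logits = phrase_emb @ region_emb.t() / temperature            # predicted alignment
    targets = F.softmax(phrase_label_sim / temperature, dim=-1)   # soft supervision
    return F.kl_div(F.log_softmax(logits, dim=-1), targets, reduction="batchmean")

# Toy usage with random tensors.
B, N, D, P, R = 2, 16, 8, 3, 16
loss_mvm = object_guided_masked_vision_loss(
    torch.randn(B, N, D), torch.randn(B, N, D), torch.randint(0, 2, (B, N)).float())
loss_pra = phrase_region_alignment_loss(
    torch.randn(P, D), torch.randn(R, D), torch.randn(P, R))
print(loss_mvm.item(), loss_pra.item())
```

In this reading, the detector contributes only training signals (regression targets for masked cells and soft phrase-region targets), which is consistent with the abstract's claim that the framework itself is end-to-end and detector-free at inference.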


