Contrastive Language-Image Pre-Training with Knowledge Graphs

10/17/2022
by Xuran Pan, et al.

Recent years have witnessed the fast development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks. Nevertheless, existing approaches mainly focus on pre-training with simple image-text pairs, while neglecting the semantic connections between concepts from different modalities. In this paper, we propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model. By introducing knowledge-based objectives into the pre-training process and utilizing different types of knowledge graphs as training data, our model can semantically align vision and language representations with higher quality, and enhance reasoning ability across scenarios and modalities. Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.
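To make the idea concrete, the sketch below shows how a knowledge-based term could be combined with a CLIP-style contrastive objective. This is a minimal illustration, not the authors' implementation: the function names (clip_contrastive_loss, kg_triplet_loss), the TransE-style triplet formulation, the margin, and the 0.5 weighting are all assumptions introduced here for clarity; the abstract does not specify the actual knowledge-based objectives.

```python
# Minimal sketch (assumed, not the paper's code): a CLIP-style contrastive loss
# plus a hypothetical knowledge-graph triplet term that encourages embeddings of
# concepts linked in the graph to stay close.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss over a batch of image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def kg_triplet_loss(head_emb, relation_emb, tail_emb, margin=0.2):
    """Hypothetical knowledge-based term (TransE-style): head + relation should lie
    closer to the true tail than to tails of other triples in the batch."""
    pred = F.normalize(head_emb + relation_emb, dim=-1)
    pos = (pred - F.normalize(tail_emb, dim=-1)).norm(dim=-1)
    neg = (pred - F.normalize(tail_emb.roll(1, dims=0), dim=-1)).norm(dim=-1)
    return F.relu(margin + pos - neg).mean()

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
img, txt = torch.randn(batch, dim), torch.randn(batch, dim)
h, r, t = torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim)
loss = clip_contrastive_loss(img, txt) + 0.5 * kg_triplet_loss(h, r, t)
print(loss.item())
```

In practice the head, relation, and tail embeddings would come from the model's image and text encoders applied to the entities and relations of the knowledge graph, so the triplet term shapes the same representation space that the contrastive term aligns.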

Related research

- 04/27/2023, Retrieval-based Knowledge Augmented Vision Language Pre-training: With recent progress in large-scale vision and language representation l...
- 02/14/2023, UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training: This work presents a unified knowledge protocol, called UKnow, which fac...
- 05/03/2023, Learngene: Inheriting Condensed Knowledge from the Ancestry Model to Descendant Models: During the continuous evolution of one organism's ancestry, its genes ac...
- 09/15/2022, Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge: Medical vision-and-language pre-training (Med-VLP) has received consider...
- 07/26/2022, Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training: Large-scale multi-modal contrastive pre-training has demonstrated great ...
- 08/18/2023, Artificial-Spiking Hierarchical Networks for Vision-Language Representation Learning: With the success of self-supervised learning, multimodal foundation mode...
- 02/11/2023, Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis: Often, deep network models are purely inductive during training and whil...
