ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph

06/30/2020
by   Fei Yu, et al.

We propose a knowledge-enhanced approach, ERNIE-ViL, to learn joint representations of vision and language. ERNIE-ViL aims to construct detailed semantic connections (objects, attributes of objects, and relationships between objects in visual scenes) across vision and language, which are essential to vision-language cross-modal tasks. Incorporating knowledge from scene graphs, ERNIE-ViL constructs Scene Graph Prediction tasks, i.e., Object Prediction, Attribute Prediction, and Relationship Prediction, in the pre-training phase. More specifically, these prediction tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. Thus, ERNIE-ViL can learn joint representations that characterize the alignment of detailed semantics across vision and language. Pre-trained on two large image-text alignment datasets (Conceptual Captions and SBU), ERNIE-ViL learns better and more robust joint representations. After fine-tuning, it achieves state-of-the-art performance on 5 vision-language downstream tasks. Furthermore, it ranked first on the VCR leaderboard with an absolute improvement of 3.7%.
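To make the Scene Graph Prediction objective concrete, the sketch below shows one way a caption's scene graph could drive masking for the three prediction tasks: token spans corresponding to object, attribute, and relationship nodes are replaced with a mask token and recorded as targets for the model to predict. This is an illustrative assumption, not the authors' released implementation; the helper name `mask_scene_graph_nodes`, the span-based scene-graph format, and the `mask_prob` value are all hypothetical.

```python
# Illustrative sketch: scene-graph-driven masking for Object, Attribute,
# and Relationship Prediction. Assumes a scene graph already parsed from
# the caption, represented as token-index spans per node type.
import random
from typing import Dict, List, Tuple

MASK_TOKEN = "[MASK]"

def mask_scene_graph_nodes(
    tokens: List[str],
    scene_graph: Dict[str, List[Tuple[int, int]]],  # node type -> (start, end) token spans
    mask_prob: float = 0.3,
) -> Tuple[List[str], Dict[int, str]]:
    """Replace a fraction of scene-graph node spans with [MASK] and record
    the original tokens as prediction targets, keyed by token position."""
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for node_type in ("objects", "attributes", "relationships"):
        for start, end in scene_graph.get(node_type, []):
            if random.random() < mask_prob:
                for i in range(start, end):
                    targets[i] = masked[i]
                    masked[i] = MASK_TOKEN
    return masked, targets

# Example caption: "a black dog chasing a frisbee"
tokens = ["a", "black", "dog", "chasing", "a", "frisbee"]
scene_graph = {
    "objects": [(2, 3), (5, 6)],   # dog, frisbee
    "attributes": [(1, 2)],        # black
    "relationships": [(3, 4)],     # chasing
}
masked_tokens, targets = mask_scene_graph_nodes(tokens, scene_graph)
print(masked_tokens, targets)
```

In this setup the masked positions can only be recovered by consulting the paired image, which is what encourages the model to align fine-grained visual semantics (objects, their attributes, and their relationships) with the text rather than relying on language statistics alone.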

