Incorporating Structured Representations into Pretrained Vision Language Models Using Scene Graphs

05/10/2023
by Roei Herzig, et al.

Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, recent studies have shown that even the best VL models struggle to capture aspects of scene understanding, such as object attributes, relationships, and action states. At the same time, obtaining structured annotations that could improve these models, such as scene graphs (SGs), is time-consuming, costly, and tedious, and thus cannot be done at a large scale. Here we ask: can small datasets containing SG annotations provide sufficient information to enhance the structured understanding of VL models? We show that it is indeed possible to improve VL models using such data by utilizing a specialized model architecture and a new training paradigm. Our approach captures structure-related information for both the visual and textual encoders by directly supervising both components when learning from SG labels. We use scene graph supervision to generate fine-grained captions based on various graph augmentations highlighting different compositional aspects of the scene, and to predict SG information using an open vocabulary approach by adding special “Adaptive SG tokens” to the visual encoder. Moreover, we design a new adaptation technique tailored specifically to the SG tokens that allows better learning of the graph prediction task while still maintaining zero-shot capabilities. Our model shows strong performance improvements on the Winoground and VL-checklist datasets with only a mild degradation in zero-shot performance.
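
As a rough illustration of what generating captions from graph augmentations could look like, the sketch below turns a tiny scene graph into a caption and builds a contrasting caption by swapping attributes between two objects. It is not the authors' implementation: the `SceneGraph` class, the helper names, the caption template, and the choice of attribute swapping as the augmentation are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SceneGraph:
    """Hypothetical minimal scene graph (not the paper's data format)."""
    objects: dict    # object id -> (name, list of attributes)
    relations: list  # (subject id, predicate, object id) triples


def describe(graph, obj_id):
    """Render one object as 'attr1 attr2 name'."""
    name, attrs = graph.objects[obj_id]
    return " ".join(list(attrs) + [name])


def graph_to_caption(graph):
    """Join every relation triple into a simple templated caption."""
    parts = [
        f"{describe(graph, s)} {pred} {describe(graph, o)}"
        for s, pred, o in graph.relations
    ]
    return " and ".join(parts)


def swap_attributes(graph, id_a, id_b):
    """Assumed augmentation: exchange the attributes of two objects,
    yielding a graph that no longer matches the image."""
    objects = dict(graph.objects)
    name_a, attrs_a = objects[id_a]
    name_b, attrs_b = objects[id_b]
    objects[id_a] = (name_a, attrs_b)
    objects[id_b] = (name_b, attrs_a)
    return SceneGraph(objects=objects, relations=list(graph.relations))


graph = SceneGraph(
    objects={0: ("dog", ["brown"]), 1: ("couch", ["white"])},
    relations=[(0, "sitting on", 1)],
)
positive = graph_to_caption(graph)
negative = graph_to_caption(swap_attributes(graph, 0, 1))
print(positive)
print(negative)
```

In a contrastive setup, a caption such as `positive` would be paired with the image and `negative` would serve as a hard negative that differs only in how attributes are bound to objects; how the paper actually constructs and uses its captions is described in the full text.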


research
11/21/2022

Teaching Structured Vision Language Concepts to Vision Language Models

Vision and Language (VL) models have demonstrated remarkable zero-shot p...
research
03/30/2023

Going Beyond Nouns With Vision Language Models Using Synthetic Data

Large-scale pre-trained Vision Language (VL) models have shown remar...
research
06/02/2023

Unifying (Machine) Vision via Counterfactual World Modeling

Leading approaches in machine vision employ different architectures for ...
research
03/03/2021

Energy-Based Learning for Scene Graph Generation

Traditional scene graph generation methods are trained using cross-entro...
research
07/11/2020

Generative Graph Perturbations for Scene Graph Prediction

Inferring objects and their relationships from an image is useful in man...
research
05/17/2020

Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation

Scene graph generation (SGG) aims to predict graph-structured descriptio...
research
05/27/2023

FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

Textual scene graph parsing has become increasingly important in various...
