Going Beyond Nouns With Vision Language Models Using Synthetic Data

03/30/2023
by Paola Cascante-Bonilla, et al.

Large-scale pre-trained Vision Language (VL) models have shown remarkable performance in many applications, enabling the replacement of a fixed set of supported classes with zero-shot open-vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models: for example, their difficulty in understanding Visual Language Concepts (VLC) that go 'beyond nouns', such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or their difficulty in performing compositional reasoning, such as understanding the significance of the order of the words in a sentence. In this work, we investigate to what extent purely synthetic data can be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC), a million-scale synthetic dataset and data generation codebase that allows generating additional suitable data to improve the VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on the VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data, significantly enhancing their VLC understanding (e.g., by 9.9%) without compromising their zero-shot accuracy.


Related research

research, 02/07/2023
Boosting Zero-shot Classification with Synthetic Data Diversity via Stable Diffusion
Recent research has shown it is possible to perform zero-shot classifica...

research, 05/10/2023
Incorporating Structured Representations into Pretrained Vision Language Models Using Scene Graphs
Vision and Language (VL) models have demonstrated remarkable zero-shot p...

research, 11/21/2022
Teaching Structured Vision Language Concepts to Vision Language Models
Vision and Language (VL) models have demonstrated remarkable zero-shot p...

research, 07/23/2022
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
We study open-world 3D scene understanding, a family of tasks that requi...

research, 03/24/2021
Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2
Thinking aloud is an effective meta-cognitive strategy human reasoners a...

research, 08/09/2021
Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic Data
We present a network architecture which compares RGB images and untextur...

research, 07/10/2023
Large Language Models as General Pattern Machines
We observe that pre-trained large language models (LLMs) are capable of ...
