GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

05/24/2023
by Woojeong Jin, et al.

Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language (VL) tasks, including grounding and generation tasks, has been under-explored; existing few-shot VL models struggle to handle tasks that involve object grounding or multiple images, such as visual commonsense reasoning or NLVR2. In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that generalizes to diverse tasks, including visual question answering, captioning, and grounding, with zero or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.
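The abstract's key idea is pre-training on object-text alignments, i.e., pairing image regions with the phrases that describe them. The paper's exact objective is not given here, so the following is only a minimal illustrative sketch of one common way such alignment is learned: a symmetric-free InfoNCE-style contrastive loss that pushes each region embedding toward its paired phrase embedding and away from the others. All names (`region_text_alignment_loss`, the toy 2-D embeddings) are hypothetical, not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def region_text_alignment_loss(region_embs, phrase_embs, temperature=0.1):
    """InfoNCE over region-phrase pairs: the i-th region is the positive
    for the i-th phrase; all other phrases act as in-batch negatives."""
    n = len(region_embs)
    total = 0.0
    for i in range(n):
        # Temperature-scaled similarities of region i to every phrase.
        sims = [cosine(region_embs[i], phrase_embs[j]) / temperature
                for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(sims)
        logsum = m + math.log(sum(math.exp(s - m) for s in sims))
        # Cross-entropy with the matching phrase (index i) as the target.
        total += logsum - sims[i]
    return total / n
```

With toy embeddings, correctly paired regions and phrases yield a much lower loss than shuffled pairs, which is the signal a grounding-aware pre-training objective exploits; a real model would produce these embeddings with a vision encoder and a text encoder rather than hand-written vectors.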

