DisCLIP: Open-Vocabulary Referring Expression Generation

05/30/2023
by Lior Bracha, et al.

Referring Expression Generation (REG) aims to produce textual descriptions that unambiguously identify specific objects within a visual scene. Traditionally, this has been achieved with supervised learning methods, which perform well on specific data distributions but often struggle to generalize to new images and concepts. To address this issue, we present a novel approach for REG, named DisCLIP, short for discriminative CLIP. We build on CLIP, a large-scale visual-semantic model, to guide an LLM toward generating a contextual description of a target concept in an image while avoiding other, distracting concepts. Notably, this optimization happens at inference time and does not require additional training or tuning of learned parameters. We measure the quality of the generated text by evaluating how accurately a receiver model can identify the described object within the scene; to this end, we use a frozen zero-shot comprehension module as a critic of our generated referring expressions. We evaluate DisCLIP on multiple referring-expression benchmarks through human evaluation and show that it significantly outperforms previous methods on out-of-domain datasets. Our results highlight the potential of pre-trained visual-semantic models for generating high-quality contextual descriptions.
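To make the discriminative-scoring idea concrete, the sketch below scores candidate referring expressions by how much better each one matches the target region than the distractor regions under CLIP. This is only an illustrative re-ranking sketch, not the paper's procedure (DisCLIP applies CLIP guidance while the LLM is generating the description, at inference time); the image path, box coordinates, and candidate texts are placeholders.

```python
# Hedged sketch of "discriminative CLIP" scoring: prefer text that matches
# the target crop but not the distractor crops. Placeholder inputs; not the
# paper's guided-decoding method, only the contrastive scoring idea.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_crop(image, box):
    """Encode one bounding-box crop (x0, y0, x1, y1) with CLIP's image tower."""
    crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(crop)
    return feat / feat.norm(dim=-1, keepdim=True)

def discriminative_score(texts, image, target_box, distractor_boxes):
    """Score each text by CLIP similarity to the target minus its best distractor match."""
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(tokens)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    target = encode_crop(image, target_box)                         # (1, d)
    distractors = torch.cat([encode_crop(image, b) for b in distractor_boxes])  # (k, d)

    sim_target = (text_feats @ target.T).squeeze(1)                 # (n,)
    sim_distract = (text_feats @ distractors.T).max(dim=1).values   # (n,)
    return sim_target - sim_distract                                # higher = more discriminative

# Usage with placeholder values:
image = Image.open("scene.jpg")
candidates = ["the man in the red jacket", "the man on the left"]
scores = discriminative_score(candidates, image,
                              target_box=(10, 40, 180, 300),
                              distractor_boxes=[(200, 50, 360, 310)])
best = candidates[int(scores.argmax())]
```

In this sketch the receiver's job is implicit in the score itself; in the paper, a separate frozen zero-shot comprehension model is used as the critic that must relocate the described object.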



Related research

11/07/2015 · Generation and Comprehension of Unambiguous Object Descriptions
We propose a method that can generate an unambiguous description (known ...

01/12/2017 · Comprehension-guided referring expressions
We consider generation and comprehension of natural language referring e...

07/28/2023 · Cross-Modal Concept Learning and Inference for Vision-Language Models
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, est...

05/16/2022 · Referring Expressions with Rational Speech Act Framework: A Probabilistic Approach
This paper focuses on a referring expression generation (REG) task in wh...

05/24/2023 · Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples
NLP tasks are typically defined extensionally through datasets containin...

08/19/2023 · Whether you can locate or not? Interactive Referring Expression Generation
Referring Expression Generation (REG) aims to generate unambiguous Refer...

06/09/2023 · Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions
What constitutes the "vibe" of a particular scene? What should one find ...
