Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

02/13/2023
by Henrik Voigt, et al.

Existing language and vision models achieve impressive performance in image-text understanding. Yet it remains an open question to what extent they can be used for language understanding in 3D environments and whether they implicitly acquire 3D object knowledge, e.g., about different views of an object. In this paper, we investigate whether a state-of-the-art language and vision model, CLIP, can ground perspective descriptions of a 3D object and identify canonical views of common objects based on text queries. We present an evaluation framework that circles a camera around a 3D object to generate images from different viewpoints and evaluates them in terms of their similarity to natural language descriptions. We find that a pre-trained CLIP model performs poorly on most canonical views and that fine-tuning with hard negative sampling and random contrasting yields good results even when little training data is available.
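The evaluation idea described in the abstract (sample viewpoints on a circle around an object, then rank them by similarity to a text query) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the camera radius and height, and the two-dimensional toy vectors standing in for CLIP image and text embeddings are all assumptions made for illustration.

```python
import math


def circle_camera_positions(n_views, radius=2.0, height=0.5):
    """Return (azimuth_deg, (x, y, z)) for n_views cameras evenly spaced
    on a circle around an object assumed to sit at the origin."""
    views = []
    for i in range(n_views):
        azimuth = 360.0 * i / n_views
        theta = math.radians(azimuth)
        position = (radius * math.cos(theta), radius * math.sin(theta), height)
        views.append((azimuth, position))
    return views


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_view(text_embedding, view_embeddings):
    """Return the index of the view most similar to the text, plus all scores."""
    scores = [cosine_similarity(text_embedding, v) for v in view_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Four viewpoints at 0, 90, 180 and 270 degrees.
views = circle_camera_positions(4)

# Hypothetical placeholders: in a real setup these would be embeddings of the
# rendered images and of the text query produced by a vision-language model.
view_embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
text_embedding = [0.1, 0.9]

idx, scores = best_view(text_embedding, view_embeddings)
print("best azimuth (degrees):", views[idx][0])
```

In a real experiment the toy vectors would be replaced by model embeddings of each rendered view and of the viewpoint description, and the selected azimuth would be compared against the annotated canonical view.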

research
08/16/2023

Painter: Teaching Auto-regressive Language Models to Draw Sketches

Large language models (LLMs) have made tremendous progress in natural la...
research
05/12/2023

ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4

In recent years, large language models (LLMs) have made significant prog...
research
05/30/2019

Grounding Language Attributes to Objects using Bayesian Eigenobjects

We develop a system to disambiguate objects based on simple physical des...
research
11/09/2022

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Image captioning models tend to describe images in an object-centric way...
research
09/13/2023

Language-Conditioned Observation Models for Visual Object Search

Object search is a challenging task because when given complex language ...
research
02/23/2018

Interpretable Charge Predictions for Criminal Cases: Learning to Generate Court Views from Fact Descriptions

In this paper, we propose to study the problem of COURT VIEW GENeration ...
research
06/19/2023

Generating Parametric BRDFs from Natural Language Descriptions

Artistic authoring of 3D environments is a laborious enterprise that als...