Understanding Cross-modal Interactions in V L Models that Generate Scene Descriptions

11/09/2022
by   Michele Cafagna, et al.
0

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.

READ FULL TEXT

page 1

page 3

page 5

page 8

page 14

page 15

page 17

research
09/25/2020

Are scene graphs good enough to improve Image Captioning?

Many top-performing image captioning models rely solely on object featur...
research
01/12/2023

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Most approaches to cross-modal retrieval (CMR) focus either on object-ce...
research
09/15/2021

What Vision-Language Models `See' when they See Scenes

Images can be described in terms of the objects they contain, or in term...
research
06/09/2023

Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

What constitutes the "vibe" of a particular scene? What should one find ...
research
05/28/2010

Using Soft Constraints To Learn Semantic Models Of Descriptions Of Shapes

The contribution of this paper is to provide a semantic model (using sof...
research
03/16/2016

DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding

While deep neural networks have led to human-level performance on comput...
research
02/13/2023

Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

Existing language and vision models achieve impressive performance in im...

Please sign up or login with your details

Forgot password? Click here to reset