
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

11/09/2022
by Michele Cafagna, et al.
University of Malta
Utrecht University

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.

