
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

by Michele Cafagna, et al.
University of Malta
Utrecht University

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image than when it generates object-centric descriptions. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.




Are scene graphs good enough to improve Image Captioning?

Many top-performing image captioning models rely solely on object featur...

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study

Most approaches to cross-modal retrieval (CMR) focus either on object-ce...

What Vision-Language Models 'See' when they See Scenes

Images can be described in terms of the objects they contain, or in term...

Using Soft Constraints To Learn Semantic Models Of Descriptions Of Shapes

The contribution of this paper is to provide a semantic model (using sof...

Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions

Existing language and vision models achieve impressive performance in im...

DeepContext: Context-Encoding Neural Pathways for 3D Holistic Scene Understanding

While deep neural networks have led to human-level performance on comput...

Paying Attention to Descriptions Generated by Image Captioning Models

To bridge the gap between humans and machines in image understanding and...