Knowledge-rich Image Gist Understanding Beyond Literal Meaning

04/18/2019
by   Lydia Weiland, et al.

We investigate the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or in news articles. To this end, we propose a methodology to capture the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge that has previously been shown to be highly effective for text understanding. Our method identifies the connotation of objects beyond their denotation: whereas most approaches to image understanding focus on the denotation of objects, i.e., their literal meaning, our work addresses the identification of connotations, i.e., iconic meanings of objects, to understand the message of images. We view image understanding as the task of representing an image-caption pair on the basis of a wide-coverage vocabulary of concepts, such as the one provided by Wikipedia, and cast gist detection as a concept-ranking problem with image-caption pairs as queries. To enable a thorough investigation of the problem of gist understanding, we produce a gold standard of over 300 image-caption pairs and over 8,000 gist annotations covering a wide variety of topics at different levels of abstraction. We use this dataset to experimentally benchmark the contribution of signals from heterogeneous sources, namely image and text. The best result, a Mean Average Precision (MAP) of 0.69, indicates that by combining both dimensions we are able to better understand the meaning of our image-caption pairs than when using language or vision information alone. We test the robustness of our gist detection approach when receiving automatically generated input, i.e., automatically generated image tags or captions, and demonstrate the feasibility of an end-to-end automated process.
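Since the abstract casts gist detection as a concept-ranking problem evaluated with Mean Average Precision, the sketch below shows how MAP is computed in this setting. This is an illustrative toy example, not the authors' code: the concept names and queries are made up, and the gold sets stand in for the paper's gist annotations.

```python
# Gist detection as concept ranking: each image-caption pair is a "query"
# for which a system returns Wikipedia-style concepts ranked by relevance.
# MAP averages per-query Average Precision over all queries.

def average_precision(ranked_concepts, gold_gists):
    """AP for one query: mean of precision@k taken at each relevant hit,
    normalized by the number of gold gist concepts."""
    hits = 0
    precisions = []
    for k, concept in enumerate(ranked_concepts, start=1):
        if concept in gold_gists:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(gold_gists) if gold_gists else 0.0

def mean_average_precision(rankings, gold):
    """MAP over all image-caption queries."""
    return sum(average_precision(rankings[q], gold[q]) for q in rankings) / len(rankings)

# Hypothetical toy data (concepts and queries are invented for illustration):
rankings = {"img1": ["Peace", "Dove", "War"], "img2": ["Economy", "Growth"]}
gold = {"img1": {"Dove", "War"}, "img2": {"Economy", "Growth"}}
print(round(mean_average_precision(rankings, gold), 2))  # → 0.79
```

For "img1" the relevant concepts are found at ranks 2 and 3, giving AP = (1/2 + 2/3) / 2 ≈ 0.58; "img2" is ranked perfectly (AP = 1.0), so MAP ≈ 0.79.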
