There is a Time and Place for Reasoning Beyond the Image

03/01/2022
by XingYu Fu, et al.

Images often carry more significance than their pixels alone, because we can infer, associate, and reason with contextual information from other sources to establish a more complete picture. For example, in Figure 1, we can identify the news articles related to the picture through a segment-wise understanding of the signs, the buildings, the crowds, and more. This tells us when and where the image was taken, which helps in subsequent tasks such as evidence retrieval for criminal activities, automatic storyline construction, and upper-stream processing such as image clustering. In this work, we formulate this problem and introduce TARA: a dataset of 16k images with their associated news, time, and location, automatically extracted from the New York Times (NYT), plus an additional 61k examples from WIT as distant supervision. On top of the extractions, we present a crowdsourced subset of images whose spatio-temporal information is believed to be feasible to find, for evaluation purposes. We show that there exists a 70% gap between a state-of-the-art joint model and human performance, which is slightly filled by our proposed model that uses segment-wise reasoning, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge.

research
02/02/2023

QR-CLIP: Introducing Explicit Open-World Knowledge for Location and Time Reasoning

Daily images may convey abstract meanings that require us to memorize an...
research
07/12/2023

Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning

Vision-Language Models (VLMs) are expected to be capable of reasoning wi...
research
04/27/2022

TimeBERT: Enhancing Pre-Trained Language Representations with Temporal Information

Time is an important aspect of text documents, which has been widely exp...
research
04/02/2019

Good News, Everyone! Context driven entity-aware captioning for news images

Current image captioning systems perform at a merely descriptive level, ...
research
11/14/2016

Lost in Space: Geolocation in Event Data

Extracting the "correct" location information from text data, i.e., dete...
research
11/20/2021

An End-to-End Framework for Dynamic Crime Profiling of Places

Much effort is being made to ensure the safety of people. One of the mai...
research
08/21/2020

INSIDE: Steering Spatial Attention with Non-Imaging Information in CNNs

We consider the problem of integrating non-imaging information into segm...
