STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

09/13/2023
by Palaash Agrawal, et al.

Understanding relations between objects is crucial for understanding the semantics of a visual scene. It is also an essential step toward bridging visual and language models. However, current state-of-the-art computer vision models still struggle with spatial reasoning. Existing datasets mostly cover a relatively small number of spatial relations, all of which are static relations that do not intrinsically involve motion. In this paper, we propose the Spatial and Temporal Understanding of Prepositions Dataset (STUPD), a large-scale video dataset for understanding static and dynamic spatial relationships derived from prepositions of the English language. The dataset contains 150K visual depictions (videos and images) covering 30 distinct spatial prepositional senses, in the form of object interaction simulations generated synthetically using Unity3D. In addition to spatial relations, we also provide 50K visual depictions across 10 temporal relations, consisting of videos depicting event/time-point interactions. To our knowledge, no existing dataset represents temporal relations through visual settings. The dataset also includes 3D information about object interactions, such as frame-wise coordinates, and descriptions of the objects used. The goal of this synthetic dataset is to help models perform better at visual relationship detection in real-world settings. We demonstrate that pretraining on STUPD improves the performance of various models on two real-world datasets (ImageNet-VidVRD and SpatialSense), in comparison to other pretraining datasets.


