ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos

05/25/2021
by Meng-Jiun Chiou, et al.

Detecting human-object interactions (HOI) is an important step toward comprehensive visual understanding by machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, even humans are unlikely to correctly guess temporal HOIs (e.g., opening/closing a door) from a single video frame; for these interactions, the neighboring frames play an essential role. Nevertheless, conventional HOI methods that operate only on static images have been used to predict temporal interactions, which amounts to guessing without temporal context and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI), which utilizes temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI, on which our proposed approach serves as a solid baseline.
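
The abstract does not spell out the feature-inconsistency issue. One plausible reading is that pooling every frame of a clip with the keyframe's bounding box misses a person or object that moves, whereas pooling along the trajectory keeps the features aligned. The toy sketch below illustrates only that reading; the feature maps, the boxes, and the helpers `make_frame` and `mean_pool` are hypothetical and are not taken from the paper or its code.

```python
# Toy illustration (hypothetical, not the authors' implementation):
# compare pooling with a fixed keyframe box against pooling along a trajectory.

def make_frame(height, width, box):
    """Return a 2D feature map that is 1.0 inside `box` and 0.0 elsewhere."""
    x0, y0, x1, y1 = box
    return [[1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
             for x in range(width)]
            for y in range(height)]

def mean_pool(frame, box):
    """Average the feature values inside `box` = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    values = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)

# A 2x2 "object" that moves two pixels to the right in every frame.
trajectory = [(0, 2, 2, 4), (2, 2, 4, 4), (4, 2, 6, 4), (6, 2, 8, 4), (8, 2, 10, 4)]
frames = [make_frame(6, 10, box) for box in trajectory]

# Naive variant: reuse the keyframe (middle frame) box for every frame.
keyframe_box = trajectory[len(trajectory) // 2]
naive = [mean_pool(frame, keyframe_box) for frame in frames]

# Trajectory-aligned variant: pool each frame with its own box.
aligned = [mean_pool(frame, box) for frame, box in zip(frames, trajectory)]

print("keyframe box :", naive)
print("per-frame box:", aligned)
```

In this toy setting the fixed box captures the object only in the keyframe, while the per-frame boxes capture it in every frame, which is the kind of mismatch that trajectory-based pooling is meant to avoid.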

research
09/13/2023

STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

Understanding relations between objects is crucial for understanding the...
research
11/08/2019

Extracting temporal features into a spatial domain using autoencoders for sperm video analysis

In this paper, we present a two-step deep learning method that is used t...
research
06/29/2021

Spatio-Temporal Context for Action Detection

Research in action detection has grown in recent years, as it plays a...
research
11/19/2020

Towards Spatio-Temporal Video Scene Text Detection via Temporal Clustering

With only bounding-box annotations in the spatial domain, existing video...
research
10/25/2021

Where were my keys? – Aggregating Spatial-Temporal Instances of Objects for Efficient Retrieval over Long Periods of Time

Robots equipped with situational awareness can help humans efficiently f...
research
05/03/2019

DeepSignals: Predicting Intent of Drivers Through Visual Signals

Detecting the intention of drivers is an essential task in self-driving,...
research
01/30/2020

Visual Exploration of Movement Relatedness for Multi-species Ecology Analysis

Advances in GPS telemetry technology have enabled analysis of animal mov...
