Discovering A Variety of Objects in Spatio-Temporal Human-Object Interactions

11/14/2022
by Yong-Lu Li, et al.

Spatio-temporal Human-Object Interaction (ST-HOI) detection aims to detect HOIs in videos, which is crucial for activity understanding. In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items while cleaning. However, existing whole body-object interaction video benchmarks usually provide limited object classes. Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO), which includes 51 interactions and 1,000+ objects. Accordingly, we propose an ST-HOI learning task that expects vision systems to track human actors, detect interactions, and simultaneously discover the interacted objects. Even though today's detectors and trackers excel at object detection and tracking tasks, they perform unsatisfactorily when localizing diverse or unseen objects in DIO. This reveals a limitation of current vision systems and poses a great challenge. We therefore explore how to leverage spatio-temporal cues for object discovery and devise a Hierarchical Probe Network (HPN) that discovers interacted objects using hierarchical spatio-temporal human and context cues. In extensive experiments, HPN demonstrates impressive performance. Data and code are available at https://github.com/DirtyHarryLYL/HAKE-AVA.


