Egocentric Object Manipulation Graphs

06/05/2020
by Eadom Dessalene, et al.

We introduce Egocentric Object Manipulation Graphs (Ego-OMG), a novel representation for activity modeling and anticipation of near-future actions that integrates three components: 1) the semantic temporal structure of activities, 2) short-term dynamics, and 3) representations of appearance. Semantic temporal structure is modeled through a graph, embedded through a Graph Convolutional Network, whose states model characteristics of and relations between hands and objects. These state representations derive from all three levels of abstraction, and span segments delimited by the making and breaking of hand-object contact. Short-term dynamics are modeled in two ways: A) through 3D convolutions, and B) through anticipating the spatiotemporal end points of hand trajectories, where hands come into contact with objects. Appearance is modeled through deep spatiotemporal features produced by existing methods. We note that in Ego-OMG it is simple to swap these appearance features, making Ego-OMG complementary to most existing action anticipation methods. We evaluate Ego-OMG on the EPIC Kitchens Action Anticipation Challenge. The consistency of the egocentric perspective of EPIC Kitchens allows for the utilization of the hand-centric cues upon which Ego-OMG relies. We demonstrate state-of-the-art performance, outperforming all previously published methods by large margins and ranking first on the unseen test set and second on the seen test set of the EPIC Kitchens Action Anticipation Challenge. We attribute the success of Ego-OMG to its modeling of semantic structure captured over long timespans. We evaluate the design choices made through several ablation studies. Code will be released upon acceptance.
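The abstract describes embedding a graph of hand-object contact states through a Graph Convolutional Network. As a minimal, hedged sketch (not the authors' implementation, and with toy dimensions, features, and graph structure that are purely illustrative), one standard GCN propagation layer over such a contact-state graph could look like:

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W).

    This is the standard Kipf & Welling formulation; whether Ego-OMG uses
    exactly this variant is an assumption for illustration.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt             # symmetric normalization
    return np.maximum(a_norm @ features @ weights, 0.0)  # ReLU activation

# Toy graph: 4 contact states chained in time (e.g. grasp, use, release, ...);
# edges link temporally adjacent states. All values here are made up.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
features = np.random.default_rng(0).normal(size=(4, 8))   # 8-d node features
weights = np.random.default_rng(1).normal(size=(8, 16))   # project to 16-d embeddings

embeddings = gcn_layer(adjacency, features, weights)
print(embeddings.shape)  # (4, 16): one embedding per contact state
```

Each propagation step mixes a state's features with those of its graph neighbors, so stacking a few such layers lets each contact-state embedding absorb semantic structure from states further away in time.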

Related research

02/01/2021
Forecasting Action through Contact Representations from First Person Video
Human actions involving hand manipulations are structured according to t...

08/27/2019
Global-Local Temporal Representations For Video Person Re-Identification
This paper proposes the Global-Local Temporal Representation (GLTR) to e...

03/31/2020
Long Short-Term Relation Networks for Video Action Detection
It has been well recognized that modeling human-object or object-object ...

06/12/2019
Recognizing Manipulation Actions from State-Transformations
Manipulation actions transform objects from an initial state into a fina...

03/27/2016
Recurrent Mixture Density Network for Spatiotemporal Visual Attention
In many computer vision tasks, the relevant information to solve the pro...

06/13/2022
Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Recent action recognition models have achieved impressive results by int...

04/16/2021
Spatiotemporal Deformable Models for Long-Term Complex Activity Detection
Long-term complex activity recognition and localisation can be crucial f...
