Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

by Chuhan Zhang et al.

We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic labels of the objects, using paired captions when available. At inference time the model only requires RGB frames as input, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learned by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using the learned representations as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes using video consistency, as well as to ground the words in the associated text descriptions. Overall, the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
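The abstract describes a joint training objective: a video-text representation loss augmented with auxiliary predictions of hand/object boxes and object labels. The paper does not give code here, so the following is a minimal NumPy sketch of what such a multi-task objective could look like; all shapes, loss weights, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch (not the authors' code): a video-text contrastive loss
# combined with auxiliary box-regression and object-label terms, mirroring
# the training signals the abstract describes. Weights w_box / w_cls are
# assumed hyperparameters.

def l1_box_loss(pred_boxes, gt_boxes):
    """Mean absolute error over predicted (x1, y1, x2, y2) boxes."""
    return np.abs(pred_boxes - gt_boxes).mean()

def cross_entropy(logits, labels):
    """Softmax cross-entropy over per-object label logits."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss between paired video/text embeddings."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sims = v @ t.T / temperature
    labels = np.arange(len(sims))
    return 0.5 * (cross_entropy(sims, labels) + cross_entropy(sims.T, labels))

def total_loss(video_emb, text_emb, pred_boxes, gt_boxes,
               obj_logits, obj_labels, w_box=1.0, w_cls=1.0):
    """Joint objective: retrieval term plus object-awareness auxiliaries."""
    return (info_nce(video_emb, text_emb)
            + w_box * l1_box_loss(pred_boxes, gt_boxes)
            + w_cls * cross_entropy(obj_logits, obj_labels))
```

Because the auxiliary terms only add supervision during training, dropping them at inference leaves a model that takes RGB frames alone, consistent with the abstract's claim.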




Related Papers

- Is an Object-Centric Video Representation Beneficial for Transfer?
- Generative Video Transformer: Can Objects be the Words?
- Patch-based Object-centric Transformers for Efficient Video Generation
- Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
- Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR
- Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
- Segmenting Moving Objects via an Object-Centric Layered Representation
