SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition

04/10/2022
by   Victor Escorcia, et al.

Learning an egocentric action recognition model from video data is challenging because of distractors in the background (e.g., irrelevant objects). Integrating object information into an action model is therefore beneficial. Existing methods often rely on a generic object detector to identify and represent the objects in the scene, but two important issues remain. First, good-quality object class annotations for the target domain (dataset) are still required to learn a good object representation. Second, previous methods are tightly coupled to existing action models and must retrain them jointly with the object representation, which makes integration costly and inflexible. To overcome both limitations, we introduce Self-Supervised Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact (OIC) representation model from video object regions detected by an off-the-shelf hand-object contact detector. Instead of augmenting object regions individually as in conventional self-supervised learning, we view the action process as a source of natural data transformations with unique spatio-temporal continuity and exploit the inherent relationships among per-video object sets. Extensive experiments on two datasets, EPIC-KITCHENS-100 and EGTEA, show that our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
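The idea of learning from per-video object sets, rather than from augmented copies of a single region, can be illustrated with a small sketch. The abstract does not specify the training objective, so the code below is only an assumption: it uses a generic supervised-contrastive-style loss in which object crops detected in the same video act as positives for one another and crops from other videos act as negatives. The function names `l2_normalize` and `set_contrastive_loss`, the `temperature` parameter, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def set_contrastive_loss(embeddings, video_ids, temperature=0.1):
    """Illustrative set-level contrastive loss (an assumption, not the paper's loss).

    embeddings: list of feature vectors, one per detected object crop.
    video_ids:  list of identifiers; crops sharing an id form one per-video set.
    Crops from the same video are treated as positives, all others as negatives.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        # Temperature-scaled cosine similarity between anchor i and every other crop.
        logits = {
            k: sum(a * b for a, b in zip(z[i], z[k])) / temperature
            for k in range(n)
            if k != i
        }
        positives = [k for k in logits if video_ids[k] == video_ids[i]]
        if not positives:
            continue  # an anchor with no same-video partner contributes nothing
        # Numerically stable log-sum-exp over all non-anchor crops.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability of picking a same-video crop.
        losses.append(-sum(logits[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: two crops per video, with crops from the same video pointing the same way.
crops = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = set_contrastive_loss(crops, ["a", "a", "b", "b"])
mismatched = set_contrastive_loss(crops, ["a", "b", "a", "b"])
print(aligned < mismatched)
```

In this toy setting the loss is lower when the set membership agrees with the geometry of the embeddings, which is the behaviour a set-based objective is meant to encourage. A real pipeline would obtain the crops from a hand-object contact detector and the embeddings from a trainable network, neither of which is modelled here.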

research
09/10/2019

Video Representation Learning by Dense Predictive Coding

The objective of this paper is self-supervised learning of spatio-tempor...
research
07/08/2021

Video 3D Sampling for Self-supervised Representation Learning

Most of the existing video self-supervised methods mainly leverage tempo...
research
01/02/2020

Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

We propose a novel self-supervised method, referred to as Video Cloze Pr...
research
03/05/2020

Self-Supervised Spatio-Temporal Representation Learning Using Variable Playback Speed Prediction

We propose a self-supervised learning method by predicting the variable ...
research
08/05/2020

Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition

Self-supervised learning has shown great potentials in improving the dee...
research
08/03/2020

Memory-augmented Dense Predictive Coding for Video Representation Learning

The objective of this paper is self-supervised learning from video, in p...
research
12/21/2015

Harnessing the Deep Net Object Models for Enhancing Human Action Recognition

In this study, the influence of objects is investigated in the scenario ...
