Ego-Only: Egocentric Action Detection without Exocentric Pretraining

01/03/2023
by Huiyu Wang, et al.

We present Ego-Only, the first training pipeline that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) pretraining. Previous approaches found that egocentric models cannot be trained effectively from scratch and that exocentric representations transfer well to first-person videos. In this paper we revisit these two observations. Motivated by the large content and appearance gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric pretraining. Our Ego-Only pipeline is simple. It trains the video representation with a masked autoencoder that is then finetuned for temporal segmentation. The learned features are then fed to an off-the-shelf temporal action localization method to detect actions. We evaluate our approach on two established egocentric video datasets: Ego4D and EPIC-Kitchens-100. On Ego4D, Ego-Only is on par with exocentric pretraining methods that use an order of magnitude more labels. On EPIC-Kitchens-100, Ego-Only even outperforms exocentric pretraining (by 2.1


