Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

08/08/2023
by   Izzeddin Teeti, et al.

The emerging field of action prediction plays a vital role in various computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advancements, accurately predicting future actions remains challenging due to the high dimensionality, complex dynamics, and uncertainties inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' that processes only past frames, and a 'teacher' that processes both past and future frames, giving it a broader temporal context. During training, the teacher guides the student to learn future context while observing only past frames. The strategy is evaluated on the ROAD dataset for the action prediction downstream task using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with our method achieving an average enhancement of 9.9%, strengthening the backbones' ability to capture long-term dependencies. Furthermore, our approach is efficient with respect to both pretraining dataset size and the number of epochs required. The method overcomes limitations of other approaches by supporting various backbone architectures, addressing multiple prediction horizons, reducing reliance on hand-crafted augmentations, and streamlining pretraining into a single stage. These findings highlight the potential of our approach in diverse video-based tasks such as activity recognition, motion planning, and scene understanding.
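The abstract's core idea, a teacher that sees past and future frames distilling into a student that sees only the past, can be sketched as a DINO-style cross-entropy between temperature-scaled distributions. The sketch below is illustrative only: the temperatures, feature dimensions, and function names are assumptions, not the paper's actual implementation, and NumPy stands in for the training framework.

```python
import numpy as np

def softmax(x, temp):
    # temperature-scaled softmax over the last axis (numerically stable)
    z = (x - x.max(axis=-1, keepdims=True)) / temp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temporal_distillation_loss(student_logits, teacher_logits,
                               t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened teacher targets (computed from
    past + future frames) and student predictions (past frames only),
    in the spirit of DINO self-distillation. Temperature values here
    are illustrative, not taken from the paper."""
    p_teacher = softmax(teacher_logits, t_teacher)   # sharper targets
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Toy example: a batch of 4 clip embeddings projected to 8 dimensions.
rng = np.random.default_rng(0)
student_out = rng.normal(size=(4, 8))   # student sees past frames
teacher_out = rng.normal(size=(4, 8))   # teacher sees past + future frames
loss = temporal_distillation_loss(student_out, teacher_out)
```

In DINO-style training the teacher receives no gradients; its weights are typically an exponential moving average of the student's, so minimizing this loss pushes the past-only student toward the teacher's future-aware representation.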

Related research

- 07/24/2023, Multiscale Video Pretraining for Long-Term Activity Forecasting: Long-term activity forecasting is an especially challenging research pro...
- 11/26/2021, Self-supervised Pretraining with Classification Labels for Temporal Activity Detection: Temporal Activity Detection aims to predict activity classes per frame, ...
- 08/22/2018, Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition: We propose a self-supervised learning method to jointly reason about spa...
- 11/11/2020, Unsupervised Video Representation Learning by Bidirectional Feature Prediction: This paper introduces a novel method for self-supervised video represent...
- 06/17/2021, Long-Short Temporal Contrastive Learning of Video Transformers: Video transformers have recently emerged as a competitive alternative to...
- 10/20/2022, Cyclical Self-Supervision for Semi-Supervised Ejection Fraction Prediction from Echocardiogram Videos: Left-ventricular ejection fraction (LVEF) is an important indicator of h...
- 01/07/2021, Learning Temporal Dynamics from Cycles in Narrated Video: Learning to model how the world changes as time elapses has proven a cha...
