Video + CLIP Baseline for Ego4D Long-term Action Anticipation

07/01/2022
by Srijan Das, et al.

In this report, we introduce our adaptation of image-text models for long-term action anticipation. Our Video + CLIP framework makes use of a large-scale pretrained paired image-text model (CLIP) and a video encoder (SlowFast network). The CLIP embedding provides a fine-grained understanding of the objects relevant to an action, whereas the SlowFast network models the temporal information within a video clip of a few frames. We show that the features obtained from the two encoders are complementary, and the combined model therefore outperforms the Ego4D baseline on the long-term action anticipation task. Our code is available at github.com/srijandas07/clip_baseline_LTA_Ego4d.
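As a rough illustration of how the two feature streams can be combined, below is a minimal PyTorch sketch of a fusion head. It assumes the CLIP and SlowFast features have already been extracted and are fused by simple concatenation before per-future-step verb and noun classifiers; the feature dimensions, class counts, number of anticipated actions, and all names in the sketch are illustrative placeholders, not the authors' implementation (see the linked repository for that).

import torch
import torch.nn as nn

class VideoCLIPFusionHead(nn.Module):
    """Hypothetical fusion head: concatenates CLIP and SlowFast features and
    predicts a (verb, noun) pair for each anticipated future action."""

    def __init__(self, clip_dim=512, slowfast_dim=2304, hidden_dim=1024,
                 num_verbs=115, num_nouns=478, num_future_actions=20):
        super().__init__()
        self.num_future_actions = num_future_actions
        self.num_verbs = num_verbs
        self.num_nouns = num_nouns
        # Simple MLP over the concatenated features (an assumption, not
        # necessarily the fusion used in the paper).
        self.fuse = nn.Sequential(
            nn.Linear(clip_dim + slowfast_dim, hidden_dim),
            nn.ReLU(),
        )
        self.verb_head = nn.Linear(hidden_dim, num_future_actions * num_verbs)
        self.noun_head = nn.Linear(hidden_dim, num_future_actions * num_nouns)

    def forward(self, clip_feat, slowfast_feat):
        # clip_feat: (B, clip_dim) image-level CLIP embedding, e.g. averaged over frames.
        # slowfast_feat: (B, slowfast_dim) clip-level SlowFast embedding.
        x = self.fuse(torch.cat([clip_feat, slowfast_feat], dim=-1))
        verbs = self.verb_head(x).view(-1, self.num_future_actions, self.num_verbs)
        nouns = self.noun_head(x).view(-1, self.num_future_actions, self.num_nouns)
        return verbs, nouns

# Usage with random placeholder features:
head = VideoCLIPFusionHead()
clip_feat = torch.randn(4, 512)
slowfast_feat = torch.randn(4, 2304)
verb_logits, noun_logits = head(clip_feat, slowfast_feat)  # (4, 20, 115), (4, 20, 478)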
