Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

06/15/2022
by   Elad Ben-Avraham, et al.

This technical report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge. We propose a learning framework, StructureViT (SViT for short), which demonstrates how the structure of a small number of images, available only during training, can improve a video model. SViT relies on two key insights. First, since both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual video frames should "align" with those of still images. This is achieved via a "Frame-Clip Consistency" loss, which ensures the flow of structured information between images and videos. SViT obtains strong performance on the challenge test set, with an absolute temporal localization error of 0.656.
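To make the second insight more concrete, here is a minimal sketch of what a frame-clip consistency term could look like. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the distance function, so mean squared difference is assumed here, and the function name `frame_clip_consistency` and the nested-list token layout are hypothetical.

```python
from typing import List

Vector = List[float]
TokenSet = List[Vector]  # object tokens for one frame: [num_tokens][dim]


def frame_clip_consistency(clip_tokens: List[TokenSet],
                           frame_tokens: List[TokenSet]) -> float:
    """Mean squared difference between two sets of object tokens.

    clip_tokens:  per-frame object tokens computed with the whole clip as context.
    frame_tokens: object tokens for the same frames, each processed as a still image.
    The squared-difference distance is an assumption made for illustration.
    """
    if len(clip_tokens) != len(frame_tokens):
        raise ValueError("both inputs must cover the same number of frames")

    total, count = 0.0, 0
    for clip_frame, still_frame in zip(clip_tokens, frame_tokens):
        if len(clip_frame) != len(still_frame):
            raise ValueError("each frame must have the same number of object tokens")
        for clip_tok, still_tok in zip(clip_frame, still_frame):
            if len(clip_tok) != len(still_tok):
                raise ValueError("object tokens must share one embedding size")
            for a, b in zip(clip_tok, still_tok):
                total += (a - b) ** 2
                count += 1

    if count == 0:
        raise ValueError("no token values to compare")
    return total / count


# One frame, one object token of dimension 2.
aligned = frame_clip_consistency([[[1.0, 2.0]]], [[[1.0, 2.0]]])
misaligned = frame_clip_consistency([[[1.0, 2.0]]], [[[0.0, 0.0]]])
```

In this sketch the term is zero when the clip-level and still-image tokens agree and grows as they drift apart, which is the sense in which minimizing it would push the two representations to "align".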


