Temporal Aggregate Representations for Long Term Video Understanding

06/01/2020
by Fadime Sener, et al.

Future prediction requires reasoning from current and past observations and raises several fundamental questions. How much past information is necessary? What is a reasonable temporal scale to process the past? How much semantic abstraction is required? We address all of these questions with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state-of-the-art results in both next action and dense anticipation using simple techniques such as max pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on the Breakfast Actions, 50Salads and EPIC-Kitchens datasets where we achieve state-of-the-art or comparable results. We also show that our model can be used for temporal video segmentation and action recognition with minimal modifications.
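The idea of aggregating the past at several temporal scales with max pooling and attention can be illustrated with a small sketch. This is not the authors' architecture, which the abstract does not specify; it only assumes that per-frame features are available as a `(T, D)` array. The function names and the parameters `spans`, `num_snippets`, and `recent` are hypothetical and chosen for illustration.

```python
import numpy as np

def pooled_snippets(features, span, num_snippets):
    """Max-pool the last `span` frames into `num_snippets` snippet vectors."""
    window = features[-span:]                       # (span, D)
    chunks = np.array_split(window, num_snippets)   # nearly equal segments in time
    return np.stack([c.max(axis=0) for c in chunks])  # (num_snippets, D)

def attend(query, keys):
    """Scaled dot-product attention of one query vector over key vectors."""
    scores = keys @ query / np.sqrt(query.shape[0])  # (K,)
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys, weights                   # context (D,), weights (K,)

def aggregate(features, spans=(10, 20, 40), num_snippets=5, recent=5):
    """Concatenate a recent summary with one attended context per past span."""
    query = features[-recent:].max(axis=0)           # summary of the most recent frames
    contexts = [
        attend(query, pooled_snippets(features, s, num_snippets))[0]
        for s in spans
    ]
    return np.concatenate([query] + contexts)        # (D * (1 + len(spans)),)

# Hypothetical input: 60 frames of 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 8))
rep = aggregate(feats)
```

Each entry in `spans` answers "how much past" at a different scale, while `num_snippets` sets the temporal granularity within that span. A learned model would replace the raw dot products with trained projections.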


