How Much Temporal Long-Term Context is Needed for Action Segmentation?

08/22/2023
by Emad Bahrami, et al.

Modeling long-term context in videos is crucial for many fine-grained tasks, including temporal action segmentation. An open question is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, doing so becomes computationally prohibitive for long videos. Recent works on temporal action segmentation therefore combine temporal convolutional networks with self-attention computed only within a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we address how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation: 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation.
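
The trade-off between full and local-window attention described in the abstract can be illustrated with a small sketch. It is not the authors' sparse attention, whose details are not given here; it only counts query-key pairs to show why full self-attention grows quadratically with video length, and builds a local-window mask to show that distant frames cannot attend to each other. The function names, the video length, and the window size are hypothetical choices made for illustration.

```python
from __future__ import annotations


def full_attention_pairs(num_frames: int) -> int:
    """Query-key pairs when every frame attends to every frame (quadratic)."""
    return num_frames * num_frames


def local_attention_pairs(num_frames: int, window: int) -> int:
    """Query-key pairs when frame i attends only to frames j with
    |i - j| <= window // 2, clipped at the video boundaries."""
    half = window // 2
    total = 0
    for i in range(num_frames):
        first = max(0, i - half)
        last = min(num_frames - 1, i + half)
        total += last - first + 1
    return total


def local_attention_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i is allowed to attend to frame j."""
    half = window // 2
    return [
        [abs(i - j) <= half for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    num_frames = 10_000  # hypothetical video length in frames
    window = 65          # hypothetical local window size

    print("full attention pairs: ", full_attention_pairs(num_frames))
    print("local attention pairs:", local_attention_pairs(num_frames, window))

    # With a local window, the first frame never sees the last one directly.
    mask = local_attention_mask(8, 3)
    print("frame 0 attends to frame 7:", mask[0][7])
```

Under these assumptions, the local variant scales roughly linearly in the number of frames for a fixed window, which is the efficiency that motivates windowed attention, at the price of frames outside the window being invisible to each other within a single attention layer.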


