Enhancing Transformer Backbone for Egocentric Video Action Segmentation

05/19/2023
by Sakib Reza, et al.

Egocentric temporal action segmentation in videos is a crucial task in computer vision, with applications in fields such as mixed reality, human behavior analysis, and robotics. Although recent research has utilized advanced visual-language frameworks, transformers remain the backbone of action segmentation models, so improving the transformer itself is necessary to enhance the robustness of these models. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism to adaptively capture hierarchical representations in both local-to-global and global-to-local contexts. Second, we incorporate cross-connections between the encoder and decoder blocks to prevent the decoder from losing local context. Additionally, we utilize state-of-the-art visual-language representation learning techniques to extract richer and more compact features for our transformer. Our proposed approach outperforms other state-of-the-art methods on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools datasets, and we validate the introduced components with ablation studies. The source code and supplementary materials are publicly available at https://www.sail-nu.com/dxformer.
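The abstract does not include an implementation. As one plausible reading of the dual dilated attention idea, the sketch below pairs a band-attention mask whose window grows with layer depth (local-to-global) with a second mask whose window shrinks with depth (global-to-local), and averages the two branches. The function names, the band-mask form of "dilated" attention, and the averaging fusion are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dilated_mask(T, window):
    """Boolean band mask: position i may attend to j iff |i - j| <= window."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention restricted by a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)          # block disallowed positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)             # row-wise softmax
    return w @ v

def dual_dilated_attention(x, layer, num_layers):
    """Fuse a local-to-global branch (window 2^layer, growing with depth)
    with a global-to-local branch (window 2^(num_layers-1-layer), shrinking)."""
    T = x.shape[0]
    m_l2g = dilated_mask(T, 2 ** layer)
    m_g2l = dilated_mask(T, 2 ** (num_layers - 1 - layer))
    return 0.5 * (masked_attention(x, x, x, m_l2g) +
                  masked_attention(x, x, x, m_g2l))
```

In this reading, early layers see one narrow and one wide receptive field simultaneously, so each block mixes fine boundary cues with long-range context rather than acquiring context only gradually with depth.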
