
Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training

by Saurabh Sahu, et al.

The introduction of the Transformer model has led to tremendous advances in sequence modeling, especially in the text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce the Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M dataset on the task of video categorization. We further present ablation studies along with quantitative and qualitative analysis to showcase the improvement.
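The abstract does not spell out how the multi-level attention gate is computed, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes that each frame attends once over a small temporal window (local context) and once over the whole clip (global context), and that a sigmoid gate blends the two results. The function names, the `window` parameter, and the gate parameters `gate_w` and `gate_b` are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, frames):
    """Scaled dot-product attention of one query over a list of frame vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in frames])
    return [sum(w * f[d] for w, f in zip(weights, frames))
            for d in range(len(query))]


def gated_multi_level_attention(frames, window=1, gate_w=None, gate_b=0.0):
    """Blend local-context and global-context attention for every frame.

    Assumed (hypothetical) gate:
        g   = sigmoid(gate_w . [local; global] + gate_b)
        out = g * local + (1 - g) * global
    """
    dim = len(frames[0])
    if gate_w is None:
        gate_w = [0.0] * (2 * dim)  # zero weights give g = 0.5, an equal blend
    outputs = []
    for t, query in enumerate(frames):
        start, stop = max(0, t - window), min(len(frames), t + window + 1)
        local = attend(query, frames[start:stop])   # neighbouring frames only
        global_ = attend(query, frames)             # every frame in the clip
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w, local + global_) + gate_b)))
        outputs.append([g * l + (1.0 - g) * v for l, v in zip(local, global_)])
    return outputs


# Toy clip: four 2-dimensional frame features.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
out = gated_multi_level_attention(frames, window=1)
```

In a trained model the gate parameters would be learned, and the frame features would come from a video encoder. Here they are fixed toy values, chosen only to show how a per-frame gate can trade off fine-grained and clip-level context.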




Can't Fool Me: Adversarially Robust Transformer for Video Understanding

Deep neural networks have been shown to perform poorly on adversarial ex...

Leveraging Local Temporal Information for Multimodal Scene Classification

Robust video scene classification models should capture the spatial (pix...

Impact of Attention on Adversarial Robustness of Image Classification Models

Adversarial attacks against deep learning models have gained significant...

Attention is all you need for Videos: Self-attention based Video Summarization using Universal Transformers

Video Captioning and Summarization have become very popular in the recen...

Multi-attention Networks for Temporal Localization of Video-level Labels

Temporal localization remains an important challenge in video understand...

Recur, Attend or Convolve? Frame Dependency Modeling Matters for Cross-Domain Robustness in Action Recognition

Most action recognition models today are highly parameterized, and evalu...