SBAT: Video Captioning with Sparse Boundary-Aware Transformer

07/23/2020
by Tao Jin, et al.

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. Video captioning, however, is a multimodal learning problem, and video features contain substantial redundancy across time steps. Motivated by these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multi-head attention and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. Building on SBAT, we further propose an aligned cross-modal encoding scheme to boost multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods on most metrics.
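The core idea of boundary-aware sparse selection can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the authors' implementation: it keeps only the time steps where an aggregated attention score changes most sharply between adjacent frames, on the assumption that large jumps mark scenario boundaries while small changes indicate redundant content. The function name and shapes are my own choices for the example.

```python
import numpy as np

def boundary_aware_select(features, attn_scores, k):
    """Hypothetical sketch of boundary-aware sparse selection.

    features:    (T, D) per-time-step video features
    attn_scores: (T,)   aggregated attention scores per time step
    k:           number of time steps to keep
    """
    # Magnitude of change between adjacent time steps; large jumps
    # suggest a scenario boundary rather than redundant content.
    diffs = np.abs(np.diff(attn_scores, prepend=attn_scores[0]))
    # Top-k most boundary-like positions, restored to temporal order.
    keep = np.sort(np.argsort(diffs)[-k:])
    return features[keep], keep

# Toy usage: 16 time steps of 8-dim features, keep 4 boundary frames.
T, D = 16, 8
rng = np.random.default_rng(0)
feats = rng.normal(size=(T, D))
scores = rng.normal(size=T)
selected, idx = boundary_aware_select(feats, scores, k=4)
print(selected.shape)  # (4, 8)
```

A local correlation scheme, as described in the abstract, would then reintroduce neighborhood information around the selected steps; that part is omitted here for brevity.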

Related research:

04/15/2018: Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning
A major challenge for video captioning is to combine audio and visual cu...

10/21/2020: TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
Audio Visual Scene-aware Dialog (AVSD) is a task to generate responses w...

08/07/2023: Redundancy-aware Transformer for Video Question Answering
This paper identifies two kinds of redundancy in the current VideoQA par...

11/01/2019: Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
This paper addresses the challenging task of video captioning which aims...

10/07/2019: SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability
The ability to generate natural language explanations conditioned on the...

10/13/2022: RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Video language pre-training methods have mainly adopted sparse sampling ...

04/01/2023: SVT: Supertoken Video Transformer for Efficient Video Understanding
Whether by processing videos with fixed resolution from start to end or ...
