Redundancy-aware Transformer for Video Question Answering

08/07/2023
by Yicong Li, et al.

This paper identifies two kinds of redundancy in the current VideoQA paradigm. First, current video encoders tend to embed all video clues at different granularities holistically and hierarchically, which introduces neighboring-frame redundancy: near-duplicate adjacent frames can overwhelm detailed visual clues at the object level. Second, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens, without differentiating their pairwise vision-language interactions, which harms answer prediction. To this end, we propose a novel transformer-based architecture that models VideoQA in a redundancy-aware manner. To address neighboring-frame redundancy, we introduce a video encoder that emphasizes object-level change across neighboring frames and adopts an out-of-neighboring message-passing scheme that applies attention only to distant frames. To address cross-modal redundancy, we equip the fusion module with a novel adaptive sampling mechanism that explicitly differentiates vision-language interactions by identifying the small subset of visual elements that actually supports the answer. With these designs, the resulting Redundancy-aware transformer (RaFormer) achieves state-of-the-art results on multiple VideoQA benchmarks.
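
The two designs lend themselves to concrete illustration. Below is a minimal PyTorch-style sketch of the out-of-neighboring message-passing idea: each frame is barred from attending to its immediate temporal neighbors, so attention only aggregates distant, less redundant frames. The function names and the neighborhood radius are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def out_of_neighborhood_mask(num_frames: int, radius: int) -> torch.Tensor:
    # True where attention is allowed: only frames more than `radius`
    # steps away, plus the frame itself (so no row is fully masked).
    idx = torch.arange(num_frames)
    dist = (idx[None, :] - idx[:, None]).abs()
    return (dist > radius) | torch.eye(num_frames, dtype=torch.bool)

def out_of_neighborhood_attention(frames: torch.Tensor, radius: int = 2) -> torch.Tensor:
    # frames: (T, d) per-frame features. Single-head self-attention
    # restricted to distant frames, suppressing near-duplicate neighbors.
    T, d = frames.shape
    scores = frames @ frames.t() / d ** 0.5
    scores = scores.masked_fill(~out_of_neighborhood_mask(T, radius), float("-inf"))
    return F.softmax(scores, dim=-1) @ frames
```

The adaptive sampling for cross-modal fusion can likewise be sketched as a question-conditioned top-k selection over visual tokens; the hard top-k below is a stand-in for whatever learned selection the paper actually uses:

```python
def adaptive_sample(visual: torch.Tensor, question: torch.Tensor, k: int) -> torch.Tensor:
    # visual: (N, d) visual elements (e.g. object or frame tokens);
    # question: (d,) pooled question embedding.
    relevance = visual @ question                      # (N,) vision-language affinity
    topk = relevance.topk(min(k, visual.shape[0])).indices
    return visual[topk]                                # only answer-supporting tokens are fused
```

Only the sampled subset is passed on to vision-language fusion, which is what "explicitly differentiating" the pairwise interactions amounts to in this sketch.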

Related research

10/13/2022 · RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Video language pre-training methods have mainly adopted sparse sampling ...

12/04/2021 · LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Referring image segmentation is a fundamental vision-language task that ...

07/23/2020 · SBAT: Video Captioning with Sparse Boundary-Aware Transformer
In this paper, we focus on the problem of applying the transformer struc...

06/19/2021 · Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Video Question Answering is a task which requires an AI agent to answer ...

12/02/2022 · Compound Tokens: Channel Fusion for Vision-Language Representation Learning
We present an effective method for fusing visual-and-language representa...

07/05/2019 · Video Question Generation via Cross-Modal Self-Attention Networks Learning
Video Question Answering (Video QA) is a critical and challenging task i...

03/02/2023 · MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Recently, finetuning pretrained vision-language models (VLMs) has become...
