SAS Video-QA: Self-Adaptive Sampling for Efficient Video Question-Answering

by   Wei Han, et al.

Video question–answering is a fundamental task in the field of video understanding. Although current vision–language models (VLMs) equipped with Video Transformers have enabled temporal modeling and yielded superior results, they are at the cost of huge computational power and thus too expensive to deploy in real-time application scenarios. An economical workaround only samples a small portion of frames to represent the main content of that video and tune an image–text model on these sampled frames. Recent video understanding models usually randomly sample a set of frames or clips, regardless of internal correlations between their visual contents, nor their relevance to the problem. We argue that such kinds of aimless sampling may omit the key frames from which the correct answer can be deduced, and the situation gets worse when the sampling sparsity increases, which always happens as the video lengths increase. To mitigate this issue, we propose two frame sampling strategies, namely the most domain frames (MDF) and most implied frames (MIF), to maximally preserve those frames that are most likely vital to the given questions. MDF passively minimizes the risk of key frame omission in a bootstrap manner, while MIS actively searches key frames customized for each video–question pair with the assistance of auxiliary models. The experimental results on three public datasets from three advanced VLMs (CLIP, GIT and All-in-one) demonstrate that our proposed strategies can boost the performance for image–text pretrained models. The source codes pertaining to the method proposed in this paper are publicly available at


CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

Most existing research on visual question answering (VQA) is limited to ...

Self-Chained Image-Language Model for Video Localization and Question Answering

Recent studies have shown promising results on utilizing pre-trained ima...

Revealing Single Frame Bias for Video-and-Language Learning

Training an effective video-and-language model intuitively requires mult...

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that ca...

SViTT: Temporal Learning of Sparse Video-Text Transformers

Do video-text transformers learn to model temporal relationships across ...

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling

The canonical approach to video-and-language learning (e.g., video quest...

Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism

In this paper, Gated-ViGAT, an efficient approach for video event recogn...

Please sign up or login with your details

Forgot password? Click here to reset