MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

12/19/2022
by Difei Gao, et al.

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios that require multi-event and multi-granularity visual reasoning. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select the frames and image regions most relevant to the question. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. Experimental results on four VideoQA datasets, AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior in computational efficiency and interpretability.
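To make the cascaded selection idea concrete, here is a minimal PyTorch-style sketch (not the authors' code) of one iterative spatial-temporal layer: question-conditioned segment selection, region selection within the chosen segments, and self-attention over the surviving tokens. All module names, shapes, and the top-k hyperparameters are illustrative assumptions.

```python
# Illustrative sketch of a MIST-style selection-then-attention layer.
# Not the authors' implementation; names and hyperparameters are assumed.
import torch
import torch.nn as nn


class IterativeSelectionLayer(nn.Module):
    def __init__(self, dim=512, n_heads=8, top_k_segments=4, top_k_regions=12):
        super().__init__()
        self.seg_score = nn.Linear(dim, dim)   # scores segments against the question
        self.reg_score = nn.Linear(dim, dim)   # scores regions against the question
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.top_k_segments = top_k_segments
        self.top_k_regions = top_k_regions     # assumes R >= top_k_regions

    def forward(self, patch_tokens, question_emb):
        # patch_tokens: (B, T, R, D) region features for T segments, R regions each
        # question_emb: (B, D) pooled question representation
        B, T, R, D = patch_tokens.shape
        seg_feat = patch_tokens.mean(dim=2)                 # (B, T, D) segment summaries
        q = question_emb.unsqueeze(1)                       # (B, 1, D)

        # 1) Segment selection: keep the top-k segments most relevant to the question.
        seg_logits = (self.seg_score(seg_feat) * q).sum(-1)                 # (B, T)
        seg_idx = seg_logits.topk(self.top_k_segments, dim=1).indices
        sel = torch.gather(patch_tokens, 1,
                           seg_idx[..., None, None].expand(-1, -1, R, D))   # (B, k_t, R, D)

        # 2) Region selection within the selected segments.
        reg_logits = (self.reg_score(sel) * q[:, :, None, :]).sum(-1)       # (B, k_t, R)
        reg_idx = reg_logits.topk(self.top_k_regions, dim=2).indices
        sel = torch.gather(sel, 2, reg_idx[..., None].expand(-1, -1, -1, D))

        # 3) Self-attention over the much smaller set of selected tokens,
        #    avoiding dense attention over all T * R patches.
        tokens = sel.flatten(1, 2)                          # (B, k_t * k_r, D)
        out, _ = self.attn(tokens, tokens, tokens)
        return out
```

Stacking several such layers, each re-selecting segments and regions conditioned on the question, is what would allow reasoning over multiple events while keeping the attended token count small.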


