Is a Video worth n×n Images? A Highly Efficient Approach to Transformer-based Video Question Answering

05/16/2023
by Chenyang Lyu, et al.

Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between the frames and the question. However, such a scheme incurs significant memory use and inevitably slows down training and inference. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we concatenate video frames into an n×n matrix and then convert it into one image. By doing so, we reduce the number of image encoder invocations from n^2 to 1 while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance at nearly 4× the speed and with only 30% of the memory use. We show that by integrating our approach into VideoQA systems we can achieve comparable, even superior, performance with a significant speed-up for training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for those with limited budgets and resources. Our code will be made publicly available for research use.
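To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of tiling n^2 sampled frames into a single composite image that can then be fed to an image encoder in one forward pass. The function name frames_to_grid, the tensor layout, and the example sizes are assumptions for illustration; details such as frame sampling and resizing to the encoder's input resolution are not specified here.

```python
import torch

def frames_to_grid(frames: torch.Tensor, n: int) -> torch.Tensor:
    """Tile n*n frames, shaped (n*n, C, H, W), into one (C, n*H, n*W) image.

    Frames are placed left-to-right, top-to-bottom, so the raster order of
    the grid preserves the temporal order of the sampled frames.
    """
    assert frames.shape[0] == n * n, "expected exactly n*n sampled frames"
    c, h, w = frames.shape[1:]
    grid = frames.reshape(n, n, c, h, w)   # (rows, cols, C, H, W)
    grid = grid.permute(2, 0, 3, 1, 4)     # (C, rows, H, cols, W)
    return grid.reshape(c, n * h, n * w)   # one composite image

# Example: 16 frames arranged as a 4x4 grid -> one image, one encoder call.
frames = torch.randn(16, 3, 224, 224)     # n^2 = 16 sampled RGB frames
image = frames_to_grid(frames, n=4)       # shape: (3, 896, 896)
```

In this sketch the composite is n times larger per side than a single frame, so in practice it would be downscaled to the encoder's expected input size, trading per-frame resolution for a single encoder pass.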

