Revealing Single Frame Bias for Video-and-Language Learning

06/07/2022
by Jie Lei, et al.

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks and, if so, whether the performance gain is worth the drastically increased computation and memory costs that come with using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
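
The abstract does not spell out the frame ensemble strategy, so the following is only a minimal sketch of one possible reading: sample a few frames uniformly at inference time, score each frame against the text with the single-frame model, and aggregate the per-frame scores. The names `uniform_frame_indices`, `ensemble_score`, and `score_frame` are hypothetical and are not taken from the authors' code; averaging (or taking the maximum of) per-frame scores is an assumed aggregation rule, not necessarily the one used in the paper.

```python
from statistics import mean
from typing import Any, Callable, Iterable, List, Sequence


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Return the middle frame index of each of `num_samples` equal segments."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    # Never request more samples than there are frames.
    num_samples = min(num_samples, num_frames)
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def ensemble_score(
    frames: Sequence[Any],
    text: str,
    score_frame: Callable[[Any, str], float],
    num_samples: int = 4,
    reduce: Callable[[Iterable[float]], float] = mean,
) -> float:
    """Aggregate single-frame scores over uniformly sampled frames.

    `score_frame` stands in for a single-frame model (hypothetical here):
    it sees one frame and the text, and has no access to temporal order.
    `reduce` is the assumed aggregation rule (mean by default, or e.g. max).
    """
    indices = uniform_frame_indices(len(frames), num_samples)
    return reduce(score_frame(frames[i], text) for i in indices)


if __name__ == "__main__":
    # Toy stand-ins: each "frame" is a number and the "model" is a product.
    toy_frames = [float(i) for i in range(8)]
    toy_scorer = lambda frame, text: frame * len(text)
    print(ensemble_score(toy_frames, "ab", toy_scorer))              # mean
    print(ensemble_score(toy_frames, "ab", toy_scorer, reduce=max))  # max
```

Because each frame is scored independently, this kind of aggregation uses no information about frame order, which is consistent with the abstract's description of a model that does not consider temporal information.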


research
02/20/2023

STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training

Although large-scale video-language pre-training models, which usually b...
research
03/26/2023

Frame Flexible Network

Existing video recognition algorithms always conduct different training ...
research
09/17/2023

Towards Debiasing Frame Length Bias in Text-Video Retrieval via Causal Intervention

Many studies focus on improving pretraining or developing new backbones ...
research
07/25/2020

Approximated Bilinear Modules for Temporal Modeling

We consider two less-emphasized temporal properties of video: 1. Tempora...
research
07/09/2023

SAS Video-QA: Self-Adaptive Sampling for Efficient Video Question-Answering

Video question–answering is a fundamental task in the field of video und...
research
07/19/2019

Only Time Can Tell: Discovering Temporal Data for Temporal Modeling

Understanding temporal information and how the visual world changes over...
research
01/25/2022

Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition

We address the problem of capturing temporal information for video class...
