Semi-Parametric Video-Grounded Text Generation

01/27/2023
by   Sungdong Kim, et al.

Efficient video-language modeling must account for computational cost, because the number of video frames is large and sometimes intractable. Parametric approaches such as the attention mechanism may not be ideal, since their computational cost grows quadratically with video length. Previous studies have instead relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, which offers a novel perspective on scalable video-language modeling for long untrimmed videos. Treating a video as an external data store, SeViT combines a non-parametric frame retriever, which selects a few query-relevant frames from the data store for a given query, with a parametric generator, which aggregates the selected frames with the query via late fusion methods. Experimental results demonstrate that our method has a significant advantage on longer videos and on causal video understanding. Moreover, our model achieves a new state of the art on four video-language datasets: iVQA (+4.8), NExT-QA (+6.9), and ActivityNet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.
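The retrieve-then-generate idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`retrieve_top_k`, `late_fusion`), the cosine-similarity scoring, the toy two-dimensional embeddings, and the use of simple averaging as the late-fusion step are all assumptions made for illustration. It only shows how a query might select a few frames from a frame store and how per-frame predictions might then be combined.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec: Sequence[float],
                   frame_vecs: List[Sequence[float]],
                   k: int) -> List[int]:
    """Hypothetical non-parametric retriever: score every stored frame
    against the query and return the indices of the k best frames,
    re-sorted into temporal order."""
    scores = [cosine(query_vec, f) for f in frame_vecs]
    ranked = sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def late_fusion(per_frame_scores: List[List[float]]) -> List[float]:
    """Assumed late-fusion stand-in: average the answer scores that a
    generator produced independently for each retrieved frame."""
    n = len(per_frame_scores)
    return [sum(column) / n for column in zip(*per_frame_scores)]


# Toy frame store: each row is an (assumed) precomputed frame embedding.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
query = [0.0, 1.0]

selected = retrieve_top_k(query, frames, k=2)
# Pretend the generator scored two candidate answers for each selected frame.
fused = late_fusion([[0.2, 0.8], [0.6, 0.4]])
print(selected, fused)
```

In this sketch the generator only ever sees `k` frames, so its cost depends on `k` rather than on the full video length, which is the scalability argument the abstract makes.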


research
10/10/2022

Contrastive Video-Language Learning with Fine-grained Frame Sampling

Despite recent progress in video and language representation learning, t...
research
07/05/2019

Video Question Generation via Cross-Modal Self-Attention Networks Learning

Video Question Answering (Video QA) is a critical and challenging task i...
research
12/20/2014

Video (language) modeling: a baseline for generative models of natural videos

We propose a strong baseline model for unsupervised feature learning usi...
research
10/16/2022

Efficient Cross-Modal Video Retrieval with Meta-Optimized Frames

Cross-modal video retrieval aims to retrieve the semantically relevant v...
research
04/07/2022

HunYuan_tvr for Text-Video Retrivial

Text-Video Retrieval plays an important role in multi-modal understandin...
research
03/11/2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Joint video-language learning has received increasing attention in recen...
research
04/27/2021

FrameExit: Conditional Early Exiting for Efficient Video Recognition

In this paper, we propose a conditional early exiting framework for effi...
