Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training

07/05/2020
by   Yingwei Pan, et al.
11

In this work, we present Auto-captions on GIF, which is a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages. Auto-captions on GIF dataset can be utilized to pre-train the generic feature representation or encoder-decoder structure for video captioning, and other downstream tasks (e.g., sentence localization in videos, video question answering, etc.) as well. We present a detailed analysis of Auto-captions on GIF dataset in comparison to existing video-sentence datasets. We also provide an evaluation of a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to video captioning downstream task and yields the compelling generalizability on MSR-VTT. The dataset is available at <http://www.auto-video-captions.top/2020/dataset>.

READ FULL TEXT

page 1

page 3

research
02/17/2021

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

The availability of large-scale image captioning and visual question ans...
research
10/27/2019

Memeify: A Large-Scale Meme Generation System

Interest in the research areas related to meme propagation and generatio...
research
01/11/2022

Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Vision-language pre-training has been an emerging and fast-developing re...
research
05/31/2023

Too Large; Data Reduction for Vision-Language Pre-Training

This paper examines the problems of severe image-text misalignment and h...
research
10/13/2021

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sent...
research
11/22/2021

RedCaps: web-curated image-text data created by the people, for the people

Large datasets of paired images and text have become increasingly popula...
research
08/26/2021

LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision

Computer vision tasks such as object detection and semantic/instance seg...

Please sign up or login with your details

Forgot password? Click here to reset