Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

04/04/2023
by Vladislav Lialin, et al.

Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity and low diversity of aligned data and the difficulty of collecting it at scale. The currently popular video-text mining approach via automatic speech recognition (ASR), used in HowTo100M, produces low-quality captions that often do not refer to the video content. Other mining approaches either do not provide proper language descriptions (video tags) or are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models based on an OPT language model and a TimeSformer visual backbone, and fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to existing pre-training and data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we are planning to publicly release the generated captions.
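The core idea — replacing ASR captions with image-captioning pseudolabels — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the frame-sampling scheme, the `caption_fn` interface, and all function names here are assumptions, with any off-the-shelf image captioner plugged in as `caption_fn`.

```python
# Sketch of video pseudolabeling via image captioning.
# Assumption: frames are sampled uniformly from the clip and each sampled
# frame is captioned independently; the resulting captions serve as weak
# text supervision for the whole video. Names are illustrative only.

def sample_frame_indices(num_frames, num_samples):
    """Pick num_samples indices spread uniformly across the video."""
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the midpoint of each of num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]

def pseudolabel_video(frames, caption_fn, num_samples=8):
    """Caption uniformly sampled frames; returns a list of pseudo-captions.

    caption_fn: any image captioning model wrapped as frame -> str,
    e.g. a pre-trained image captioner (an assumption, not specified here).
    """
    idxs = sample_frame_indices(len(frames), num_samples)
    return [caption_fn(frames[i]) for i in idxs]
```

For example, a clip of 100 frames with `num_samples=4` would be captioned at frames 12, 37, 62, and 87, and those four captions would form the clip's pseudo-caption set for pre-training.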

