Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

11/03/2022
by Junru Wu, et al.

Self-supervised pre-training has recently demonstrated success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce feature consistency between cross-modality inputs, such as video/audio or video/text pairs. Despite being convenient to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy form of supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in the commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame; and semantic misalignment is only more unpredictable for raw videos from the internet. We conjecture that such misalignment may cause conflicts and biases among modalities, and may hence prohibit CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training, which uses only instructional videos, there exist strong gradient conflicts between the different CMA losses within the same (video, audio, text) triplet, indicating them as a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment, which modifies the different CMA loss gradients of each sample triplet so that their gradient directions are more aligned; and (ii) gradient-based curriculum learning, which uses the gradient conflict information as an indicator of sample noisiness and prioritizes training on less noisy sample triplets. Applying these techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative YouTube-8M dataset, further improving the state of the art.
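To make the two techniques concrete, the sketch below illustrates their general flavor in PyTorch. It is a minimal, hypothetical rendering under stated assumptions, not the paper's implementation: the PCGrad-style projection rule, the cosine-based noisiness score, all function names, and the fixed curriculum threshold are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def realign_gradients(g_a: torch.Tensor, g_b: torch.Tensor):
    """PCGrad-style surgery on two flattened loss gradients: if they
    conflict (negative inner product), project each one off the other's
    direction so the remaining components no longer oppose each other.
    A sketch of the realignment idea, not the paper's exact rule."""
    dot = torch.dot(g_a, g_b)
    if dot < 0:  # gradients conflict
        g_a_out = g_a - (dot / g_b.pow(2).sum()) * g_b
        g_b_out = g_b - (dot / g_a.pow(2).sum()) * g_a
        return g_a_out, g_b_out
    return g_a, g_b

def noisiness_score(g_a: torch.Tensor, g_b: torch.Tensor) -> float:
    """Map the conflict between the two CMA loss gradients of one triplet
    to a [0, 1] noisiness score (0 = fully aligned, 1 = fully opposed)."""
    cos = F.cosine_similarity(g_a, g_b, dim=0)
    return float((1.0 - cos) / 2.0)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy per-triplet gradients for the video-audio and video-text losses.
    g_va = torch.randn(8)
    g_vt = -g_va + 0.1 * torch.randn(8)  # nearly opposite: strong conflict

    print("noisiness:", noisiness_score(g_va, g_vt))
    g_va2, g_vt2 = realign_gradients(g_va, g_vt)
    print("inner product after surgery:", torch.dot(g_va2, g_vt2).item())

    # Curriculum idea: defer or down-weight triplets whose noisiness
    # exceeds a threshold; a real schedule would relax it over training.
    threshold = 0.6  # hypothetical value
    print("include triplet this epoch:", noisiness_score(g_va, g_vt) < threshold)
```

In a real pipeline, the per-triplet gradients would come from backpropagating each CMA loss separately through the shared backbone, which is considerably more expensive than this toy example suggests.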

Related research

11/01/2021 · Masking Modalities for Cross-modal Video Retrieval
Pre-training on large scale unlabelled datasets has shown impressive per...

03/12/2023 · Accommodating Audio Modality in CLIP for Multimodal Processing
Multimodal processing has attracted much attention lately especially wit...

04/22/2021 · VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
We present a framework for learning multimodal representations from unla...

05/23/2023 · Training Transitive and Commutative Multimodal Transformers with LoReTTa
Collecting a multimodal dataset with two paired modalities A and B or B ...

11/15/2020 · Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
The task of video and text sequence alignment is a prerequisite step tow...

02/13/2023 · Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data
Language-supervised vision models have recently attracted great attentio...

02/14/2022 · A Survey of Cross-Modality Brain Image Synthesis
The existence of completely aligned and paired multi-modal neuroimaging ...
