Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

04/22/2021
by   Yanbei Chen, et al.

Having access to multi-modal cues (e.g. vision and audio) allows some cognitive tasks to be solved faster than learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover richer multi-modal knowledge. Our main idea is to learn a compositional embedding that closes the cross-modal semantic gap and captures the task-relevant semantics, which facilitates pulling together representations across modalities by compositional contrastive learning. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning. Code is released here: https://github.com/yanbeic/CCL.
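To make the core idea concrete, the sketch below illustrates one plausible form of a compositional contrastive objective: teacher audio and visual embeddings are composed (here, simply by normalized summation, an assumption for illustration) and an InfoNCE-style loss pulls the student's video embedding toward the composed embedding of the same clip while pushing it away from other clips in the batch. This is a minimal sketch of the general technique, not the paper's actual implementation; function names and the composition operator are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def compositional_contrastive_loss(audio, visual, student, temperature=0.1):
    """InfoNCE-style loss over a batch of B clips.

    audio, visual : (B, D) teacher embeddings per clip
    student       : (B, D) student video embeddings
    The i-th composed teacher embedding is the positive for the
    i-th student embedding; all other clips act as negatives.
    Composition by normalized summation is an illustrative choice.
    """
    composed = l2_normalize(l2_normalize(audio) + l2_normalize(visual))
    student = l2_normalize(student)
    logits = student @ composed.T / temperature        # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # match i-th pairs
```

A lower loss indicates the student's representation aligns with the composed audio-visual teacher embedding of the same clip; minimizing it with respect to the student network's parameters is what transfers the multi-modal knowledge.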

Related research:

- XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning (11/25/2022)
- Noise-Tolerant Learning for Audio-Visual Action Recognition (05/16/2022)
- AutoTransition: Learning to Recommend Video Transition Effects (07/27/2022)
- Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition (12/09/2014)
- One Billion Audio Sounds from GPU-enabled Modular Synthesis (04/27/2021)
- Evolving Losses for Unlabeled Video Representation Learning (06/07/2019)
- Multi-Format Contrastive Learning of Audio Representations (03/11/2021)
