XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning

11/25/2022
by   Pritam Sarkar, et al.

We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer, and to tackle the domain gap between the audio and visual modalities, which could otherwise hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, a modality-agnostic variant of our proposed framework is introduced, which uses the same backbone for both modalities. Our proposed cross-modal knowledge distillation improves linear evaluation top-1 accuracy of video action classification by 8.4% and 14.2% on benchmarks including Kinetics-Sound, and the modality-agnostic variant shows promising results towards a general-purpose network capable of handling different data streams. The code is released on the project website.
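The teacher-student distillation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the EMA momentum value, the cosine-style distillation objective, and the feature shapes are all assumptions made for the example, standing in for XKD's actual losses and backbones.

```python
import numpy as np

def ema_update(teacher, student, m=0.99):
    # Teacher parameters track an exponential moving average of the
    # student's, a common self-distillation setup (momentum m assumed).
    return {k: m * teacher[k] + (1 - m) * student[k] for k in teacher}

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def distill_loss(student_feat, teacher_feat):
    # Cross-modal distillation: pull one modality's student features
    # (e.g. audio) toward the other modality's frozen teacher features
    # (e.g. video). Negative cosine similarity is an illustrative
    # stand-in for the paper's objective.
    s, t = l2_normalize(student_feat), l2_normalize(teacher_feat)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy usage: random "audio student" and "video teacher" features.
rng = np.random.default_rng(0)
audio_feat = rng.normal(size=(4, 8))
video_feat = rng.normal(size=(4, 8))
loss = distill_loss(audio_feat, video_feat)
```

In the full framework each modality serves as a teacher for the other, and the domain alignment step would transform the features before this loss is applied so the audio and visual distributions are comparable.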

