Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

04/16/2023
by   Wenke Xia, et al.

Cross-modal distillation has been widely used to transfer knowledge across modalities, enriching the representation of the target unimodal one. Recent studies closely tie the temporal synchronization between vision and sound to the semantic consistency required for cross-modal distillation. However, such semantic consistency is hard to guarantee from synchronization alone in unconstrained videos, due to irrelevant modality noise and differentiated semantic correlation across samples. To this end, we first propose a Modality Noise Filter (MNF) module that erases irrelevant noise in the teacher modality using cross-modal context. After this purification, we design a Contrastive Semantic Calibration (CSC) module that adaptively distills useful knowledge into the target modality, by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method brings a performance boost over other distillation methods on both visual action recognition and video retrieval tasks. We also extend our method to the audio tagging task to demonstrate its generalization. The source code is available at https://github.com/GeWu-Lab/cross-modal-distillation.
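To make the sample-wise calibration idea concrete, the sketch below shows one plausible (hypothetical, not the authors' implementation) way to weight a distillation loss by cross-modal semantic correlation: sample pairs whose teacher and student embeddings already agree (high cosine similarity) are up-weighted via a softmax with temperature `tau`, while weakly correlated pairs contribute less. All function names and the choice of MSE as the distillation objective are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors (plain Python lists).
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-12
    nv = math.sqrt(sum(b * b for b in v)) or 1e-12
    return dot / (nu * nv)

def weighted_distillation_loss(teacher, student, tau=0.5):
    """Hypothetical sample-wise weighted distillation.

    teacher, student: lists of embedding vectors, one pair per sample.
    Pairs with higher cross-modal semantic correlation (cosine similarity)
    receive larger softmax weights, so the student is pulled more strongly
    toward teacher samples whose semantics are actually consistent.
    """
    scores = [math.exp(cosine(t, s) / tau) for t, s in zip(teacher, student)]
    z = sum(scores)
    weights = [w / z for w in scores]
    loss = 0.0
    for w, t, s in zip(weights, teacher, student):
        mse = sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)
        loss += w * mse
    return loss
```

For instance, `weighted_distillation_loss([[1.0, 0.0]], [[1.0, 0.0]])` is exactly `0.0`, since a perfectly aligned pair has zero MSE regardless of its weight.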

Related research

08/26/2022: CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation. In 3D action recognition, there exists rich complementary information be...

05/18/2021: Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching. This paper proposes an approach to Dense Video Captioning (DVC) without ...

06/28/2023: A Dimensional Structure based Knowledge Distillation Method for Cross-Modal Learning. Due to limitations in data quality, some essential visual tasks are diff...

12/12/2020: Periocular in the Wild Embedding Learning with Cross-Modal Consistent Knowledge Distillation. Periocular biometric, or peripheral area of ocular, is a collaborative a...

05/12/2023: MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition. In this paper, we study a novel problem in egocentric action recognition...

07/27/2021: Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization. Weakly supervised temporal action localization (WS-TAL) is a challenging...

12/16/2022: Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning? The success of deep learning heavily relies on large-scale data with com...
