Audio Representation Learning by Distilling Video as Privileged Information

02/06/2023
by Amirhossein Hajavi, et al.

Deep audio representation learning using multi-modal audio-visual data often achieves better performance than uni-modal approaches. However, in real-world scenarios both modalities are not always available at inference time, which degrades the performance of models trained for multi-modal inference. In this work, we propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference. For this purpose, we adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI). While previous methods proposed for LUPI use soft labels generated by the teacher, our method uses embeddings learned by the teacher to train the student network. We integrate our method in two different settings: a sequential setting, where the features are divided into multiple segments over time, and a non-sequential setting, where the entire feature set is treated as one whole segment. In the non-sequential setting, both the teacher and student networks consist of an encoder component and a task header. We use the embeddings produced by the encoder of the teacher to train the encoder of the student, while the task header of the student is trained using ground-truth labels. In the sequential setting, the networks have an additional aggregation component placed between the encoder and the task header, and we use two sets of embeddings, produced by the encoder and the aggregation component of the teacher, to train the student. As in the non-sequential setting, the task header of the student network is trained using ground-truth labels. We test our framework on two audio-visual tasks, namely speaker recognition and speech emotion recognition, and show considerable improvements over audio-only recognition as well as prior works that use LUPI.
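The student objective described above combines an embedding-matching distillation term with a supervised term on ground-truth labels. The following is a minimal sketch of that combined loss for the non-sequential setting, using toy linear maps in place of the real encoders and made-up dimensions; the layer shapes, loss weighting `alpha`, and the choice of MSE as the embedding-matching loss are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: audio feature size, shared embedding size, class count.
FEAT_DIM, EMB_DIM, NUM_CLASSES = 16, 8, 4

# Toy "encoders": a single linear map stands in for each network's encoder.
W_teacher = rng.normal(size=(FEAT_DIM, EMB_DIM))     # frozen teacher (trained on audio-visual data)
W_student = rng.normal(size=(FEAT_DIM, EMB_DIM))     # student, trained to mimic teacher embeddings
W_header = rng.normal(size=(EMB_DIM, NUM_CLASSES))   # student's task header

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def combined_loss(x, y, alpha=1.0):
    """Embedding distillation loss + supervised task loss.

    x: (batch, FEAT_DIM) audio features; y: (batch,) integer class labels.
    """
    emb_teacher = x @ W_teacher   # privileged supervision signal
    emb_student = x @ W_student
    # Distillation term: student embeddings match the teacher's embeddings.
    distill = np.mean((emb_student - emb_teacher) ** 2)
    # Supervised term: task header trained with ground-truth labels.
    probs = softmax(emb_student @ W_header)
    ce = -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))
    return distill + alpha * ce

x = rng.normal(size=(5, FEAT_DIM))
y = rng.integers(0, NUM_CLASSES, size=5)
loss = combined_loss(x, y)
```

In the sequential setting the same idea would apply twice, with a second embedding-matching term on the output of the aggregation component.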

