Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zero-Shot Classification and Retrieval of Videos

10/19/2019
by Kranti Kumar Parida, et al.

We present an audio-visual multimodal approach to zero-shot learning (ZSL) for classification and retrieval of videos. ZSL has been studied extensively in the recent past, but work has largely been limited to the visual modality and to images. We demonstrate that both the audio and visual modalities are important for ZSL on videos. Since no dataset is currently available for studying this task, we also construct an appropriate multimodal dataset with 33 classes and 156,416 videos, drawn from an existing large-scale audio event dataset. We show empirically that adding the audio modality improves performance on both zero-shot classification and retrieval when using multimodal extensions of embedding learning methods. We also propose a novel method to predict the `dominant' modality using a jointly learned modality attention network. The attention is learned in a semi-supervised setting and therefore requires no additional explicit labelling of the modalities. We provide qualitative validation of the modality-specific attention, which also generalizes successfully to unseen test classes.
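
The abstract does not spell out the architecture, but as a rough illustration the sketch below shows one way a coordinated joint audio-visual embedding with a learned modality-attention gate could be wired up in PyTorch. All names, feature dimensions (pooled audio/video features, word-vector class embeddings), layer sizes, and the attention-weighted fusion rule are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch (NOT the authors' implementation): joint audio-visual embedding
# with a modality-attention gate for zero-shot classification/retrieval.
# All dimensions and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAVEmbedding(nn.Module):
    def __init__(self, audio_dim=128, video_dim=1024, text_dim=300, embed_dim=256):
        super().__init__()
        # Modality-specific projections into a shared embedding space.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.video_proj = nn.Sequential(nn.Linear(video_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        # Class-semantic projection, e.g. from word vectors of class names.
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Attention network that predicts per-sample weights for the two
        # modalities (i.e. which modality is "dominant").
        self.attn = nn.Sequential(nn.Linear(audio_dim + video_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 2))

    def forward(self, audio_feat, video_feat, class_vec):
        a = F.normalize(self.audio_proj(audio_feat), dim=-1)
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        w = torch.softmax(self.attn(torch.cat([audio_feat, video_feat], dim=-1)), dim=-1)
        # Attention-weighted fusion of the two modality embeddings.
        av = w[:, :1] * a + w[:, 1:] * v
        t = F.normalize(self.text_proj(class_vec), dim=-1)
        return av, t

# Usage: rank (unseen) class embeddings by cosine similarity; shapes are illustrative.
model = JointAVEmbedding()
audio = torch.randn(4, 128)      # e.g. pooled audio features per video
video = torch.randn(4, 1024)     # e.g. pooled CNN video features
classes = torch.randn(10, 300)   # word-vector embeddings of class names
av, t = model(audio, video, classes)
scores = av @ t.t()              # (4, 10) similarity of each video to each class
pred = scores.argmax(dim=1)      # zero-shot class prediction
```

In a semi-supervised variant of this idea, the attention weights would receive an explicit supervision signal only on the subset of examples where the dominant modality is known, with the remaining examples trained through the embedding objective alone.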

research
03/23/2018

Audio-Visual Event Localization in Unconstrained Videos

In this paper, we introduce a novel problem of audio-visual event locali...
research
09/07/2021

Predicting Mood Disorder Symptoms with Remotely Collected Videos Using an Interpretable Multimodal Dynamic Attention Fusion Network

We developed a novel, interpretable multimodal classification method to ...
research
10/20/2022

A Multimodal Sensor Fusion Framework Robust to Missing Modalities for Person Recognition

Utilizing the sensor characteristics of the audio, visible camera, and t...
research
12/09/2014

Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition

We propose a transfer deep learning (TDL) framework that can transfer th...
research
07/23/2020

Sound2Sight: Generating Visual Dynamics from Sound and Context

Learning associations across modalities is critical for robust multimoda...
research
03/02/2021

Listen, Read, and Identify: Multimodal Singing Language Identification of Music

We propose a multimodal singing language classification model that uses ...
research
09/17/2016

GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion

Data generated from real world events are usually temporal and contain m...
