Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals

09/18/2019
by Shah Nawaz, et al.

We propose a novel deep training algorithm for joint representation of audio and visual information, consisting of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal data. The proposed framework characterizes the shared latent space by leveraging class centers, which eliminates the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchmark audio-visual dataset, on a multitude of tasks including cross-modal verification, cross-modal matching, and cross-modal retrieval. State-of-the-art performance is achieved on cross-modal verification and matching, while comparable results are observed on the remaining tasks. Our experiments demonstrate the effectiveness of the technique for cross-modal biometric applications.
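The idea of center-based supervision can be illustrated with a short sketch. The code below is a minimal, hypothetical Python/PyTorch reconstruction, not the authors' released implementation: it assumes pre-extracted audio and face feature vectors of equal dimensionality, a single shared embedding stream, and one learnable center per identity toward which embeddings from both modalities are pulled. The layer sizes, embedding dimension, and the 1,251-identity count (the number of speakers in VoxCeleb1) are illustrative choices. Because supervision comes from class centers rather than sampled pairs or triplets, no pair/triplet mining is needed.

# Illustrative sketch (not the paper's official code): a single-stream
# embedding network shared by both modalities, trained with a
# class-center loss so that audio and face embeddings of the same
# identity are pulled toward one learnable center.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStreamNet(nn.Module):
    def __init__(self, in_dim: int, embed_dim: int = 128):
        super().__init__()
        # Both audio and face feature vectors (assumed pre-extracted and
        # projected to the same input size) pass through this one stream.
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, embed_dim: int = 128):
        super().__init__()
        # One learnable center per identity, shared across modalities.
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Mean squared distance of each embedding to its class center.
        return ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean()

# Usage: embed an audio batch and a face batch that share identity labels
# with the same stream, then minimise the center loss (optionally alongside
# a softmax classification loss).
model, center_loss = SingleStreamNet(in_dim=256), CenterLoss(num_classes=1251)
audio, faces = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 1251, (8,))
emb = model(torch.cat([audio, faces], dim=0))
loss = center_loss(emb, labels.repeat(2))
loss.backward()

At test time, cross-modal verification or matching reduces to comparing a voice embedding and a face embedding in the shared space, e.g. by cosine similarity against a decision threshold.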
