Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval

11/07/2022
by Donghuo Zeng, et al.

The heterogeneity gap is the main challenge in cross-modal retrieval: cross-modal data (e.g., audio and visual) have different distributions and representations, so they cannot be compared directly. To bridge the gap between the audio and visual modalities, we learn a common subspace for them by exploiting the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels. TNN-CCCA is currently the best audio-visual cross-modal retrieval (AV-CMR) model, but its training is sensitive to hard negative samples because it learns the common subspace with a triplet loss that predicts the relative distance between inputs. In this paper, to reduce the interference of hard negative samples in representation learning, we propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measures the intrinsic correlation between audio-visual data using a complete cross-triplet loss. Specifically, our model projects audio-visual features into the label space by minimizing the distance between the predicted label features after feature projection and the ground-truth label representations. Moreover, we adopt the complete cross-triplet loss to optimize the predicted label features by leveraging all possible similarity and dissimilarity relations across modalities. Extensive experiments on two double-checked audio-visual datasets show an improvement of approximately 2.1% over the state-of-the-art method TNN-CCCA on the AV-CMR task, which indicates the effectiveness of our proposed model.
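To make the two objectives described above concrete, here is a minimal sketch in PyTorch. The framework choice, variable names (audio_emb, visual_emb, onehot_labels), the one-hot label representation, and the margin value are assumptions for illustration, not the authors' implementation. The first function pulls the predicted label features of both modalities toward the ground-truth label vectors; the second enumerates all cross-modal (anchor, positive, negative) combinations and applies a margin-based hinge, one plausible reading of the complete cross-triplet loss.

import torch
import torch.nn.functional as F

def label_space_loss(pred_audio_labels, pred_visual_labels, onehot_labels):
    # Pull the predicted label features of both modalities toward the
    # ground-truth label representations (one-hot vectors in this sketch).
    return (F.mse_loss(pred_audio_labels, onehot_labels)
            + F.mse_loss(pred_visual_labels, onehot_labels))

def complete_cross_triplet_loss(audio_emb, visual_emb, labels, margin=1.0):
    # For every audio anchor, treat every same-class visual sample as a
    # positive and every different-class visual sample as a negative,
    # then do the same with visual anchors against audio samples.
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (N, N) class-agreement mask
    dist = torch.cdist(audio_emb, visual_emb)           # (N, N) audio-to-visual distances
    mask = same.unsqueeze(2) & (~same).unsqueeze(1)     # valid (anchor, pos, neg) triplets
    a2v = torch.relu(dist.unsqueeze(2) - dist.unsqueeze(1) + margin)[mask]
    dist_t = dist.t()                                   # visual-to-audio distances
    v2a = torch.relu(dist_t.unsqueeze(2) - dist_t.unsqueeze(1) + margin)[mask]
    hinge = torch.cat([a2v, v2a])
    return hinge.mean() if hinge.numel() > 0 else dist.new_zeros(())

A toy call such as complete_cross_triplet_loss(torch.randn(8, 64), torch.randn(8, 64), torch.randint(0, 3, (8,))) runs as-is; in practice the embeddings would be the predicted label features of paired audio-visual samples, and the two losses would be summed during training.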


Related research

10/26/2021  Learning Explicit and Implicit Latent Common Spaces for Audio-Visual Cross-Modal Retrieval
Learning a common subspace is a prevalent way in cross-modal retrieval to ...

08/10/2019  Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Deep learning has successfully shown excellent performance in learning j...

12/05/2021  Variational Autoencoder with CCA for Audio-Visual Cross-Modal Retrieval
Cross-modal retrieval is to utilize one modality as a query to retrieve ...

02/16/2022  Cross-Modal Common Representation Learning with Triplet Loss Functions
Common representation learning (CRL) learns a shared embedding between t...

03/27/2020  Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text
We present an approach to unsupervised audio representation learning. Ba...

05/07/2023  Cross-Modal Retrieval for Motion and Text via MildTriple Loss
Cross-modal retrieval has become a prominent research topic in computer ...

03/26/2019  Cross-modal subspace learning with Kernel correlation maximization and Discriminative structure preserving
The measure between heterogeneous data is still an open problem. Many re...
