On Metric Learning for Audio-Text Cross-Modal Retrieval

03/29/2022
by Xinhao Mei, et al.

Audio-text retrieval aims to retrieve a target audio clip or caption from a pool of candidates given a query in the other modality. Solving such a cross-modal retrieval task is challenging because it requires not only learning robust feature representations for both modalities, but also capturing the fine-grained alignment between them. Existing cross-modal retrieval models are mostly optimized with metric learning objectives, since both cross-modal retrieval and metric learning attempt to map data to an embedding space where similar data are close together and dissimilar data are far apart. Unlike other cross-modal retrieval tasks such as image-text and video-text retrieval, audio-text retrieval is still an unexplored task. In this work, we study the impact of different metric learning objectives on the audio-text retrieval task. We present an extensive evaluation of popular metric learning objectives on the AudioCaps and Clotho datasets. We demonstrate that the NT-Xent loss, adapted from self-supervised learning, shows stable performance across different datasets and training settings, and outperforms the popular triplet-based losses.
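
To make the two families of objectives named in the abstract concrete, the following is a minimal pure-Python sketch of a symmetric NT-Xent loss and a hinge-based triplet loss over in-batch negatives. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the margin of 0.2, and the choice to sum both retrieval directions are all assumptions made here, and embeddings are plain lists of floats rather than encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(scores, index):
    """Log-probability of `index` under a softmax over `scores` (numerically stable)."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[index] - log_norm


def nt_xent_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric NT-Xent loss; audio_embs[i] is assumed to pair with text_embs[i].

    The temperature value is an illustrative assumption.
    """
    n = len(audio_embs)
    # Temperature-scaled similarity matrix: rows are audio clips, columns are captions.
    sim = [[cosine(a, t) / temperature for t in text_embs] for a in audio_embs]
    # Audio-to-text: each row is a classification over all captions in the batch.
    a2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-audio: each column is a classification over all audio clips.
    t2a = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return a2t + t2a


def triplet_sum_loss(audio_embs, text_embs, margin=0.2):
    """Hinge triplet loss summed over every in-batch negative, in both directions.

    The margin value is an illustrative assumption.
    """
    n = len(audio_embs)
    sim = [[cosine(a, t) for t in text_embs] for a in audio_embs]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Wrong caption j scored against audio clip i.
            total += max(0.0, margin + sim[i][j] - sim[i][i])
            # Wrong audio clip j scored against caption i.
            total += max(0.0, margin + sim[j][i] - sim[i][i])
    return total / n


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("NT-Xent aligned:", nt_xent_loss(audio, aligned_text))
    print("NT-Xent swapped:", nt_xent_loss(audio, swapped_text))
    print("Triplet aligned:", triplet_sum_loss(audio, aligned_text))
    print("Triplet swapped:", triplet_sum_loss(audio, swapped_text))
```

Both functions reward a batch in which each audio clip is most similar to its own caption. The difference is in how negatives contribute: the triplet loss ignores any negative that already sits outside the margin, whereas NT-Xent weighs every negative in the batch through the softmax normalizer.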


research
08/21/2019

Learning Joint Embedding for Cross-Modal Retrieval

A cross-modal retrieval process is to use a query in one modality to obt...
research
03/10/2021

Cross-modal Image Retrieval with Deep Mutual Information Maximization

In this paper, we study the cross-modal image retrieval, where the input...
research
10/06/2022

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

We present an analysis of large-scale pretrained deep learning models us...
research
07/16/2020

Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval

The abundance of multimodal data (e.g. social media posts) has inspired ...
research
02/28/2023

Joint Representations of Text and Knowledge Graphs for Retrieval and Evaluation

A key feature of neural models is that they can produce semantic vector ...
research
07/23/2019

Multisensory Learning Framework for Robot Drumming

The hype about sensorimotor learning is currently reaching high fever, t...
research
03/27/2020

Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text

We present an approach to unsupervised audio representation learning. Ba...
