Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

10/06/2022
by Benno Weck, et al.

We present an analysis of large-scale pretrained deep learning models used for cross-modal (text-to-audio) retrieval. We use embeddings extracted by these models in a metric learning framework to connect matching pairs of audio and text. Shallow neural networks map the embeddings to a common dimensionality. Our system, which is an extension of our submission to the Language-based Audio Retrieval Task of the DCASE Challenge 2022, employs the RoBERTa foundation model as the text embedding extractor. A pretrained PANNs model extracts the audio embeddings. To improve the generalisation of our model, we investigate how pretraining with audio and associated noisy text collected from the online platform Freesound improves the performance of our method. Furthermore, our ablation study reveals that the proper choice of the loss function and fine-tuning the pretrained models are essential in training a competitive retrieval system.
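The metric learning setup described above can be illustrated with a minimal sketch. It assumes a single linear projection per modality and a symmetric contrastive (cross-entropy) loss over cosine similarities. The abstract does not state which loss or layer sizes the authors use, so the loss, the temperature, the embedding dimensions, and all function names here are illustrative assumptions, and random vectors stand in for the RoBERTa and PANNs embeddings.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def project(x, weight, bias):
    """Hypothetical shallow mapping: one linear layer into the shared space."""
    return l2_normalize(x @ weight + bias)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(audio, text, temperature=0.07):
    """Assumed loss: row i of `audio` and row i of `text` form a matching pair;
    every other row in the batch serves as a negative."""
    logits = text @ audio.T / temperature
    idx = np.arange(len(logits))
    text_to_audio = -log_softmax(logits)[idx, idx].mean()
    audio_to_text = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (text_to_audio + audio_to_text)


def rank_audio(text_query, audio):
    """Return audio indices ordered from most to least similar to the query."""
    return np.argsort(-(audio @ text_query))


# Placeholder data: random vectors instead of real pretrained embeddings.
rng = np.random.default_rng(0)
audio_dim, text_dim, shared_dim = 32, 24, 8  # illustrative sizes only
audio_raw = rng.normal(size=(4, audio_dim))
text_raw = rng.normal(size=(4, text_dim))

audio_emb = project(audio_raw, rng.normal(size=(audio_dim, shared_dim)), np.zeros(shared_dim))
text_emb = project(text_raw, rng.normal(size=(text_dim, shared_dim)), np.zeros(shared_dim))

loss = symmetric_contrastive_loss(audio_emb, text_emb)
ranking = rank_audio(text_emb[0], audio_emb)
```

In training, the projection weights (and optionally the pretrained extractors, if fine-tuned) would be updated to reduce this loss, which pulls matching audio and text pairs together in the shared space. At query time, `rank_audio` orders the audio collection by similarity to a text query.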

