Audio-text Retrieval in Context

03/25/2022
by Siyu Lou, et al.

Audio-text retrieval based on natural language descriptions is a challenging task. It involves learning cross-modality alignments between long sequences under inadequate data conditions. In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment. Moreover, through a qualitative analysis we observe that semantic mapping is more important than temporal relations in contextual retrieval. Using pre-trained audio features and a descriptor-based aggregation method, we build our contextual audio-text retrieval system. Specifically, we utilize PANNs features pre-trained on a large sound event dataset and NetRVLAD pooling, which directly works with averaged descriptors. Experiments are conducted on the AudioCaps and CLOTHO datasets, and results are compared with the previous state-of-the-art system. With our proposed system, a significant improvement has been achieved on bidirectional audio-text retrieval, on all metrics including recall, median and mean rank.
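
The evaluation metrics named above (recall, median rank and mean rank) can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes a hypothetical similarity matrix `sim` in which `sim[i][j]` scores query `i` against candidate `j`, and it assumes the correct match for query `i` is candidate `i`. That one-to-one pairing is a simplification for illustration; datasets with several captions per clip need a different ground-truth mapping. The function name `retrieval_metrics` and the cut-offs in `ks` are likewise illustrative choices.

```python
from statistics import median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute recall@k, median rank and mean rank.

    Assumption: sim[i][j] is the similarity of query i to candidate j,
    and the ground-truth match for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank is 1 plus the number of candidates scoring strictly higher
        # than the true match.
        ranks.append(1 + sum(1 for score in row if score > target))

    n = len(ranks)
    metrics = {f"R@{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["median_rank"] = median(ranks)
    metrics["mean_rank"] = sum(ranks) / n
    return metrics


def transpose(sim):
    """Swap query and candidate roles to evaluate the other direction."""
    return [list(col) for col in zip(*sim)]


# Toy scores: rows are text queries, columns are audio clips (assumed data).
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
text_to_audio = retrieval_metrics(sim, ks=(1, 2))
audio_to_text = retrieval_metrics(transpose(sim), ks=(1, 2))
```

Calling the function on the matrix and on its transpose gives the two directions of bidirectional retrieval. Higher recall is better, while lower median and mean rank are better.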

