Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

03/10/2023
by Yifei Xin, et al.

In text-audio retrieval (TAR), text and audio are heterogeneous in content, so the semantic information in a text is typically similar to only certain frames of the audio. Yet existing works aggregate the entire audio without considering the text, for example by mean-pooling over the frames, which is likely to encode misleading audio information not described in the given text. In this paper, we present a text-aware attention pooling (TAP) module for TAR, which is essentially a scaled dot-product attention that lets a text attend to its most semantically similar frames. Furthermore, previous methods apply the softmax only to each single-side retrieval, ignoring potential cross-retrieval information. By exploring the intrinsic prior of each text-audio pair, we introduce a prior matrix revised (PMR) loss to filter out hard cases with high (or low) text-to-audio but low (or high) audio-to-text similarity scores, thus achieving the dual optimal match. Experiments show that our TAP significantly outperforms various text-agnostic pooling functions. Moreover, our PMR loss yields stable performance gains on multiple datasets.
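The contrast between text-agnostic mean-pooling and text-aware pooling can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a single attention head with no learned projections, uses the text embedding as the query and the raw frame embeddings as both keys and values, and the function names (`mean_pool`, `text_aware_pool`) and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Text-agnostic baseline: average all frame embeddings."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def text_aware_pool(text_vec, frames):
    """Scaled dot-product attention with the text as the query.

    Assumption for illustration: frames act as keys and values directly,
    with no learned projection matrices.
    """
    d = len(text_vec)
    # Similarity of the text to every frame, scaled by sqrt(d).
    scores = [
        sum(t * f for t, f in zip(text_vec, frame)) / math.sqrt(d)
        for frame in frames
    ]
    weights = softmax(scores)
    # Weighted sum of frames: frames similar to the text dominate.
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(d)
    ]
    return pooled, weights


# Toy example: only the first frame matches the text.
text_vec = [1.0, 0.0]
frames = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]

pooled, weights = text_aware_pool(text_vec, frames)
baseline = mean_pool(frames)
```

In this toy setting the attention weights concentrate on the first frame, so the pooled audio embedding stays close to the content the text describes, whereas the mean-pooled embedding is dominated by the two unrelated frames.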

research
03/28/2022

X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval

In text-video retrieval, the objective is to learn a cross-modal similar...
research
09/09/2021

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Employing large-scale pre-trained model CLIP to conduct video-text retri...
research
03/25/2022

Audio-text Retrieval in Context

Audio-text retrieval based on natural language descriptions is a challen...
research
11/08/2022

On Negative Sampling for Contrastive Audio-Text Retrieval

This paper investigates negative sampling for contrastive learning in th...
research
07/28/2023

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

Most existing audio-text retrieval (ATR) methods focus on constructing c...
research
06/29/2022

Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss

In this paper, we tackle the new Language-Based Audio Retrieval task pro...
research
06/15/2022

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Mispronunciation detection and diagnosis (MDD) technology is a key compo...
