Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions

07/28/2023
by   Yifei Xin, et al.
0

Most existing audio-text retrieval (ATR) methods focus on constructing contrastive pairs between whole audio clips and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., short segments and phrases or frames and words. In this paper, we introduce a hierarchical cross-modal interaction (HCI) method for ATR by simultaneously exploring clip-sentence, segment-phrase, and frame-word relationships, achieving a comprehensive multi-modal semantic comparison. Besides, we also present a novel ATR framework that leverages auxiliary captions (AC) generated by a pretrained captioner to perform feature interaction between audio and generated captions, which yields enhanced audio representations and is complementary to the original ATR matching branch. The audio and generated captions can also form new audio-text pairs as data augmentation for training. Experiments show that our HCI significantly improves the ATR performance. Moreover, our AC framework also shows stable performance gains on multiple datasets.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/07/2022

HunYuan_tvr for Text-Video Retrivial

Text-Video Retrieval plays an important role in multi-modal understandin...
research
11/21/2022

TimbreCLIP: Connecting Timbre to Text and Images

We present work in progress on TimbreCLIP, an audio-text cross modal emb...
research
09/16/2023

Contrastive Latent Space Reconstruction Learning for Audio-Text Retrieval

Cross-modal retrieval (CMR) has been extensively applied in various doma...
research
04/05/2023

Calibrating Cross-modal Feature for Text-Based Person Searching

We present a novel and effective method calibrating cross-modal features...
research
10/06/2022

Matching Text and Audio Embeddings: Exploring Transfer-learning Strategies for Language-based Audio Retrieval

We present an analysis of large-scale pretrained deep learning models us...
research
03/10/2023

Improving Text-Audio Retrieval by Text-aware Attention Pooling and Prior Matrix Revised Loss

In text-audio retrieval (TAR) tasks, due to the heterogeneity of content...
research
06/27/2021

Query-graph with Cross-gating Attention Model for Text-to-Audio Grounding

In this paper, we address the text-to-audio grounding issue, namely, gro...

Please sign up or login with your details

Forgot password? Click here to reset