Query-graph with Cross-gating Attention Model for Text-to-Audio Grounding

06/27/2021
by Haoyu Tang, et al.

In this paper, we address the text-to-audio grounding problem, namely, grounding the segments of the sound event described by a natural language query in an untrimmed audio clip. This is a newly proposed but challenging audio-language task, since it requires not only precisely localizing all the onsets and offsets of the desired segments in the audio, but also performing comprehensive acoustic and linguistic understanding and reasoning about the multimodal interactions between the audio and the query. The existing method treats the query holistically as a single unit via a global query representation, which fails to highlight the keywords that carry rich semantics. Moreover, it does not fully exploit the interactions between the query and the audio. In addition, since audio clips and queries are of arbitrary and variable length, many uninformative parts of them are not filtered out, which hinders the grounding of the desired segments. To this end, we propose a novel Query Graph with Cross-gating Attention (QGCA) model, which captures the comprehensive relations between the words in the query through a novel query graph. To model the fine-grained interactions between the audio and the query, a cross-modal attention module that assigns higher weights to the keywords is introduced to generate snippet-specific query representations. Finally, we design a cross-gating module to emphasize the crucial parts of the audio and query while weakening the irrelevant ones. We extensively evaluate the proposed QGCA model on the public AudioGrounding dataset and obtain significant improvements over several state-of-the-art methods. A further ablation study confirms the consistent effectiveness of the different modules in QGCA.
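As a rough illustration of the two interaction modules named in the abstract, below is a minimal sketch of how a cross-modal attention step could build snippet-specific query representations and how a cross-gating step could then re-weight both modalities. The class name CrossGatingAttention, the layer shapes, and the exact gating formula are assumptions made for this sketch only; it is not the paper's actual QGCA implementation.

```python
# Minimal sketch of (1) cross-modal attention that builds a snippet-specific
# query representation and (2) cross-gating that re-weights audio and query
# features with each other. Names, sizes, and the gating formula are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossGatingAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.attn_proj = nn.Linear(dim, dim)   # projects audio snippets before scoring words
        self.audio_gate = nn.Linear(dim, dim)  # gate for audio, conditioned on the query
        self.query_gate = nn.Linear(dim, dim)  # gate for the query, conditioned on audio

    def forward(self, audio, query, query_mask):
        # audio: (B, T, D) snippet features; query: (B, L, D) word features
        # query_mask: (B, L), 1 for real words, 0 for padding
        scores = torch.bmm(self.attn_proj(audio), query.transpose(1, 2))      # (B, T, L)
        scores = scores.masked_fill(query_mask.unsqueeze(1) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)                                      # word weights per snippet
        snippet_query = torch.bmm(attn, query)                                # (B, T, D)

        # Cross-gating: each modality is scaled by a sigmoid gate computed from
        # the other, emphasizing relevant snippets/words and suppressing the rest.
        gated_audio = audio * torch.sigmoid(self.audio_gate(snippet_query))
        gated_query = snippet_query * torch.sigmoid(self.query_gate(audio))
        return gated_audio, gated_query


if __name__ == "__main__":
    B, T, L, D = 2, 50, 8, 256
    module = CrossGatingAttention(D)
    audio = torch.randn(B, T, D)
    query = torch.randn(B, L, D)
    mask = torch.ones(B, L)
    a, q = module(audio, query, mask)
    print(a.shape, q.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 50, 256])
```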

Related research

Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events (02/23/2021)
Automated Audio Captioning is a cross-modal task, generating natural lan...

Play It Back: Iterative Attention for Audio Recognition (10/20/2022)
A key function of auditory cognition is the association of characteristi...

Unpaired Referring Expression Grounding via Bidirectional Cross-Modal Matching (01/18/2022)
Referring expression grounding is an important and challenging task in c...

AttnGrounder: Talking to Cars with Attention (09/11/2020)
We propose Attention Grounder (AttnGrounder), a single-stage end-to-end ...

Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity (10/03/2022)
Automatic Audio Captioning (AAC) refers to the task of translating an au...

Improving Audio-Text Retrieval via Hierarchical Cross-Modal Interaction and Auxiliary Captions (07/28/2023)
Most existing audio-text retrieval (ATR) methods focus on constructing c...

Learning Sample Importance for Cross-Scenario Video Temporal Grounding (01/08/2022)
The task of temporal grounding aims to locate video moment in an untrimm...
