iQuery: Instruments as Queries for Audio-Visual Sound Separation

12/07/2022
by Jiaben Chen, et al.

Current audio-visual separation methods share a standard architecture design in which an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We reformulate the visual-sound separation task and propose Instrument as Query (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference in the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that iQuery improves audio-visual sound source separation performance.
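The query expansion idea can be illustrated with a toy sketch. This is not the authors' implementation: it assumes one vector query per instrument, plain dot-product cross-attention over audio features with no learned projections, and it leaves out self-attention among queries and the decoding of masks or waveforms. The names `cross_attend`, `features`, and `queries`, and the instrument labels, are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Let each instrument query attend over the audio feature vectors.

    queries:  list of d-dimensional vectors, one per instrument (assumed).
    features: list of d-dimensional audio feature vectors.
    Returns one attended d-dimensional vector per query.
    """
    dim = len(features[0])
    outputs = []
    for query in queries:
        # Scaled dot-product score between this query and every feature.
        scores = [
            sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
            for feat in features
        ]
        weights = softmax(scores)
        # Weighted sum of the features: the query's view of the mixture.
        outputs.append(
            [sum(w * feat[k] for w, feat in zip(weights, features))
             for k in range(dim)]
        )
    return outputs


rng = random.Random(0)
d, num_frames = 4, 6
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_frames)]

# Hypothetical queries for two known instruments.
queries = {
    "guitar": [rng.gauss(0, 1) for _ in range(d)],
    "violin": [rng.gauss(0, 1) for _ in range(d)],
}
base = cross_attend(list(queries.values()), features)

# Query expansion: add one query for a new class. The attention function
# is untouched, which stands in for "freezing the attention mechanism".
queries["flute"] = [rng.gauss(0, 1) for _ in range(d)]
expanded = cross_attend(list(queries.values()), features)

# In this simplified setting the existing queries' outputs are unchanged.
print(expanded[:2] == base)
```

In this simplified setting each query is processed independently, so adding a query for a new class cannot disturb the outputs of the existing ones. A decoder that also applies self-attention among queries would couple them, so that property should not be assumed to carry over exactly.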


