
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

10/26/2021
by Tanzila Rahman, et al.
The University of British Columbia

The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored their use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT, a transformer-based architecture inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention. The use of pose keypoints is inspired by recent works illustrating that such representations can significantly boost performance in many audio-visual scenarios where often one or more persons are responsible for the sound explicitly (e.g., talking) or implicitly (e.g., sound produced by a human manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak supervision to allow granular cross-modal interactions for the visual and pose modalities. Further, we supplement learning with a sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as on other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval, by as much as 66.7%.
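The central mechanism here is co-attention: one modality's tokens form the queries, while the other modalities supply the keys and values. The following is a minimal single-head NumPy sketch of that idea for three streams; it is not the authors' implementation (the learned projection matrices, multi-head structure, and BERT-style transformer layers are omitted), and all names, shapes, and the choice to concatenate the two non-query streams are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return weights @ V

def tri_coattention(vision, pose, audio):
    """Each modality's tokens query the concatenated tokens of the other two
    modalities (an illustrative stand-in for learned co-attention layers)."""
    streams = {"vision": vision, "pose": pose, "audio": audio}
    fused = {}
    for name, tokens in streams.items():
        others = np.concatenate(
            [t for n, t in streams.items() if n != name], axis=0)
        fused[name] = scaled_dot_attention(tokens, others, others)
    return fused

# Toy token sequences, shaped (num_tokens, embedding_dim).
vision = rng.normal(size=(4, 16))   # e.g., attended visual tokens
pose = rng.normal(size=(3, 16))     # e.g., keypoint tokens
audio = rng.normal(size=(5, 16))    # e.g., spectrogram-derived tokens
fused = tri_coattention(vision, pose, audio)
print({name: out.shape for name, out in fused.items()})
```

Each output sequence keeps its own token count and dimension, but every token is now a convex combination of tokens from the other two modalities, which is what lets the three streams contextualize one another.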


Related Research

- iQuery: Instruments as Queries for Audio-Visual Sound Separation (12/07/2022)
- Audio-visual speech separation based on joint feature representation with cross-modal attention (03/05/2022)
- AIMusicGuru: Music Assisted Human Pose Correction (03/24/2022)
- Curriculum Audiovisual Learning (01/26/2020)
- Deep Audio-Visual Learning: A Survey (01/14/2020)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation (07/20/2020)
- AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation (05/03/2023)

Code Repositories

TriBERT

Code Release for the paper "TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation"

