TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation

10/26/2021
by Tanzila Rahman, et al.

The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored such architectures for audio-visual modalities, and none, to our knowledge, in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT, a transformer-based architecture inspired by ViLBERT that enables contextual feature learning across three modalities (vision, pose, and audio) through flexible co-attention. The use of pose keypoints is inspired by recent works showing that such representations can significantly boost performance in many audio-visual scenarios where one or more persons are responsible for the sound, either explicitly (e.g., talking) or implicitly (e.g., sound produced by a person manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak supervision to allow granular cross-modal interactions for the visual and pose modalities. Further, we supplement learning with a sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset, as well as on other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks, such as cross-modal audio-visual-pose retrieval, by as much as 66.7%.
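The abstract's central mechanism is co-attention across three token streams. The sketch below is a minimal PyTorch illustration of how such a tri-modal co-attention block could be structured, with each stream using its own tokens as queries and the concatenated tokens of the other two streams as keys and values, extending ViLBERT's two-stream co-attention to three modalities. It is not the authors' implementation: the module name (TriCoAttentionLayer), the modality keys, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TriCoAttentionLayer(nn.Module):
    """Illustrative tri-modal co-attention block (names are hypothetical).

    Each stream (vision, pose, audio) attends with its own hidden states
    as queries over the concatenated hidden states of the other two
    streams, so every modality is contextualized by the remaining two.
    """

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        modalities = ("vision", "pose", "audio")
        self.attn = nn.ModuleDict({
            m: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for m in modalities
        })
        self.norm = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in modalities})

    def forward(self, feats):
        # feats: dict of (batch, tokens_m, d_model) tensors keyed by modality
        out = {}
        for m, x in feats.items():
            # Keys/values come from the two remaining streams.
            others = torch.cat([v for k, v in feats.items() if k != m], dim=1)
            ctx, _ = self.attn[m](query=x, key=others, value=others)
            out[m] = self.norm[m](x + ctx)  # residual connection + layer norm
        return out

# Usage: three token streams of different lengths share one embedding size.
layer = TriCoAttentionLayer(d_model=512)
feats = {
    "vision": torch.randn(2, 49, 512),  # e.g., spatially attended visual tokens
    "pose":   torch.randn(2, 17, 512),  # e.g., keypoint tokens
    "audio":  torch.randn(2, 64, 512),  # e.g., spectrogram patches
}
fused = layer(feats)
```

In the full model, blocks of this kind would presumably be stacked and interleaved with per-stream transformer layers, with the audio stream's output feeding the sound-source separation head; the sketch only shows the cross-stream attention pattern.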
