AdVerb: Visually Guided Audio Dereverberation

08/23/2023
by Sanjoy Chowdhury, et al.

We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach additionally exploits the complementary visual modality. Given an image of the environment in which the reverberant sound signal was recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationships to generate a complex ideal ratio mask, which, when applied to the reverberant audio, predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18 [...] satisfactory RT60 error scores on the AVSpeech dataset.
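To make the masking step concrete, the following is a minimal sketch of how a complex ratio mask acts on a reverberant spectrogram. It is not the AdVerb model: the mask here is an oracle computed from a known clean signal, whereas the abstract describes a transformer that predicts the mask from the reverberant audio and an image. The helper names (`stft`, `complex_ideal_ratio_mask`, `apply_mask`), the STFT parameters, and the toy exponentially decaying impulse response are all assumptions chosen for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=128):
    """Naive STFT with a Hann window; returns a (frames, bins) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def complex_ideal_ratio_mask(clean_spec, reverb_spec, eps=1e-8):
    """Oracle mask M = S / Y, computed stably as S * conj(Y) / (|Y|^2 + eps)."""
    return clean_spec * np.conj(reverb_spec) / (np.abs(reverb_spec) ** 2 + eps)

def apply_mask(mask, reverb_spec):
    """Element-wise complex multiplication of the mask with the spectrogram."""
    return mask * reverb_spec

# Toy data (assumption): white noise stands in for speech, and a decaying
# noise burst stands in for a room impulse response.
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800)
rir[0] = 1.0  # direct path
reverb = np.convolve(clean, rir)[: len(clean)]

S = stft(clean)    # clean spectrogram
Y = stft(reverb)   # reverberant spectrogram
M = complex_ideal_ratio_mask(S, Y)
S_hat = apply_mask(M, Y)

# Relative reconstruction error of the masked spectrogram against the clean one.
rel_err = np.linalg.norm(S_hat - S) / np.linalg.norm(S)
```

Because the mask is complex-valued, it corrects both magnitude and phase in each time-frequency bin, which is why applying the oracle mask recovers the clean spectrogram almost exactly. A learned model would only approximate `M`, so its reconstruction error would be larger.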


research
02/11/2021

A Multi-View Approach To Audio-Visual Speaker Verification

Although speaker verification has conventionally been an audio-only task...
research
02/14/2022

Visual Acoustic Matching

We introduce the visual acoustic matching task, in which an audio clip i...
research
11/09/2021

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-vi...
research
03/31/2022

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis

Since facial actions such as lip movements contain significant informati...
research
08/10/2021

Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention

Binaural audio gives the listener the feeling of being in the recording ...
research
08/18/2023

V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundat...
research
09/06/2021

Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds

Humans can robustly recognize and localize objects by using visual and/o...
