Audio reverberation occurs when multiple reflections from surfaces and objects in the environment build up then decay, altering the original audio signal. While reverberation bestows a realistic sense of spatial context, it also can degrade a listener’s experience. In particular, the quality of human speech is greatly affected by reverberant environments—as illustrated by how difficult it can be to parse the words of a family member speaking loudly from another room in the house, a tour guide describing the artwork down the hall of a magnificent cavernous cathedral, or a colleague participating in a Zoom call from a cafe. Consistent with the human perceptual experience, automatic speech recognition (ASR) systems noticeably suffer when given reverberant speech input reverb-challenge; zhao_monaural_2020; szoke_2019; han_learning_2015; wu_end--end_2017; ernst_speech_2019. Thus there is great need for intelligent dereverberation algorithms that can strip away reverb effects for speech enhancement, recognition, and other downstream tasks, which could in turn benefit many applications in teleconferencing, assistive hearing devices, augmented reality, and video indexing.
The audio community has made steady progress devising machine learning solutions to tackle speech dereverberation ernst_speech_2019; giri_improving_2015; xvector; han_learning_2015; wu2016; zhao2019; su_hifi-gan_2020. The general approach is to take a reverberant speech signal, usually represented with a Short-Time Fourier Transform (STFT) spectrogram, and feed it as input to a model that estimates a clean version of the signal with the reverberation removed. Past approaches have tackled this problem with signal processing and statistical techniques wpe; naylor_gaubitch_2010, while many modern approaches are based on neural networks that learn a mapping from reverberant to clean spectrograms han_learning_2015; ernst_speech_2019; fu_metricgan_2019. To our knowledge, all existing models for dereverberation rely purely on audio. Unfortunately, this often underconstrains the dereverberation task, since the latent parameters of the recording space are not discernible from the audio alone.
However, we observe that in many practical settings of interest—video conferencing, augmented reality, Web video indexing—reverberant audio is naturally accompanied by a visual (video) stream. Importantly, the visual stream offers valuable cues about the room acoustics affecting reverberation: where are the walls, how are they shaped, where is the human speaker, what is the layout of major furniture, what are the room’s dominant materials (which affect absorption), and even what is the facial appearance and/or body shape of the person speaking (since body shape determines the acoustic properties of a person’s speech, and reverb time is frequency dependent). For example, reverb is typically stronger when the speaker is further away; speech is more reverberant in a large church or hallway; heavy carpet absorbs more sound. See Figure 1.
Our idea is to learn to dereverberate speech from audio-visual observations. In this task, the input is reverberant speech and visual observations of the environment surrounding the human speaker, and the output is a prediction of the clean source audio. To tackle this problem, there are two key technical challenges. First, how to model the multi-modal dereverberation process in order to infer the latent clean audio. Second, how to secure appropriate training data spanning a variety of physical environments for which we can sample speech with known ground truth (non-reverberant, anechoic) audio. The latter is also non-trivial because ordinary audio/video recordings are themselves corrupted by reverberation but lack the ground truth source signal we wish to recover.
For the modeling challenge, we introduce an end-to-end approach called Visually-Informed Dereverberation of Audio (VIDA). VIDA consists of a Visual Acoustics Network (VAN) that learns reverberation properties of the room geometry, object locations, and speaker position. Coupled with a multi-modal UNet dereverberation module, it learns to remove the reverberations from a single-channel audio stream. In addition, we propose an audio-visual (AV) matching loss to enforce consistency between the visually-inferred reverberation features and those inferred from the audio signal. We leverage the outputs of our model for multiple downstream tasks: speech enhancement, speech recognition, and speaker identification.
Next, to address the training data challenge, we develop a new large-scale audio-visual dataset using SoundSpaces chen_soundspaces_2020, a 3D simulator for real-world scanned environments that allows both visual and acoustic rendering. Our data approach inserts “clean” audio voices together with a 3D humanoid model at various positions within an array of indoor environments, then samples the images and properly reverberating audio when placing the receiver microphone and camera at other positions in the same house. This strategy allows sampling realistic audio-visual instances coupled with ground truth raw audio to train our model, and it has the added benefit of allowing controlled studies that vary the parameters of the capture setting. As we will show, the data also supports sim2real transfer for applying our model to real audio-visual observations.
Our main contributions are to 1) present the task of audio-visual dereverberation, 2) address it with a new multi-modal modeling approach and a novel reverb-visual matching loss, 3) provide a benchmark evaluation framework built on SoundSpaces as well as real data, and 4) demonstrate the utility of AV dereverberation for multiple practical tasks. We first train and evaluate our model on 82 large-scale real-world environments (each a multi-room home containing a variety of objects) coupled with speech samples from the LibriSpeech dataset librispeech. We consider both near-field and far-field settings where the human speaker is in-view or quite far from the camera, respectively. The proposed model outperforms methods restricted to the audio stream, and improves the state of the art for multiple tasks with speech enhancement. We also show that our model trained in simulation can transfer directly to real-world data. Overall, our work shows the potential for speech enhancement models to leverage room acoustics by seeing the 3D environment.
2 Related Work
Audio dereverberation and speech enhancement.
Audio dereverberation and speech enhancement have a long and rich literature neely_allen_1979; miyoshi_kaneda_1988; naylor_gaubitch_2010; reverb-challenge; benetsy_speech_enhancement. While dereverberation can be done with microphone arrays, we focus on single audio channel approaches, which require fewer assumptions about the input data. Recent deep learning methods achieve promising results to dereverberate han_learning_2015; wu2016; zhao2019; ernst_speech_2019; zhao_monaural_2020; su_hifi-gan_2020, denoise vondrick-denoising-speech-nips2019; fu_metricgan_2019; su_hifi-gan_2020, or separate hershey2016deep; stoller2018adversarial the audio stream using audio input alone, and such enhancements can improve downstream speech recognition ko2017study; reverb-challenge and speaker recognition xvector. Acoustic simulations can provide data augmentation during training reverb-challenge; han_learning_2015; ko2017study; zhao_monaural_2020. Accounting for environmental effects on reverb, some work targets “room-aware” deep audio features capturing reverberation properties (e.g., RT60) giri_improving_2015, or injects reverberation effects from a different room via acoustic matching su2020acoustic. To our knowledge, the only prior work drawing on the visual stream to infer dereverberated audio is limited to using lip regions on near-field faces to first separate out distractor sounds tan_audio-visual_2020, and does not model anything about the visual scene for dereverberation purposes. In contrast, our model accounts for the full visual scene, far-field speech sources, and even out-of-view speakers. Our approach is the first to learn visual room acoustics for dereverberation, and it yields state-of-the-art results with direct benefits for multiple downstream tasks.
Visual understanding of room acoustics.
The room impulse response (RIR) is the transfer function capturing the room acoustics for arbitrary source stimuli; once convolved with a sound waveform, it produces the sound of that source in the context of the particular physical space. While RIRs are traditionally measured with specialized equipment in the room itself stan2002comparison; Holters_rir or else simulated with sound propagation models (e.g., geometric image_method; chen_soundspaces_2020 or wave-based Murphy_waveguide), recent work explores estimating an RIR from an input image using CNNs kon_estimation_2019 or conditional GANs singh_image2reverb_2021 in order to simulate reverberant sound for a given environment. Video-based methods have also explored ways to lift monaural audio into its spatialized (binaural, ambisonic) counterpart in order to create an immersive audio experience for human listeners morgado-2018; 25d-visual-sound; scene-aware-360. Such methods share our interest in learning visual properties of a scene that influence the audio channel. However, unlike any of the above methods, rather than generate spatialized audio to benefit human listeners in augmented or virtual reality, our goal is to dereverberate audio (removing the effects of the room acoustics) to benefit automatic speech analysis. In addition, prior methods use imagery taken at camera positions at an unknown offset from the microphone, i.e., conflating all RIRs for a scene with one image, which limits them to a coarse characterization of the environment jeub_binaural; jeub_air; open_air. In contrast, our data and model align the camera and microphone to capture novel fine-grained audio-visual properties, including the human speaker’s location with respect to the microphone when the speaker is in view.
Audio-visual navigation and realistic simulations.
Recent work in embodied AI explores how vision and sound together can help agents move intelligently in 3D environments. Driven in part by new tools for audio-visual (AV) simulations in realistic scanned environments chen_soundspaces_2020, new research develops deep reinforcement learning approaches to train agents to navigate to sounding objects chen_soundspaces_2020; gan2019look; chen_savi_2021; chen_waypoints_2020, explore unmapped environments dean-curious-nips2020, or move around to better separate multiple overlapping sounds in a house majumder2021move2hear. Our work also leverages state-of-the-art AV simulations for learning, but our objective and models are entirely different. Rather than train virtual robots to move intelligently, our aim is to clean reverberant audio for better speech analysis.
Audio-visual learning from video.
Multi-modal video understanding has experienced a resurgence of work in the vision, audio, and machine learning literature in recent years. This includes exciting advances in self-supervised cross-modal feature learning from video morgado-spatial-nips2020; lorenzo-nips2020; korbar-nips2018; visual-echoes, localizing objects in video with both sight and sound hu-localize-nips2020, and audio-visual source separation ephrat2018looking; owens2018audio; gao2018objectSounds; zhao2019som; Afouras20audio-visual-objects; visual-voice. None of these methods addresses speech dereverberation.
3 The Audio-Visual Dereverberation Task
We introduce the novel task of audio-visual dereverberation. In this task, a speaker (or other sound source) and a listener are situated in a 3D environment, such as the interior of a house. The speaker, whose location is unknown to the listener, produces a speech waveform $A_s$. A superposition of the direct sound and the reverb is captured by the listener, denoted $A_r$. The reverberant speech $A_r$ can be modeled as the convolution of the anechoic source waveform $A_s$ with the room impulse response (RIR) $R$, i.e., $A_r(t) = A_s(t) * R(t)$ neely_allen_1979. $R$ is a function of the environment’s geometry, the materials that make up the environment, and the relative positioning of the speaker and the listener. It is possible in principle to measure the RIR for a real-world environment, but doing so can be impractical when the source and listener are able to move around or must cope with different environments. Furthermore, in the common scenario where we want to process video captured in environments to which we have no physical access, measuring the RIR is simply impossible.
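As an illustration of this convolution model, the following minimal sketch applies a toy impulse response to a short signal. The `toy_rir` helper, its decay rate, and the click input are hypothetical choices made for the example; they are not measured or simulated RIRs from this work.

```python
import math
import random


def convolve(source, rir):
    """Full linear convolution of two 1-D sequences (lists of floats)."""
    out = [0.0] * (len(source) + len(rir) - 1)
    for i, s in enumerate(source):
        if s == 0.0:
            continue  # silent samples contribute nothing
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def toy_rir(length, decay, seed=0):
    """Hypothetical RIR: a unit direct path followed by decaying random reflections."""
    rng = random.Random(seed)
    rir = [0.3 * rng.uniform(-1.0, 1.0) * math.exp(-decay * n) for n in range(length)]
    rir[0] = 1.0  # direct sound arrives first in this toy setup
    return rir


# A single click as the anechoic source: the reverberant output is the RIR itself.
click = [1.0, 0.0, 0.0, 0.0]
rir = toy_rir(length=16, decay=0.3)
reverberant = convolve(click, rir)
```

Because the source here is a single click, the reverberant output reproduces the impulse response, which is what "impulse response" means. Practical pipelines would typically use an FFT-based convolution for long waveforms.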
Crucially for our task, we consider an alternative source of information about the environment: vision. We assume the listener has an RGB-D observation of its surroundings, obtained from an RGB-D camera or an RGB camera coupled with single-image depth estimation Wichern_depth; monodepth2. Intuitively, we should be able to leverage the information about the environment’s geometry and material composition that is implicit in the visual stream, as well as the location of the speaker (if visible), to estimate its reverberant characteristics. We anticipate that these cues can inform an estimate of the room acoustics, and thus the clean source waveform. Given the RGB image $I_r$ and depth image $I_d$ captured by the listener from its current vantage point, the task is to predict the source waveform $A_s$ from the images and the reverberant audio $A_r$: $\hat{A}_s(t) = f(I_r, I_d, A_r(t))$. This setting represents common real-world scenarios previously discussed, and poses new challenges for speech enhancement and recognition.
4 Dataset Curation
For the proposed task, obtaining the right training data is itself a challenge. Existing video data contains reverberant audio but lacks the ground truth anechoic audio signal, and existing RIR datasets jeub_binaural; jeub_air; open_air do not have images paired with the microphone position. We introduce both real and simulated datasets to enable reproducible research on audio-visual dereverberation.
3D environments and audio simulator.
First, we introduce a large-scale dataset in which we couple real-world visual environments with state-of-the-art audio simulations accurately capturing the environments’ spatial effects on real samples of recorded speech. We want our dataset to allow control of a variety of physical environments, the positions of the listener/camera and sources, and the speech content of the sources, all while maintaining both the observed reverberant and ground truth anechoic sounds. To this end, we leverage the audio-visual simulator SoundSpaces chen_soundspaces_2020, which provides precomputed RIRs on a uniform grid of resolution 1 m for the real-world environment scans in Replica straub2019replica and Matterport3D Matterport3D. We use 82 Matterport environments due to their greater scale and complexity; each environment has multiple rooms spanning on average 517 m².
Reverberant speech in 3D visual environments.
We extend SoundSpaces to construct reverberant speech. As the source speech corpus we use LibriSpeech librispeech, which contains 1,000 hours of 16 kHz read English speech from audio books, and is widely used in the speech recognition literature. We train our models with the train-clean-360 split, and use the dev-clean and test-clean sets for validation and test splits, respectively. Note that these splits have non-overlapping speaker identities. Similarly, we use the standard disjoint train/val/test splits for the Matterport3D visual environments chen_soundspaces_2020. Thus, neither the houses nor speaker voices observed at test time are ever observed during training.
For each source utterance, we randomly sample a source-receiver location pair in a random environment, then convolve the speech waveform with the associated SoundSpaces RIR to obtain the reverberant waveform $A_r$. To augment the visual stream, we insert a 3D humanoid of the same gender as the real speaker at the speaker location and render RGB and depth images at the listener location. We consider two types of image: panorama and normal field of view (FoV). For the panorama image, we stitch 18 images, each having a horizontal FoV of 20 degrees, to cover the full 360 degrees. For the normal FoV, we render images with an 80-degree FoV. While the panorama gives a fuller view of the environment and thus should allow the model to better estimate the room acoustics, the normal FoV is more common in existing video and thus will facilitate our model’s transfer to real data. See Fig. 2. We generate 49,430/2,700/2,600 such samples for the train/val/test splits, respectively. See Supp. materials for examples and details.
Real data collection.
To explore whether models trained in simulation can also work in the real world, we also collect a set of real images and speech recordings while preserving the ground truth anechoic audio.
To collect image data, we use an iPhone 11 camera to capture a panoramic RGB image and a monocular depth estimation algorithm monodepth2 to generate the corresponding depth image.
To record the audio, we use a ZYLIA ZM-1 microphone. We place both the camera and microphone at the same height (1.5 m) as the SoundSpaces RIRs. For the source speech, we play utterances from the LibriSpeech test set through a loudspeaker held by a person facing the camera. We collect data from varying environments, including auditoriums, meeting rooms, atriums, corridors, and classrooms. For each environment, we vary the speaker location from near-field to mid-field to far-field. For each location, we play around 10 utterances. During data collection, the microphone also records ambient sounds such as people chatting, doors opening, and AC humming. In total, we collect 200 recordings. Please see the Supp. for audio-visual examples. We will publicly share all data and code.
5 Approach
We propose the Visually-Informed Dereverberation of Audio (VIDA) model, which leverages visual cues to learn representations of the environmental acoustics and sound source locations to dereverberate audio. While our model is agnostic to the audio source type, we focus on speech due to the importance of dereverberating speech for downstream analysis. VIDA consists of two main components (Figure 3): 1) a Visual Acoustics Network (VAN), which learns to map RGB-D images of the environment to features useful for dereverberation, and 2) the dereverberation module itself, which is based on a UNet encoder-decoder architecture. The UNet encoder takes as input a reverberant spectrogram, while the decoder takes the encoder’s output along with the visual dereverberation features produced by the VAN and reconstructs a dereverberated version of the audio.
Visual Acoustics Network.
Visual observations of a scene reveal information about room acoustics, including room geometry, materials, object locations, and the speaker position. We devise the VAN to capture all these cues into a latent embedding vector, which is subsequently used to remove reverb. This network takes as its input an RGB image $I_r$ and a depth image $I_d$, captured from the listener’s current position within the environment. The depth image contains information about the geometry of the environment and arrangement of objects, while the RGB image contains more information about their material composition. To better model these different information sources, we use two separate ResNet18 resnet18 networks to extract their features, yielding feature maps $e_r$ and $e_d$. We concatenate $e_r$ and $e_d$ channel-wise and feed the result to a 1x1 convolution layer that reduces the total number of channels to 512, followed by a pooling layer that reduces the spatial dimension, resulting in the output vector $e_c$.
To recover the clean speech audio, we use the UNet unet architecture, a fully convolutional network often used for image segmentation. We first use the Short-Time Fourier Transform (STFT) to convert the reverberant input audio $A_r$ to a complex spectrogram $S_r$. We treat $S_r$ as a 2-dimensional, 2-channel image, where the horizontal dimension represents time, the vertical dimension represents frequency, and the two channels represent the log-magnitude and phase angle. Our UNet takes spectrograms of a fixed size as input, but in general the duration of the speech audio we wish to dereverberate will be variable. Therefore, the model processes the full input spectrogram using a series of overlapping, sliding windows. Specifically, we segment the spectrogram along the time dimension into a sequence of fixed-size chunks $\{S_r^1, \dots, S_r^n\}$ using a sliding window of length $s$ frames and 50% overlap between consecutive windows to avoid boundary artifacts. To derive the ground-truth target spectrograms used in training, we perform the exact same segmentation operation on the clean source audio to obtain the target segments $S_s^i$.
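The segmentation step can be sketched as follows, assuming a spectrogram stored as a list of frequency rows. The function names, the zero-padding of a final partial window, and the toy sizes are assumptions for illustration rather than details of our implementation.

```python
def segment_frames(num_frames, window, overlap=0.5):
    """Return (start, end) frame indices of overlapping windows covering the input."""
    hop = int(window * (1.0 - overlap))
    if hop <= 0:
        raise ValueError("overlap must leave a positive hop size")
    starts = list(range(0, max(num_frames - window, 0) + 1, hop))
    # Assumption: if the windows stop short of the end, add one more window
    # and let the caller zero-pad the frames that fall past the input.
    if starts[-1] + window < num_frames:
        starts.append(starts[-1] + hop)
    return [(s, s + window) for s in starts]


def slice_spectrogram(spec, window, overlap=0.5):
    """Cut a spectrogram (list of frequency rows over time) into fixed-size chunks."""
    num_frames = len(spec[0])
    chunks = []
    for start, end in segment_frames(num_frames, window, overlap):
        pad = [0.0] * max(end - num_frames, 0)
        chunks.append([row[start:end] + pad for row in spec])
    return chunks


# Toy spectrogram: 2 frequency bins, 11 time frames, window of 4 frames.
spec = [[float(t) for t in range(11)], [float(10 * t) for t in range(11)]]
chunks = slice_spectrogram(spec, window=4)
```

With a window of 4 frames and 50% overlap, consecutive chunks start 2 frames apart, so every interior frame appears in exactly two chunks.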
During training, when a particular waveform $A_r$ is selected for inclusion in a data batch, we randomly sample one of its reverberant spectrogram segments $S_r^i$ to be the input to the model, and choose the corresponding clean segment $S_s^i$ as the target. We first compute the output of the VAN, $e_c$, for the environment image associated with $S_r^i$. Next, $S_r^i$ is fed to the UNet’s encoder to extract the intermediate feature map $e_s$. We then spatially tile $e_c$ and concatenate it depth-wise with $e_s$, and feed the fused features to the UNet decoder, which predicts the source spectrogram segment $\hat{S}_s^i$.
Spectrogram prediction loss.
The primary loss function we use to train our model is the Mean-Squared Error (MSE) between the predicted and ground-truth spectrograms, treating the magnitude and phase separately. For a given predicted spectrogram segment, let $M_p$ denote the predicted log-magnitude spectrogram, $P_p$ denote the predicted phase spectrogram, and $M_t$ and $P_t$ denote the respective ground-truth magnitude and phase spectrograms. We define the magnitude loss as:
$$L_{\text{magnitude}} = \frac{1}{N} \lVert M_p - M_t \rVert_2^2,$$
where $N$ is the number of time-frequency bins in the segment.
To address the issue of phase wraparound, we map the phase angle to its corresponding rectangular coordinates on the unit circle and then compute the MSE loss for the phase:
$$L_{\text{phase}} = \frac{1}{N} \lVert \sin(P_p) - \sin(P_t) \rVert_2^2 + \frac{1}{N} \lVert \cos(P_p) - \cos(P_t) \rVert_2^2.$$
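To see why the unit-circle mapping matters, consider two phase values that sit on either side of the $\pm\pi$ boundary. The sketch below compares an MSE on raw angles with an MSE on (cos, sin) coordinates; the function names and the test angles are illustrative assumptions, not our training code.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def naive_phase_loss(pred, target):
    """MSE on raw angles; it penalizes wraparound even when phases nearly coincide."""
    return mse(pred, target)


def circular_phase_loss(pred, target):
    """MSE on (cos, sin) coordinates; nearby points on the unit circle give a small loss."""
    cos_term = mse([math.cos(p) for p in pred], [math.cos(t) for t in target])
    sin_term = mse([math.sin(p) for p in pred], [math.sin(t) for t in target])
    return cos_term + sin_term


eps = 0.01
pred = [math.pi - eps]     # just below +pi
target = [-math.pi + eps]  # just above -pi, almost the same direction
naive = naive_phase_loss(pred, target)
circular = circular_phase_loss(pred, target)
```

The raw-angle loss treats the two phases as nearly $2\pi$ apart, while the circular version correctly reports that they almost coincide.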
Reverb-visual matching loss.
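To reinforce the consistency between the visually-inferred room acoustics and the reverberation characteristics learned by the UNet encoder, we also employ a contrastive reverb-visual matching loss:
$$L_{\text{matching}} = \max\{ d(\mathrm{norm}(e_c), \mathrm{norm}(e_s)) - d(\mathrm{norm}(e_c), \mathrm{norm}(e_s^n)) + m,\ 0 \}.$$
Here, $e_c$ is the embedding output by the VAN, $e_s$ is the UNet encoder’s embedding of the paired reverberant speech, $d(\cdot, \cdot)$ represents L2 distance, $\mathrm{norm}(\cdot)$ applies L2 normalization, $m$ is a margin, and $e_s^n$ is a random speech embedding sampled from the data batch. This loss forces the embeddings output by the VAN and the UNet encoder to be consistent, which we empirically show to be beneficial.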
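A margin-based contrastive loss of this kind can be sketched on plain vectors as below. The margin value of 0.5 and the two-dimensional toy embeddings are hypothetical; the margin and embedding sizes actually used are not stated in this section.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matching_loss(visual, audio_pos, audio_neg, margin=0.5):
    """Hinge on (distance to paired audio) - (distance to random audio) + margin."""
    v = l2_normalize(visual)
    pos = l2_distance(v, l2_normalize(audio_pos))
    neg = l2_distance(v, l2_normalize(audio_neg))
    return max(pos - neg + margin, 0.0)


# Paired embeddings point the same way; the negative is orthogonal.
aligned = matching_loss([1.0, 0.0], [2.0, 0.0], [0.0, 3.0])
# Swapping positive and negative yields a large penalty.
swapped = matching_loss([1.0, 0.0], [0.0, 3.0], [2.0, 0.0])
```

The loss is zero once the paired audio embedding is closer to the visual embedding than the random one by at least the margin, so well-separated pairs stop contributing gradient.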
Our overall training objective (for a single training example) is as follows:
$$L_{\text{total}} = L_{\text{magnitude}} + \lambda_1 L_{\text{phase}} + \lambda_2 L_{\text{matching}},$$
where $L_{\text{magnitude}}$, $L_{\text{phase}}$, and $L_{\text{matching}}$ are the magnitude, phase, and matching losses, and $\lambda_1$ and $\lambda_2$ are weighting factors for the phase and matching losses. To augment the data, we further rotate the image view by a random angle for each input during training. This is possible because our audio recording is omni-directional and is independent of camera pose. This data augmentation strategy prevents the model from overfitting; without it our model fails to converge. This strategy creates a one-to-many mapping between reverb and views, forcing the model to learn a viewpoint-invariant representation of the room acoustics.
At test time, we wish to re-synthesize the entire clean waveform instead of a single fixed-length segment. In this case, we feed all of the segments for a waveform into the model and temporally concatenate all of the output segments. Because consecutive segments overlap by 50%, during the concatenation step we only retain the middle 50% of the frames from each segment and discard the rest. Finally, to re-synthesize the waveform we use the Griffin-Lim algorithm griffin to iteratively improve the predicted phase for 30 iterations, which we find works better than directly using the predicted phase or using Griffin-Lim with a randomly initialized phase.
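The concatenation rule can be sketched as follows, assuming a window length divisible by 4 and segments stored as lists of frames. How the first and last quarter-window of the utterance are handled (for example, by padding the input) is not covered here and is left as an assumption.

```python
def stitch_segments(segments):
    """Concatenate frames over time, keeping only the middle half of each segment.

    Each segment is a list of frames; neighboring segments overlap by 50%.
    """
    window = len(segments[0])
    lo, hi = window // 4, window - window // 4
    stitched = []
    for seg in segments:
        stitched.extend(seg[lo:hi])
    return stitched


# Toy example: frame indices stand in for spectrogram columns.
frames = list(range(20))
window, hop = 8, 4
segments = [frames[s:s + window] for s in range(0, len(frames) - window + 1, hop)]
stitched = stitch_segments(segments)
```

Since the retained middle half has the same length as the hop, the kept spans of consecutive segments meet exactly, with no gaps and no duplicated frames.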
6 Experiments
We evaluate our model by dereverberating speech for three downstream tasks: speech enhancement (SE), automatic speech recognition (ASR), and speaker verification (SV). We evaluate using both real scanned Matterport3D environments with simulated audio and real-world data collected with a camera and mic. Please see Supp. for all hyperparameter settings and data preprocessing details.
Evaluation tasks and metrics.
We report the standard metrics Perceptual Evaluation of Speech Quality (PESQ) rix_perceptual_2001, Word Error Rate (WER), and Equal Error Rate (EER) for the three tasks, respectively. For ASR and SV, we use pretrained models from the SpeechBrain SB2021 toolkit. We evaluate these models off-the-shelf on our (de)reverberated version of the LibriSpeech test-clean set, and also explore finetuning the model on the (de)reverberated LibriSpeech train-clean-360 data. For performing speaker verification, we evaluate the model on a set of 80k randomly sampled utterance pairs from the test-clean set. Please see Supp. for more details.
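For reference, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The sketch below implements that textbook definition; it is not the SpeechBrain scoring code, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return 100.0 * prev[-1] / len(ref)


one_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.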
In addition to evaluating the clean and reverberant audio (with no enhancement), we compare against multiple baseline dereverberation models:
MetricGAN+ fu_metricgan_2021: a recently proposed state-of-the-art model for speech enhancement; we use the public implementation from SpeechBrain SB2021, trained on our dataset. Following the original paper, we optimize for PESQ during training, then choose the best-performing model snapshot (on the validation data) specific to each of our downstream tasks.
WPE wpe: A statistical speech dereverberation model that is commonly used for comparison.
Audio-only dereverb: An ablation of our proposed VIDA model that does not use any visual input or the proposed matching loss (i.e., the VAN is removed). It uses only the UNet encoder-decoder trained with the MSE loss for dereverberation; a similar model is proposed by ernst_speech_2019.
We emphasize that all baselines are audio-only models, as opposed to our proposed audio-visual model. Our multimodal dereverberation technique could extend to work in conjunction with other newly-proposed audio-only models, i.e., ongoing architecture advances are orthogonal to our idea.
| Method | Speech Enhancement: PESQ | Speech Recognition: WER (%) | Speech Recognition: WER-FT (%) | Speaker Verification: EER (%) | Speaker Verification: EER-FT (%) |
| --- | --- | --- | --- | --- | --- |
| Clean (Upper bound) | 4.64 | 2.50 | 2.50 | 1.62 | 1.62 |
| MetricGAN+ fu_metricgan_2021 | 2.33 (+51%) | 7.49 (+15%) | 4.86 (-5%) | 4.67 (+0.4%) | 2.75 (+39%) |
| WPE wpe | 1.63 (+6%) | 8.18 (+8%) | 4.30 (+7%) | 5.19 (-11%) | 4.48 (+2%) |
| Audio-only dereverb. | 2.32 (+51%) | 4.92 (+44%) | 3.76 (+19%) | 4.67 (+0.4%) | 2.61 (+43%) |
| VIDA w/ normal FoV | 2.33 (+51%) | 4.85 (+45%) | 3.73 (+19%) | 4.53 (+3%) | 2.79 (+39%) |
| VIDA w/o matching loss | 2.38 (+55%) | 4.59 (+48%) | 3.72 (+19%) | 4.02 (+14%) | 2.62 (+43%) |
| VIDA w/o human mesh | 2.31 (+50%) | 4.57 (+48%) | 3.72 (+19%) | 4.00 (+15%) | 2.52 (+45%) |
| VIDA | 2.37 (+54%) | 4.44 (+50%) | 3.66 (+21%) | 3.99 (+15%) | 2.40 (+47%) |
Results in scanned environments.
Table 1 shows the results for all models on SE, ASR, and SV. First, since existing methods report results on clean non-reverberant audio, we note the pretrained SpeechBrain model applied to clean audio (first row) yields errors competitive with the SoTA conformer, meaning we have a solid experimental testbed. Comparing the results on clean vs. reverberated speech, we see that reverberation significantly degrades performance on all tasks. Our VIDA model outperforms all other models, and by a large margin on the ASR and SV tasks: without finetuning, we achieve absolute improvements of 0.04 PESQ (1.71% relative improvement), 0.48% WER (9.75% relative improvement), and 0.68% EER (14.56% relative improvement) over the best baseline in each case (which happens to be the audio-only version of VIDA for both the ASR and SV tasks). After finetuning the ASR and SV models, the gains are still largely preserved at 0.1% WER (2.66% relative), and 0.21% EER (8.03% relative), although it is important to note that finetuning the downstream models on the enhanced speech is not always feasible. Our results demonstrate that learning the acoustic properties of an environment from visual signals is very helpful for dereverberating speech, enabling the model to leverage information about the environment unavailable in the audio alone.
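The absolute and relative gains quoted above follow directly from the Table 1 entries. The short check below recomputes them; the helper name `relative_gain` is ours for illustration, and the results agree with the quoted figures up to rounding.

```python
def relative_gain(baseline, ours, lower_is_better=True):
    """Return (absolute gain, relative gain in percent) over a baseline score."""
    diff = (baseline - ours) if lower_is_better else (ours - baseline)
    return diff, 100.0 * diff / baseline


# Without finetuning: best baseline vs. VIDA, values taken from Table 1.
pesq = relative_gain(2.33, 2.37, lower_is_better=False)  # MetricGAN+ vs. VIDA
wer = relative_gain(4.92, 4.44)                          # Audio-only vs. VIDA
eer = relative_gain(4.67, 3.99)                          # Audio-only vs. VIDA

# After finetuning the ASR and SV models.
wer_ft = relative_gain(3.76, 3.66)
eer_ft = relative_gain(2.61, 2.40)
```

Relative gains are normalized by the baseline score, so a small absolute change in EER or WER can still correspond to a sizeable relative improvement.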
To understand how the WER of our model changes as the input difficulty increases, we plot the cumulative WER against (a) the PESQ of the input speech and (b) the speaker distance from the camera in Fig. 3(a) and Fig. 3(b), respectively. Our VIDA model consistently outperforms the Audio-only baseline on all difficulty levels. Importantly, when the input sample is more difficult (low PESQ or far distance), our model shows a greater performance gain.
To understand how well VIDA works with a normal field-of-view (FoV) camera, we replace the panorama image input with an 80-degree FoV image randomly sampled from the current view. Table 1 shows the results. All metrics drop compared to using a panorama. This is expected, because the model is limited in what it can see with a narrower field of view; the inferred room acoustics are impaired by not seeing the full environment or missing where the speaker is. Compared to the audio-only dereverberation model, however, VIDA applied to narrow FoV data still performs better on every metric except the finetuned EER; even a partial view of the environment helps the model understand the scene and dereverberate the audio. The fact that the two variants (panorama vs. reduced FoV) perform differently also reinforces that our model is learning meaningful visual representations.
Next, we ablate the proposed reverb-visual matching loss (“w/o matching loss”). Without it, VIDA’s performance declines on the ASR and SV metrics, while PESQ is essentially unchanged (2.38 vs. 2.37). This shows that by forcing the visual feature to agree with the reverberation feature, our model learns a better representation of room acoustics.
To examine how much the model leverages the human speaker cues, we evaluate our trained VIDA model on the same test data but with the 3D humanoid removed (“w/o human mesh”). All three metrics become worse. This shows that our model pays attention to the location of the human speaker to learn a better representation of the anticipated reverberation.
| Method | Speech Enhancement: PESQ | Speech Recognition: WER (%) | Speaker Verification: EER (%) |
| --- | --- | --- | --- |
| Clean (Upper bound) | 4.64 | 2.52 | 1.42 |
| MetricGAN+ fu_metricgan_2021 | 1.62 (+33%) | 21.42 (-16%) | 5.70 (-46%) |
| Audio-only dereverb. | 1.41 (+16%) | 15.18 (+17%) | 4.24 (-8%) |
| VIDA w/ normal FoV | 1.44 (+18%) | 14.71 (+20%) | 3.79 (+3%) |
| VIDA | 1.49 (+22%) | 13.02 (+29%) | 3.75 (+4%) |

| Near-field | 14.10 / 8.97 | 0.91 / 0.91 | 4.98 / 6.47 | 6.14 / 5.26 | 2.15 / 1.79 |
| Mid-field | 21.78 / 18.94 | 5.06 / 6.32 | 7.67 / 7.67 | 2.56 / 1.47 | 7.27 / 4.36 |
| Far-field | 52.38 / 50.52 | 10.44 / 7.46 | 21.95 / 6.71 | 5.91 / 6.82 | 25.23 / 21.10 |
Robustness to noise and ambient sounds.
We next test the robustness of our model by adding ambient sounds from urban environments such as coffee shops, restaurants, and bars using the WHAM dataset Wichern2019WHAM. We add them to the reverberant test waveform with an SNR of 20 dB, following ernst_speech_2019; wpe. We show our model is robust to this noise; see Supp. for details.
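Mixing noise at a target signal-to-noise ratio amounts to rescaling the noise power before adding it. The sketch below assumes the SNR is expressed in decibels and uses synthetic stand-ins for the speech and ambient recordings; the function names are hypothetical.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Synthetic stand-ins: a sine "speech" signal and a deterministic pseudo-noise.
speech = [math.sin(0.1 * n) for n in range(1000)]
noise = [((n * 37) % 17 - 8) / 8.0 for n in range(1000)]
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=20.0)
measured_snr = 10.0 * math.log10(power(speech) / power(scaled_noise))
```

At 20 dB the noise carries one hundredth of the speech power, so it is audible but does not dominate the reverberant signal.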
Results on real data.
Next, we deploy our model in the real world. We use all models trained in simulation to dereverberate the real-world dataset (cf. Sec. 4) before using the finetuned ASR/SV models to evaluate the enhanced speech. Table 2 shows the results of all models on real data. Reverberation does more damage to the WER on real data than in simulation. Although MetricGAN+ fu_metricgan_2021 has surprisingly better PESQ, it has a weak WER score. Our VIDA model again outperforms all baselines on ASR and SV. This demonstrates the realism of the simulation and the capability of our trained model to directly transfer to real-world data, a promising step for VIDA’s wider applicability.
Table 3 breaks down the ASR performance for Audio-only dereverb. (the best baseline) and our VIDA model by environment type and speaker distance. The atrium is quite reverberant due to the large space. Although the auditorium is similarly large, the space is designed to reduce reverberation and thus both models have lower WER. The meeting room and the classroom have smaller sizes and are comparatively less reverberant. The corridor only becomes reverberant when the speaker is far away. VIDA outperforms the Audio-only dereverb. baseline in most cases, especially in highly reverberant ones. See the Supp. video for examples.
Analyzing learned features.
Figures 3(c) and 3(d) analyze our model’s learned audio and visual features via 2D t-SNE projections tsne. For each sample, we color the point according to either (c) the ground truth distance between the camera/microphone and the human speaker or (d) the reverberation time for the audio signal to decay by 60 dB (known as the RT60). Neither of these variables is available to our model during training, yet when learning to perform dereverberation, our model exposes these high-level properties relevant to the audio-visual task. Consistent with the quantitative results above, this analysis shows how our model captures elements of the visual scene, room geometry, and speaker location that are valuable to proper dereverberation.
Figure 5 shows a simulated and a real-world example. The reverberant spectrogram is much blurrier than the clean spectrogram, while our predicted spectrogram removes those reverberations by leveraging the visual cues of room acoustics.
We introduced the novel task of audio-visual dereverberation. The proposed VIDA approach learns to remove reverb by attending to both the audio and visual streams, recovering valuable signals about room geometry, materials, and speaker locations from visual encodings of the environment. In support of this task, we developed a large-scale dataset providing realistic, spatially registered observations of speech and 3D environments. Our results show VIDA successfully dereverberates novel voices in novel environments more accurately than an array of baselines, improving multiple downstream tasks. Furthermore, having been trained in the realistic simulator, our model also shows promise for enhancing speech in real-world data, although more work is needed to bring down the absolute error rates. In future work, we plan to explore temporal models for audio-visual dereverberation with video and introduce active perception for camera control.
8 Supplementary Materials
In this supplementary material, we provide additional details about:
Video (with audio) for demos of the collected data as well as qualitative assessment of VIDA’s performance.
Implementation details of our model and data pre-processing.
Evaluation details of downstream tasks and corresponding metrics.
Robustness to noise and ambient sound.
Ablation on visual sensors.
8.1 Qualitative Video
This video includes examples of audio-visual data in simulation and in the real world. We demonstrate examples of our dereverberation model applied to these inputs. The video is available at https://youtu.be/zPeAjlwo6XA.
8.2 Implementation Details
For the STFT calculation, we sample the input audio at 16 kHz and use a Hamming window of size 400 samples (25 milliseconds), a hop length of 160 samples (10 milliseconds), and a 512-point FFT. By retaining only the positive frequencies and segmenting the spectrograms into 256-frame chunks (corresponding to approximately 2.5 seconds of sound), the final audio input size to our UNet is 256x256. We use the Adam optimizer kingma2014adam to train our model with . We decay the learning rate exponentially to
in 150 epochs. We set the batch size to 96 and train all models for 150 epochs, which is long enough to reach convergence. We set the margin to 0.5, the phase loss weight to 0.08, and the matching loss weight to 0.001.
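The STFT framing arithmetic above can be illustrated with a short NumPy sketch. It is not the authors' code: it assumes no padding at the signal edges, and it drops the Nyquist bin to go from 257 to 256 frequency bins (the text only says that positive frequencies are retained, so which bin is dropped is an assumption). The function names are hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16000   # Hz
WIN_LENGTH = 400      # 25 ms Hamming window
HOP_LENGTH = 160      # 10 ms hop
N_FFT = 512           # FFT size; frames are zero-padded from 400 to 512 samples
CHUNK_FRAMES = 256    # frames per chunk fed to the network

def stft_magnitude(waveform):
    """Return a (256, n_frames) magnitude spectrogram."""
    waveform = np.asarray(waveform, dtype=np.float64)
    window = np.hamming(WIN_LENGTH)
    # Assumption: no edge padding, so only full windows are used.
    n_frames = 1 + (len(waveform) - WIN_LENGTH) // HOP_LENGTH
    frames = np.stack([
        waveform[i * HOP_LENGTH: i * HOP_LENGTH + WIN_LENGTH] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, n=N_FFT, axis=1)  # shape (n_frames, 257)
    # Assumption: drop the last (Nyquist) bin to obtain 256 frequency bins.
    return np.abs(spectrum[:, :N_FFT // 2]).T

def split_into_chunks(spectrogram):
    """Split along time into non-overlapping 256-frame chunks, discarding any remainder."""
    n_chunks = spectrogram.shape[1] // CHUNK_FRAMES
    return [spectrogram[:, k * CHUNK_FRAMES:(k + 1) * CHUNK_FRAMES]
            for k in range(n_chunks)]

rng = np.random.default_rng(0)
audio = rng.standard_normal(3 * SAMPLE_RATE)  # stand-in for 3 s of audio
spec = stft_magnitude(audio)
chunks = split_into_chunks(spec)
chunk_seconds = CHUNK_FRAMES * HOP_LENGTH / SAMPLE_RATE  # 2.56 s per chunk
```

With a 10 ms hop, 256 frames span 2.56 seconds, which matches the "approximately 2.5 seconds" quoted above.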
8.3 Evaluation Details
We evaluate our model on three tasks: speech enhancement (SE), automatic speech recognition (ASR), and speaker verification (SV).
For SE, the goal is to improve the overall sonic quality of the speech signal, which we measure automatically using the standard Perceptual Evaluation of Speech Quality (PESQ) rix_perceptual_2001 metric.
For ASR, the goal is to automatically transcribe the sequence of words that were spoken in the audio recording. For this task, we report the Word Error Rate (WER), which is the standard metric used in ASR and reflects a word-level edit distance between a recognizer’s output and the ground-truth transcription.
For SV, the goal is to determine whether two spoken utterances come from the same speaker. For this task, we report the Equal Error Rate (EER), a standard metric in the SV field indicating the point on the Detection Error Tradeoff (DET) curve where the false alarm and missed detection probabilities are equal.
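To make the WER definition concrete, here is a minimal sketch of a word-level edit distance normalized by the reference length. It is a generic textbook implementation rather than the scoring code of any particular toolkit; the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.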
Since the spectrogram MSE loss we optimize during training does not perfectly correlate with these three task-specific metrics, we perform model selection (across snapshots saved each training epoch) by computing the task-specific evaluation metric on 500 validation samples. We then select the best model snapshot independently for each downstream task and evaluate on the held-out test set; the same model selection procedure is also used for all of our baseline models.
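The per-task snapshot selection described above can be sketched as follows. The dictionary layout, the function name, and the validation scores are made up for illustration; only the selection rule (best validation metric per task, with PESQ maximized and WER/EER minimized) follows the text.

```python
# Whether a larger value of each metric is better.
HIGHER_IS_BETTER = {"pesq": True, "wer": False, "eer": False}

def select_best_snapshots(validation_scores):
    """Map each metric to the epoch whose snapshot scores best on it.

    validation_scores: {epoch: {"pesq": float, "wer": float, "eer": float}}
    """
    best = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        best[metric] = pick(validation_scores,
                            key=lambda epoch: validation_scores[epoch][metric])
    return best

# Hypothetical validation results for three saved snapshots.
scores = {
    1: {"pesq": 2.10, "wer": 6.1, "eer": 4.9},
    2: {"pesq": 2.35, "wer": 5.2, "eer": 4.2},
    3: {"pesq": 2.30, "wer": 4.8, "eer": 4.4},
}
best = select_best_snapshots(scores)
```

In this toy example the snapshot chosen for ASR (epoch 3) differs from the one chosen for SE and SV (epoch 2), which is exactly the situation that motivates selecting a snapshot independently for each task.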
For the ASR and SV tasks, we use the SpeechBrain SB2021 toolkit. For ASR, we use the HuggingFace Transformer vaswani2017attention + Transformer LM model pre-trained on LibriSpeech librispeech. We evaluate this model off-the-shelf on our (de)reverberated version of the LibriSpeech test-clean set, and also explore fine-tuning the model on the (de)reverberated LibriSpeech train-clean-360 data. For the SV task, we use SpeechBrain’s ECAPA-TDNN embedding model DesplanquesTD20, pre-trained on VoxCeleb Nagrani17. For verification, we evaluate the model on a set of 80k randomly sampled utterance pairs from the test-clean set (40k same-speaker pairs, 40k different-speaker pairs) using the cosine similarity-based scoring pipeline from SpeechBrain’s VoxCeleb recipe. In the verification task, we use the clean (non-reverberated) speech as the reference utterance, and the reverberant speech as the test utterance. As in the ASR task, we evaluate this model on our dereverberation model’s outputs both off-the-shelf and after fine-tuning on the (de)reverberated train-clean-360 set.
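As a rough illustration of cosine-similarity scoring and the EER computed from it, consider the sketch below. It is not SpeechBrain's implementation: the function names, the threshold sweep over observed scores, and the toy scores are assumptions chosen to keep the example short.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two speaker embeddings."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(same_scores, diff_scores):
    """Sweep thresholds and return the error rate where false alarms and misses are closest."""
    same = np.asarray(same_scores, dtype=np.float64)
    diff = np.asarray(diff_scores, dtype=np.float64)
    best_gap, eer = float("inf"), 1.0
    for threshold in np.sort(np.concatenate([same, diff])):
        false_alarm = np.mean(diff >= threshold)  # different-speaker pairs accepted
        miss = np.mean(same < threshold)          # same-speaker pairs rejected
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, eer = gap, float((false_alarm + miss) / 2.0)
    return eer

# Toy similarity scores for same-speaker and different-speaker pairs.
same_pairs = [0.9, 0.8, 0.7, 0.3]
diff_pairs = [0.6, 0.2, 0.1, 0.05]
eer = equal_error_rate(same_pairs, diff_pairs)
```

At a threshold of 0.6, one of the four different-speaker pairs is accepted and one of the four same-speaker pairs is rejected, so both error rates equal 25% and the EER of this toy set is 0.25.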
8.4 Robustness to Noisy Audio
We test the robustness of our model by adding ambient sounds from urban environments such as coffee shops, restaurants, and bars using the WHAM dataset Wichern2019WHAM. We add them to the reverberant test waveform at an SNR of 20 dB, following ernst_speech_2019; wpe. Table 4 shows the results on three downstream tasks. As expected, the performance of all models drops compared to the noise-free test setting (Table 1), but our VIDA model still significantly outperforms the baselines on ASR. For speaker verification, WPE wpe is reported to be robust to noisy input and thus has a lower EER when using the pretrained model, but it underperforms VIDA when the SV model is finetuned on the enhanced speech. Noise has less impact on the performance of MetricGAN+ fu_metricgan_2021, likely because it directly optimizes PESQ.
|Speech Enhancement||Speech Recognition||Speaker Verification|
|PESQ||WER (%)||WER-FT (%)||EER (%)||EER-FT (%)|
|Clean (Upper bound)||4.64||2.50||2.50||1.62||1.62|
|MetricGAN+ fu_metricgan_2021||2.12 (+57%)||9.40 (+23%)||7.09 (-11%)||4.94 (-5%)||3.38 (+34%)|
|WPE wpe||1.39 (+2%)||11.32 (+8%)||7.00 (-10%)||4.48 (+4%)||4.95 (+3%)|
|Audio-only dereverb.||1.76 (+29%)||7.37 (+40%)||5.52 (+14%)||5.75 (-23%)||3.58 (+30%)|
|VIDA w/ normal FoV||1.76 (+29%)||7.51 (+39%)||5.51 (+14%)||5.54 (-18%)||3.40 (+33%)|
|VIDA w/o matching loss||1.81 (+33%)||6.76 (+45%)||5.31 (+17%)||4.95 (-6%)||3.26 (+36%)|
|VIDA||1.82 (+34%)||6.53 (+47%)||5.29 (+17%)||4.83 (-3%)||3.13 (+39%)|
Table 4: Results on the LibriSpeech test-clean set mixed with ambient sounds at a 20 dB signal-to-noise ratio.
8.5 Ablation on Visual Sensors
To understand the importance of each input sensor, we ablate the RGB and depth inputs as shown in Table 5. Dropping either RGB or depth makes the WER worse. We hypothesize that this is because the two modalities carry distinct information about room acoustics: the depth image is better for capturing room geometry, while the RGB image is better for capturing material and speaker location information.
In addition, we perform early fusion of the RGB and depth images by stacking them along the channel dimension (w/ early fusion in Table 5) and using one ResNet18 resnet18 model instead of two. This variant also has a worse WER, which validates our design choice of extracting RGB and depth features separately.
|Speech Enhancement||Speech Recognition||Speaker Verification|
|PESQ||WER (%)||EER (%)|
|Audio-only dereverb.||2.32 (+51%)||4.92 (+44%)||4.67 (+0.4%)|
|VIDA w/o RGB||2.38 (+55%)||4.76 (+46%)||3.82 (+19%)|
|VIDA w/o depth||2.38 (+55%)||4.52 (+49%)||3.99 (+15%)|
|VIDA w/ early fusion||2.38 (+55%)||4.56 (+48.5%)||3.94 (+16%)|
|VIDA||2.37 (+54%)||4.44 (+50%)||3.99 (+15%)|
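The structural difference between the early-fusion ablation and separate per-modality encoders can be sketched with array shapes alone. The random linear maps below are stand-ins for ResNet18 encoders, the image and feature sizes are hypothetical, and combining the separate features by concatenation is an assumption, since this section does not state how they are merged.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, FEATURE_DIM = 8, 16, 32  # hypothetical image and feature sizes

rgb = rng.random((3, H, W))    # 3-channel RGB image
depth = rng.random((1, H, W))  # 1-channel depth image

def make_encoder(in_channels):
    """Return a random linear map standing in for an image encoder such as ResNet18."""
    weights = rng.standard_normal((FEATURE_DIM, in_channels * H * W))
    return lambda image: weights @ image.reshape(-1)

# Early fusion: stack RGB and depth along the channel axis and use ONE encoder.
stacked = np.concatenate([rgb, depth], axis=0)  # shape (4, H, W)
early_encoder = make_encoder(in_channels=4)
early_features = early_encoder(stacked)

# Separate encoders: each modality has its own weights; features are merged afterwards.
rgb_encoder = make_encoder(in_channels=3)
depth_encoder = make_encoder(in_channels=1)
late_features = np.concatenate([rgb_encoder(rgb), depth_encoder(depth)])
```

With separate encoders, each network can specialize in its modality before the features meet, whereas early fusion forces a single set of filters to handle both color and depth statistics from the first layer.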