We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprised of thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as real-world scenarios involving heated interviews, noisy bars, and screaming children, only requiring the user to specify the face of the person in the video whose speech they want to isolate. Our method shows clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (require training a separate model for each speaker of interest).
Humans are remarkably capable of focusing their auditory attention on a single sound source within a noisy environment, while de-emphasizing (“muting”) all other voices and sounds. The way neural systems achieve this feat, which is known as the cocktail party effect [Cherry, 1953], remains unclear. However, research has shown that viewing a speaker’s face enhances a person’s capacity to resolve perceptual ambiguity in a noisy environment [Ma et al., 2009; Golumbic et al., 2013]. In this paper we achieve this ability computationally.
Automatic speech separation—separating an input audio signal into its individual speech sources—is well-studied in the audio processing literature. Since this problem is inherently ill-posed, it requires prior knowledge or special microphone configurations in order to obtain a reasonable solution [McDermott, 2009]. In addition, a fundamental problem with audio-only speech separation is the label permutation problem [Hershey et al., 2016]: there is no easy way to associate each separated audio source with its corresponding speaker in the video [Yu et al., 2017; Hershey et al., 2016].
In this work, we present a joint audio-visual method for “focusing” audio on a desired speaker in a video. The input video can then be recomposed such that the audio corresponding to specific people is enhanced while all other sound is suppressed (Fig. 1). More specifically, we design and train a neural network-based model that takes the recorded sound mixture, along with tight crops of detected faces in each frame of the video as input, and splits the mixture into separate audio streams for each detected speaker. The model uses visual information both as a means to improve the source separation quality (compared to audio-only results), as well as to associate the separated speech tracks with visible speakers in the video. All that is required from the user is to specify which faces of the people in the video they want to hear the speech from.
To train our model, we collected 290,000 high-quality lectures, TED talks and how-to videos from YouTube, then automatically extracted from these videos roughly 4700 hours of video clips with visible speakers and clean speech with no interfering sounds (Fig. 2). We call our new dataset AVSpeech. With this dataset in hand, we then generated a training set of “synthetic cocktail parties”—mixtures of face videos with clean speech, and other speech audio tracks and background noise.
We demonstrate the benefits of our approach over recent speech separation methods in two ways. First, we show superior results compared to a state-of-the-art audio-only method on pure speech mixtures. Second, we demonstrate our model’s capability of producing enhanced sound streams from mixtures containing both overlapping speech and background noise in real-world scenarios.
To summarize, our paper makes two main contributions: (a) An audio-visual speech separation model that outperforms audio-only and audio-visual models on classic speech separation tasks, and is applicable in challenging, natural scenes. To our knowledge, our paper is the first to propose a speaker-independent audio-visual model for speech separation. (b) A new, large-scale audio-visual dataset, AVSpeech, carefully collected and processed, comprised of video segments where the audible sound belongs to a single person, visible in the video, and no audio background interference. This dataset allows us to achieve state-of-the-art results on speech separation and may be useful for the research community for further studies. Our dataset, input and output videos, and additional supplementary materials are all available on the project web page: http://looking-to-listen.github.io/.
We briefly review related work in the areas of speech separation and audio-visual signal processing.
Speech separation is one of the fundamental problems in audio processing and has been the subject of extensive study over the last decades. Wang and Chen [2017] give a comprehensive overview of recent audio-only methods based on deep learning that tackle both speech denoising [Weninger et al., 2015; Erdogan et al., 2015] and speech separation tasks.
Two recent works have emerged which solve the aforementioned label permutation problem to perform speaker-independent, multi-speaker separation in the single-channel case. Hershey et al. [2016] propose a method called deep clustering, in which discriminatively trained speech embeddings are used to cluster and separate the different sources. Hershey et al. [2016] also introduced the idea of a permutation-free or permutation-invariant loss function, but did not find that it worked well. Isik et al. and Yu et al. [2017] subsequently introduced methods which successfully use a permutation-invariant loss function to train a DNN.
The advantages of our approach over such audio-only methods are threefold: First, we show that the separation results of our audio-visual model are of higher quality than those of a state-of-the-art-inspired audio-only model. Second, our approach performs well in the setting of multiple speakers mixed with background noise, which, to our knowledge, no audio-only method has satisfactorily solved. Third, we jointly solve two speech processing problems: speech separation, and assignment of a speech signal to its corresponding face, which, thus far, have been tackled separately [Hoover et al., 2017; Hu et al., 2015; Monaci, 2011].
There is increased interest in using neural networks for multi-modal fusion of auditory and visual signals to solve various speech-related problems. These include audio-visual speech recognition [Ngiam et al., 2011; Mroueh et al., 2015; Feng et al., 2017], predicting speech or text from silent video (lipreading) [Ephrat et al., 2017; Chung et al., 2016], and unsupervised learning of language from visual and speech signals [Harwath et al., 2016]. These methods leverage the natural synchrony between simultaneously recorded visual and auditory signals.
Audio-visual (AV) methods have also been used for speech separation and enhancement [Hershey and Casey, 2002; Hershey et al., 2004; Rivet et al., 2014; Khan, 2016]. Casanovas et al. perform AV source separation using sparse representations; their approach is limited by its dependence on active-alone regions to learn source characteristics, and by the assumption that all audio sources are visible on-screen. Recent methods have used neural networks to perform the task. Hou et al. propose a multi-task CNN-based model which outputs a denoised speech spectrogram as well as a reconstruction of the input mouth region. Gabbay et al. train a speech enhancement model on videos where other speech samples of the target speaker are used as background noise, in a scheme they call “noise-invariant training”. In concurrent work, Gabbay et al. use a video-to-sound synthesis method to filter noisy audio.
The main limitation of these AV speech separation approaches is that they are speaker-dependent, meaning a dedicated model must be trained for each speaker separately. While these works make specific design choices that limit their applicability only to the speaker-dependent case, we speculate that the main reason a speaker-independent AV model hasn’t been pursued widely so far is the lack of a sufficiently large and diverse dataset for training such models — a dataset like the one we construct and provide in this work. To the best of our knowledge, our paper is the first to address the problem of speaker-independent AV speech separation. Our model is capable of separating and enhancing speakers it has never seen before, speaking in languages that were not part of the training set. In addition, our work is unique in that we show high quality speech separation on real world examples, in settings that previous audio-only and audio-visual speech separation work did not address.
A number of independent and concurrent works have recently emerged which address the problem of audio-visual sound source separation using deep neural networks. Owens and Efros [2018] train a network to predict whether audio and visual streams are temporally aligned; learned features extracted from this self-supervised model are then used to condition a source separation model for on-screen and off-screen speakers. Afouras et al. perform speech enhancement by using a network to predict both the magnitude and phase of denoised speech spectrograms. Zhao et al. and Gao et al. address the closely related problem of separating the sound of multiple on-screen objects (e.g., musical instruments).
Most existing AV datasets comprise videos with only a small number of subjects, speaking words from a limited vocabulary. For example, the CUAVE dataset [Patterson et al., 2002] contains 36 subjects saying each digit from zero to nine five times each, for a total of 180 examples per digit. Another example is the Mandarin sentences dataset, introduced by Hou et al., which contains video recordings of 320 utterances of Mandarin sentences spoken by a native speaker. Each sentence contains 10 Chinese characters with equally distributed phonemes. The TCD-TIMIT dataset [Harte and Gillen, 2015] consists of 60 volunteer speakers with around 200 videos each. The speakers recite various sentences from the TIMIT dataset [Garofolo et al., 1992], and are recorded using both front-facing and 30-degree cameras. We evaluate our results on these three datasets in order to compare to previous work.
Recently, the large-scale Lip Reading Sentences (LRS) dataset was introduced by Chung et al. , which includes both a wide variety of speakers and words from a larger vocabulary. However, not only is that dataset not publicly available, but the speech in LRS videos is not guaranteed to be clean, which is crucial for training a model for speech separation and enhancement.
We introduce a new, large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses. Representative frames, audio waveforms and some dataset statistics are shown in Figure 2.
We collected the dataset automatically, since for assembling a corpus of this magnitude it was important not to rely on substantial human feedback. Our dataset creation pipeline collected clips from roughly 290,000 YouTube videos of lectures (e.g. TED talks) and how-to videos. For such channels, most of the videos comprise a single speaker, and both the video and audio are generally of high quality.
Our dataset collection process has two main stages, as illustrated in Figure 3. First, we used the speaker tracking method of Hoover et al. to detect video segments of a person actively speaking with their face visible. Face frames that were blurred, insufficiently illuminated, or in extreme pose were discarded from the segments. If more than 15% of a segment’s face frames were missing, the segment was discarded altogether. We used the Google Cloud Vision API (https://cloud.google.com/vision/) for the classifiers in this stage, and to compute the statistics in Figure 2.
The second step in building the dataset is refining the speech segments to include only clean, non-interfered speech. This is a crucial component because such segments serve as ground truth during training. We perform this refinement step automatically by estimating the speech SNR (the log ratio of the main speech signal to the rest of the audio signal) of each segment as follows.
We used a pre-trained audio-only speech denoising network to predict the SNR of a given segment, using the denoised output as an estimate of the clean signal. The architecture of this network is the same as the one implemented for the audio-only speech enhancement baseline in Section 5, and it was trained on speech from the LibriVox collection of public domain audiobooks. Segments for which the estimated SNR fell below a threshold were rejected. The threshold was set empirically using synthetic mixtures of clean speech and non-speech interfering noise at different, known SNR levels; such mixtures simulate well the type of interference in our dataset, which typically involves a single speaker interfered with by non-speech sounds like audience clapping or intro music. These synthetic mixtures were fed into the denoising network, and the estimated (denoised) SNR was compared to the ground-truth SNR (see Figure 3(b)).
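The filtering rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise` is a placeholder for the pretrained audio-only enhancement network (not provided here), and the threshold is left as a parameter.

```python
import numpy as np

def estimate_snr_db(noisy, denoise):
    """Estimate a segment's speech SNR by treating the denoiser's output as
    the clean signal and the residual as the interference. `denoise` stands
    in for the pretrained audio-only enhancement network."""
    clean_est = np.asarray(denoise(noisy), dtype=float)
    residual = np.asarray(noisy, dtype=float) - clean_est
    eps = 1e-12  # guard against division by zero
    return 10.0 * np.log10((np.sum(clean_est ** 2) + eps)
                           / (np.sum(residual ** 2) + eps))

def keep_segment(noisy, denoise, threshold_db):
    # Reject segments whose estimated SNR falls below the empirically
    # chosen threshold.
    return estimate_snr_db(noisy, denoise) >= threshold_db
```

With an oracle denoiser (one that returns the true clean signal), the estimate equals the true SNR exactly, which is the idea behind calibrating the threshold on synthetic mixtures with known SNR.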
We found that at low SNRs the estimated SNR is, on average, very accurate, and can thus be considered a good predictor of the original noise level. At higher SNRs (i.e., segments with little to no interference of the original speech signal), the accuracy of this estimator diminishes because the noise signal is faint. The transition occurs at around 17 dB, as can be seen in Figure 3(b). We listened to a random sample of 100 clips which passed this filtering, and found that none of them contained noticeable background noise. We provide sample video clips from our dataset in the supplementary material.
At a high-level, our model is comprised of a multi-stream architecture which takes visual streams of detected faces and noisy audio as input, and outputs complex spectrogram masks, one for each detected face in the video (Figure 4). The noisy input spectrograms are then multiplied by the masks to obtain an isolated speech signal for each speaker, while suppressing all other interfering signals.
Our model takes both visual and auditory features as input. Given a video clip containing multiple speakers, we use an off-the-shelf face detector (e.g. Google Cloud Vision API) to find faces in each frame (75 face thumbnails altogether per speaker, assuming 3-second clips at 25 FPS). We use a pretrained face recognition model to extract one face embedding per frame for each of the detected face thumbnails. We use the lowest layer in the network that is not spatially varying, similar to the one used by Cole et al.  for synthesizing faces. The rationale for this is that these embeddings retain information necessary for recognizing millions of faces, while discarding irrelevant variation between images, such as illumination. In fact, recent work also demonstrated that it is possible to recover facial expressions from such embeddings [Rudd et al., 2016]. We also experimented with raw pixels of the face images, which did not lead to improved performance.
As for the audio features, we compute the short-time Fourier transform (STFT) of 3-second audio segments. Each time-frequency (TF) bin contains the real and imaginary parts of a complex number, both of which we use as input. We perform power-law compression to prevent loud audio from overwhelming soft audio. The same processing is applied to both the noisy signal and the clean reference signal.
At inference time, our separation model can be applied to arbitrarily long segments of video. When more than one speaking face is detected in a frame, our model can accept multiple face streams as input, as we will discuss shortly.
The output of our model is a multiplicative spectrogram mask, which describes the time-frequency relationships of clean speech to background interference. In previous work [Wang et al., 2014; Wang and Chen, 2017], multiplicative masks have been observed to work better than alternatives such as direct prediction of spectrogram magnitudes or direct prediction of time-domain waveforms. Many types of masking-based training targets exist in the source separation literature [Wang and Chen, 2017], of which we experiment with two: ratio mask (RM) and complex ratio mask (cRM).
The ideal ratio mask is defined as the ratio between the magnitudes of the clean and noisy spectrograms, and its values are assumed to lie between 0 and 1. The complex ideal ratio mask is defined as the ratio of the complex clean and noisy spectrograms. The cRM has a real component and an imaginary component, which are separately estimated in the real domain. The real and imaginary parts of the complex mask typically lie between -1 and 1; however, we use sigmoidal compression to bound these complex mask values between 0 and 1 [Wang et al., 2016].
When masking with cRM, denoised waveforms are obtained by performing inverse STFT (ISTFT) on the complex multiplication of the predicted cRM and noisy spectrogram. When using RM, we perform ISTFT on the point-wise multiplication of the predicted RM and noisy spectrogram magnitude, combined with the noisy original phase [Wang and Chen, 2017].
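A minimal numpy/scipy sketch of the two masking schemes, using the ideal (oracle) masks for illustration; in the actual system the network predicts the masks. The window and hop follow the paper's 25 ms / 10 ms settings at 16 kHz, while the signals here are toy examples.

```python
import numpy as np
from scipy.signal import stft, istft

fs, nper, hop, nfft = 16000, 400, 160, 512  # 25 ms window, 10 ms hop
t = np.arange(fs * 3) / fs
clean = np.sin(2 * np.pi * 440 * t)                       # toy "speech"
noisy = clean + 0.5 * np.random.default_rng(0).standard_normal(t.size)

kw = dict(fs=fs, nperseg=nper, noverlap=nper - hop, nfft=nfft)
_, _, S_clean = stft(clean, **kw)
_, _, S_noisy = stft(noisy, **kw)
eps = 1e-12

# Ratio mask (RM): magnitude ratio, applied together with the noisy phase.
rm = np.abs(S_clean) / (np.abs(S_noisy) + eps)
_, x_rm = istft(rm * np.abs(S_noisy) * np.exp(1j * np.angle(S_noisy)), **kw)

# Complex ratio mask (cRM): complex ratio, applied by complex multiplication.
crm = S_clean / (S_noisy + eps)
_, x_crm = istft(crm * S_noisy, **kw)
```

Because the cRM also corrects the phase, the oracle cRM reconstruction is essentially exact, whereas the oracle RM reconstruction retains the noisy phase and hence some error, which mirrors the quality gap reported in the text.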
Given multiple detected speakers’ face streams as input, the network outputs a separate mask for each speaker, and one for background interference. We perform most of our experiments using cRM, as we found that output speech quality using it was significantly better than RM. See Table 6 for a quantitative comparison of the two methods.
Fig. 4 provides a high-level overview of the various modules in our network, which we will now describe in detail.
The audio stream part of our model consists of dilated convolutional layers, the parameters of which are specified in Table 1.
The visual stream of our model is used to process the input face embeddings (see Section 4.1), and consists of dilated convolutions as detailed in Table 2. Note that “spatial” convolutions and dilations in the visual stream are performed over the temporal axis (not over the 1024-D face embedding channel).
To compensate for the sampling rate discrepancy between the audio and video signals, we upsample the output of the visual stream to match the spectrogram sampling rate (100 Hz, given the 10 ms STFT hop). This is done using simple nearest-neighbor interpolation in the temporal dimension of each visual feature.
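This temporal alignment can be sketched as a nearest-neighbor index lookup; the function name and shapes are illustrative.

```python
import numpy as np

def upsample_visual(feats, out_len):
    """Nearest-neighbor upsampling of visual features along time.
    feats: array of shape (T_video, D), sampled at the video rate (25 FPS).
    out_len: number of spectrogram frames to align with (~100 Hz)."""
    idx = np.floor(np.arange(out_len) * feats.shape[0] / out_len).astype(int)
    return feats[idx]
```

For a 3-second clip this maps 75 video-rate feature vectors to ~300 spectrogram-rate frames, with each visual feature simply repeated four times.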
The audio and visual streams are combined by concatenating the feature maps of each stream, which are subsequently fed into a BLSTM followed by three FC layers. The final output consists of a complex mask (two-channels, real and imaginary) for each of the input speakers. The corresponding spectrograms are computed by complex multiplication of the noisy input spectrogram and the output masks. The squared error (L2) between the power-law compressed clean spectrogram and the enhanced spectrogram is used as a loss function to train the network. The final output waveforms are obtained using ISTFT, as described in Section 4.1.
Our model supports isolation of multiple visible speakers in a video, each represented by a visual stream, as illustrated in Fig. 4. A separate, dedicated model is trained for each number of visible speakers, e.g. a model with one visual stream for one visible speaker, double visual stream model for two, etc. All the visual streams share the same weights across convolutional layers. In this case, the learned features from each visual stream are concatenated with the learned audio features before continuing on to the BLSTM. It should be noted that in practice, a model which takes a single visual stream as input can be used in the general case in which either the number of speakers is unknown, or a dedicated multi-speaker model is unavailable.
Our network is implemented in TensorFlow, whose built-in operations are used for performing waveform and STFT transformations. ReLU activations follow all network layers except the last (mask) layer, where a sigmoid is applied. Batch normalization [Ioffe and Szegedy, 2015] is performed after all convolutional layers. Dropout is not used, as we train on a large amount of data and do not suffer from overfitting. We use a batch size of 6 samples and train with the Adam optimizer for 5 million steps (batches), with a learning rate that is reduced by half every 1.8 million steps.
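The step-wise learning-rate schedule can be written as a simple function; the base learning rate is left as a parameter here, since the excerpt does not state its value.

```python
def learning_rate(step, base_lr, halve_every=1_800_000):
    """Schedule described in the text: the learning rate is halved every
    1.8 million training steps. `base_lr` is an unstated hyperparameter."""
    return base_lr * 0.5 ** (step // halve_every)
```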
All audio is resampled to 16 kHz, and stereo audio is converted to mono by taking only the left channel. The STFT is computed using a Hann window of length 25 ms, a hop length of 10 ms, and an FFT size of 512, resulting in an input audio feature of 257 × 298 × 2 scalars. Power-law compression is performed with p = 0.3 (A^0.3, where A is the input/output audio spectrogram).
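The audio feature computation can be sketched with scipy standing in for the TensorFlow STFT ops; the 257-bin × ~298-frame shape follows directly from the stated window, hop, and FFT parameters for a 3-second clip, and a sign-preserving power-law with exponent 0.3 is assumed for the compression of the real/imaginary channels.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
audio = np.random.default_rng(0).standard_normal(fs * 3)  # 3 s mono clip

# Hann window of 25 ms (400 samples), 10 ms hop (160 samples), FFT size 512.
_, _, S = stft(audio, fs=fs, nperseg=400, noverlap=240, nfft=512,
               boundary=None, padded=False)

# Sign-preserving power-law compression (assumed exponent p = 0.3),
# applied to the real and imaginary channels separately.
p = 0.3
features = np.stack([np.sign(S.real) * np.abs(S.real) ** p,
                     np.sign(S.imag) * np.abs(S.imag) ** p], axis=-1)
```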
We resample the face embeddings from all videos to 25 frames per second (FPS) before training and inference by either removing or replicating embeddings. This results in an input visual stream of 75 face embeddings. Face detection, alignment and quality assessment are performed using the tools described by Cole et al. When missing frames are encountered in a particular sample, we use a vector of zeros in lieu of a face embedding.
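The remove-or-replicate resampling to the fixed 25 FPS grid can be sketched as nearest-neighbor selection over source frames; names and the exact selection rule are illustrative, and missing detections are assumed to already be represented as zero vectors.

```python
import numpy as np

def resample_embeddings(embs, src_fps, dst_fps=25):
    """Resample per-frame face embeddings to 25 FPS by nearest-neighbor
    selection: frames are replicated when src_fps < dst_fps and dropped
    when src_fps > dst_fps. embs: array of shape (T, D)."""
    embs = np.asarray(embs, dtype=np.float32)
    n_out = int(round(len(embs) * dst_fps / src_fps))
    idx = np.minimum(np.round(np.arange(n_out) * src_fps / dst_fps).astype(int),
                     len(embs) - 1)
    return embs[idx]
```

For a 3-second clip this yields the 75-embedding visual stream regardless of the source frame rate.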
We tested our method in a variety of conditions and also compared our results to state-of-the-art audio-only (AO) and audio-visual (AV) speech separation and enhancement, both quantitatively and qualitatively.
There are no publicly available state-of-the-art audio-only speech enhancement/separation systems, and relatively few publicly available datasets for training and evaluating audio-only speech enhancement. And although there is extensive literature on “blind source separation” for audio-only speech enhancement and separation [Comon and Jutten, 2010], most of these techniques require multiple audio channels (multiple microphones), and are therefore not applicable to our task. For these reasons, we implemented an AO baseline for speech enhancement which has a similar architecture to the audio stream in our audio-visual model (Fig. 4, when stripping out the visual streams). When trained and evaluated on the CHiME-2 dataset [Vincent et al., 2013], which is widely used for speech enhancement work, our AO baseline achieved a signal-to-distortion ratio of 14.6 dB, nearly as good as the state-of-the-art single channel result of 14.75 dB reported by Erdogan et al. . Our AO enhancement model is therefore deemed a near state-of-the-art baseline.
In order to compare our separation results to those of a state-of-the-art AO model, we implemented the permutation-invariant training introduced by Yu et al. . Note that speech separation using this method requires a priori knowledge of the number of sources present in the recording, and also requires manual assignment of each output channel to the face of its corresponding speaker in the video (which our AV method does automatically).
Since existing AV speech separation and enhancement methods are speaker dependent, we could not easily compare to them in our experiments on synthetic mixtures (Section 5.1), or run them on our natural videos (Section 5.2). However, we show quantitative comparisons with those methods on existing datasets by running our model on videos from those papers. We discuss this comparison in more detail in Section 5.3. In addition, we show qualitative comparisons in our supplementary material.
Table 3. SDR improvement (dB) for each speech separation task:

| Method | 1S + Noise | 2S clean | 2S + Noise | 3S clean |
|---|---|---|---|---|
| AO [Yu et al., 2017] | 16.0 | 8.6 | 10.0 | 8.6 |
| AV - 1 face | 16.0 | 9.9 | 10.1 | 9.1 |
| AV - 2 faces | – | 10.3 | 10.6 | 9.1 |
| AV - 3 faces | – | – | – | 10.0 |
We generated data for several different single-channel speech separation tasks. Each task requires its own unique configuration of mixtures of speech and non-speech background noise. We describe below the generation procedure for each variant of training data, as well as the relevant models for each task, which were trained from scratch.
In all cases, clean speech clips and corresponding faces are taken from our AVSpeech (AVS) dataset. Non-speech background noise is obtained from AudioSet [Gemmeke et al., 2017], a large-scale dataset of manually-annotated segments from YouTube videos. Separated speech quality is evaluated using signal-to-distortion ratio (SDR) improvement from the BSS Eval toolbox [Vincent et al., 2006], a commonly used metric for evaluating speech separation quality (see Section A in the Appendix).
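As a point of reference, a simplified SDR can be computed as below. This is not the full BSS Eval metric, which additionally allows a short distortion filter on the target before measuring error; the plain energy ratio here is for illustration only.

```python
import numpy as np

def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio: energy of the reference over
    the energy of the estimation error (no BSS Eval distortion filter)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    err = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(err ** 2) + 1e-12))

def sdr_improvement(reference, noisy_input, estimate):
    # "SDR improvement" compares the enhanced output against simply
    # passing the unprocessed mixture through.
    return sdr_db(reference, estimate) - sdr_db(reference, noisy_input)
```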
We extracted 3-second non-overlapping segments from the varying-length segments in our dataset (e.g., a 10-second segment contributes three 3-second segments). We generated 1.5 million synthetic mixtures for all the models and experiments. For each experiment, 90% of the generated data was used as the training set, and the remaining 10% as the test set. We did not use a validation set, as no parameter tuning or early stopping was performed.
This is a classic speech enhancement task, for which the training data was generated by a linear combination of unnormalized clean speech and AudioSet noise: Mix_i = AVS_j + 0.3 · Noise_k, where AVS_j is one utterance from AVS, Noise_k is one segment from AudioSet, and Mix_i is a sample in the generated dataset of synthetic mixtures. Our audio-only model performs quite well in this case, because the characteristic frequencies of noise are typically well separated from the characteristic frequencies of speech. Our audio-visual (AV) model performs as well as the audio-only (AO) baseline, with an SDR of 16 dB (first column of Table 3).
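The mixture construction described above amounts to a one-line waveform sum; the function name is illustrative.

```python
import numpy as np

def make_mixture(speech, noise, noise_gain=0.3):
    """One synthetic 1S+Noise training sample: unnormalized clean speech
    plus an AudioSet noise segment with its amplitude multiplied by 0.3."""
    n = min(len(speech), len(noise))  # align the 3-second segments
    return np.asarray(speech[:n], dtype=float) + noise_gain * np.asarray(noise[:n], dtype=float)
```

The multi-speaker variants in the following tasks are built the same way, by summing clean utterances from different source videos (with the noise term included or omitted as the task requires).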
The dataset for this two-speaker separation scenario was generated by mixing clean speech of two different speakers from our AVS dataset: Mix_i = AVS_j + AVS_k, where AVS_j and AVS_k are clean speech samples from different source videos in our dataset, and Mix_i is a sample in the generated dataset of synthetic mixtures. We trained two different AV models on this task, in addition to our AO baseline:
(1) A model which takes only one visual stream as input and outputs only its corresponding denoised signal. In this case, at inference, the denoised signal of each speaker is obtained by two forward passes through the network (one for each speaker). Averaging the SDR results of this model gives an improvement of 1.3 dB over our AO baseline (second column of Table 3).
(2) A model which takes visual information from both speakers as input, in two separate streams (as explained in Section 4). In this case, the output consists of two masks, one for each speaker, and inference is done with a single forward pass. An additional boost of 0.4 dB is obtained with this model, resulting in a 10.3 dB total SDR improvement. Intuitively, jointly processing two visual streams provides the network with more information and imposes more constraints on the separation task, thereby improving the results.
Fig. 5 shows the SDR improvement as a function of input SDR for this task, for both the audio-only baseline and our two-speaker audio-visual model.
Here, we consider the task of isolating one speaker’s voice from a mixture of two speakers and non-speech background noise. To the best of our knowledge, this audio-visual task has not been addressed before. The training data was generated by mixing clean speech of two different speakers (as generated for the 2S clean task) with background noise from AudioSet: Mix_i = AVS_j + AVS_k + 0.3 · Noise_l.
In this case we trained the AO network with three outputs, one for each speaker and one for background noise. In addition, we trained two different configurations of our model, with one and two visual streams received as input. The configuration of the one-stream AV model is the same as that of model (1) in the previous experiment. The two-stream AV model outputs three signals, one for each speaker and one for background noise. As can be seen in Table 3 (third column), the SDR gain of our one-stream AV model over the audio-only baseline is 0.1 dB, and 0.5 dB for two streams, bringing the total SDR improvement to 10.6 dB. Fig. 6 shows the inferred masks and output spectrograms for a sample segment from this task, along with its noisy input and ground truth spectrograms.
The dataset for this task is created by mixing clean speech from three different speakers: Mix_i = AVS_j + AVS_k + AVS_l. In a similar manner to the previous tasks, we trained our AV model with one, two and three visual streams as input, which output one, two and three signals, respectively.
We found that even when using a single visual stream, the AV model performs better than the AO model, with a 0.5 dB improvement over it. The two visual stream configuration gives the same improvement over the AO model, while using three visual streams leads to a gain of 1.4 dB, attaining a total 10 dB SDR improvement (fourth column of Table 3).
Many previous speech separation methods show a drop in performance when attempting to separate speech mixtures containing same-gender speech [Hershey et al., 2016; Delfarah and Wang, 2017]. Table 4 shows a breakdown of our separation quality by the different gender combinations. Interestingly, our model performs best (by a small margin) on female-female mixtures, but performs well on the other combinations as well, demonstrating its gender robustness.
In order to demonstrate our model’s speech separation capabilities in real-world scenarios, we tested it on an assortment of videos containing heated debates and interviews, noisy bars and screaming children (Fig. 7). In each scenario we use a trained model whose number of visual input streams matches the number of visible speakers in the video. For example, for a video with two visible speakers, a two-speaker model was used. We performed separation using a single forward pass per video, which our model supports, since our network architecture never enforces a specific temporal duration. This allows us to avoid the need to post-process and consolidate results on shorter chunks of the video. Because there is no clean reference audio for these examples, these results and their comparison to other methods are evaluated qualitatively; they are presented in our supplementary material. It should be noted that our method does not run in real time, and, in its current form, our speech enhancement is better suited for the post-processing stage of video editing.
The synthetic “Double Brady” video in our supplementary material highlights the utilization of visual information by our model, as it is very difficult to perform speech separation in this scenario using only characteristic speech frequencies contained in the audio.
The “Noisy Bar” scene shows a limitation of our approach in separating speech from mixtures with low SNR. In this case, the background noise is almost entirely suppressed, however output speech quality is noticeably degraded. Sun et al.  observed that this limitation stems from the use of a masking-based approach for separation, and that in this scenario, directly predicting the denoised spectrogram could help overcome this problem. In cases of classic speech enhancement, i.e. one speaker with non-speech background noise, our AV model obtains similar results to those of our strong AO baseline. We suspect this is because the characteristic frequencies of noise are typically well separated from the characteristic frequencies of speech, and therefore incorporating visual information does not provide additional discrimination capabilities.
Our evaluation would not be complete without comparing our results to those of previous work in AV speech separation and enhancement. Table 5 contains these comparisons on three different AV datasets, Mandarin, TCD-TIMIT and CUAVE, mentioned in Section 2, using the evaluation protocols and metrics described in the respective papers. The reported objective quality scores are PESQ [Rix et al., 2001], STOI [Taal et al., 2010] and SDR from the BSS eval toolbox [Vincent et al., 2006]. Qualitative results of these comparisons are available on our project page.
It is important to note that these prior methods require training a dedicated model for each speaker in their dataset (speaker dependent), whereas our evaluation on their data is done using a model trained on our general AVSpeech dataset (speaker independent). Despite having never encountered these particular speakers before, our results are significantly better than those reported in the original papers, indicating the strong generalization capability of our model.
[Table 5: comparison to previous audio-visual methods. Mandarin: Gabbay et al., Hou et al., Ours; TCD-TIMIT: Gabbay et al., Ours; CUAVE: Casanovas et al., Pu et al., Ours.]
While our focus in this paper is speech separation and enhancement, our method can also be useful for automatic speech recognition (ASR) and video transcription. As a proof of concept, we performed the following qualitative experiment. We uploaded our speech-separated results for the “Stand-Up” video to YouTube, and compared the resulting captions produced by YouTube’s automatic captioning (https://support.google.com/youtube/answer/6373554?hl=en) with those it produced for the corresponding source videos with mixed speech. For parts of the original “Stand-Up” video, the ASR system was unable to generate any captions in the mixed-speech segments; elsewhere, the generated captions included speech from both speakers, resulting in hard-to-read sentences. Captions produced on our separated speech results were noticeably more accurate. We show the full captioned videos in our supplementary material.
We also conducted extensive experiments to better understand the model’s behavior and how its different components affect the results.
In order to better understand the contribution of different parts of our model, we performed an ablation study on the task of speech separation from a mixture of two clean speakers (2S Clean). In addition to ablating several combinations of network modules (visual and audio streams, BLSTM and FC layers), we also investigated higher-level changes such as a different output mask (magnitude), the effect of reducing the learned visual features to one scalar per timestep, and a different fusion method (early fusion).
In the early fusion model, we do not have separate visual and audio streams, but rather combine the two modalities at the input. This is done by first using two fully connected layers to reduce the dimensionality of each visual embedding to match the spectrogram dimension at each timestep, then stacking the visual features as a third spectrogram “channel” and processing them jointly throughout the model.
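As a minimal sketch of this early-fusion tensor plumbing, the following NumPy snippet projects each per-timestep visual embedding down to the spectrogram's frequency dimension and stacks it as a third channel. The sizes (1024-dimensional embedding, 298 timesteps, 257 frequency bins, 512-unit hidden layer) and the random weights are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

T, F = 298, 257   # spectrogram timesteps and frequency bins (assumed sizes)
D = 1024          # face-embedding dimension (assumed)

def fc(x, w, b):
    """A single fully connected layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

# Two FC layers reduce each visual embedding to F values per timestep.
w1, b1 = rng.standard_normal((D, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.standard_normal((512, F)) * 0.01, np.zeros(F)

visual = rng.standard_normal((T, D))          # one embedding per timestep
visual_feat = fc(fc(visual, w1, b1), w2, b2)  # shape (T, F)

# Noisy input spectrogram: real and imaginary parts as two channels.
spec = rng.standard_normal((T, F, 2))

# Stack the projected visual features as a third spectrogram "channel",
# to be processed jointly by the rest of the network.
fused = np.concatenate([spec, visual_feat[..., None]], axis=-1)
```

From here, `fused` would be fed to a single joint stream instead of separate visual and audio streams.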
Table 6 shows the results of our ablation study. The table includes evaluation using SDR and ViSQOL [Hines et al., 2015], an objective measure intended to approximate human listener mean opinion scores (MOS) of speech quality. The ViSQOL scores were calculated on a random 2000 sample subset of our testing data. We found that SDR correlates well with the amount of noise left in the separated audio, and ViSQOL is a better indicator of output speech quality. See Section A in the Appendix for more details on these scores. “Oracle” RMs and cRMs are masks obtained as described in Section 4.1, by using the ground truth real-valued and complex-valued spectrograms, respectively.
The most interesting findings of this study are the drop in MOS when using a real-valued magnitude mask rather than a complex one, and the surprising effectiveness of squeezing the visual information into one scalar per timestep, described below.
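A toy NumPy example illustrates why the complex mask matters: an oracle complex ratio mask reconstructs the clean source exactly (it corrects phase as well as magnitude), whereas an oracle real-valued magnitude mask inherits the mixture's phase and so only approximates the clean signal. The spectrogram shapes and values are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy complex spectrograms for a clean source and an interfering signal.
clean = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
noise = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
mix = clean + noise

# Oracle complex ratio mask (cRM): element-wise complex division.
# Applying it to the mixture recovers the clean spectrogram exactly.
crm = clean / mix
rec_crm = crm * mix

# Oracle real-valued magnitude mask (RM): scales magnitudes only,
# so the reconstruction keeps the mixture's (wrong) phase.
rm = np.abs(clean) / np.maximum(np.abs(mix), 1e-8)
rec_rm = rm * mix

phase_error = np.mean(np.abs(rec_rm - clean))  # nonzero residual
```

This phase residual is one plausible explanation for the MOS drop observed with the magnitude mask.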
In our ablation analysis we found that a network which squeezes the visual information into a bottleneck of one scalar per timestep (“Bottleneck (cRM)”) performs almost as well (only 0.5 dB less) as our full model (“Full model (cRM)”) that uses 64 scalars per timestep.
Our model uses face embeddings as the input visual representation (Section 4.1). We want to gain insight into the information captured in these high-level features and to identify which regions of the input frames the model uses for separating the speech. To this end, we follow a protocol similar to that of [Zhou et al., 2014; Zeiler and Fergus, 2014] for visualizing receptive fields of deep networks, extending it from 2D images to 3D (space-time) video. More specifically, we slide a space-time patch occluder over the video (we use a 200 ms temporal length to cover the typical range of phoneme duration, 30–200 ms). For each occluder position, we feed the occluded video into our model and compare the speech separation result, S_o, with the result obtained on the original (non-occluded) video, S. To quantify the difference between the network outputs, we use SNR, treating the result without the occluder as the “signal” (we refer the reader to the supplementary material to validate that our separated speech on the non-occluded video, which we treat as “correct” here, is indeed accurate). That is, for each space-time patch, we compute: SNR = 10 log10(||S||^2 / ||S − S_o||^2).
Repeating this process for all space-time patches in a video results in a heat map for each frame. For visualization purposes we normalize the heat maps by the maximum SNR for the video: H = 1 − SNR / SNR_max. In H, high values correspond to patches with high impact on the speech separation result.
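The occlusion analysis above can be sketched as follows, with a trivial stand-in for the separation network and toy dimensions. The video shape, occluder size, and the choice of a zero-valued occluder are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def model(video):
    """Stand-in for the separation network: any deterministic map
    from an input video to a separated-speech output."""
    return np.tanh(video).mean(axis=(1, 2))

T, H, W = 8, 16, 16   # toy video dimensions (assumed)
Pt, Ps = 2, 4         # temporal / spatial occluder extent (assumed)
video = rng.standard_normal((T, H, W))
S = model(video)      # result on the non-occluded video (the "signal")

# One SNR value per occluder position (sliding-window, stride 1).
snr = np.zeros((T - Pt + 1, H - Ps + 1, W - Ps + 1))
for t in range(snr.shape[0]):
    for y in range(snr.shape[1]):
        for x in range(snr.shape[2]):
            occ = video.copy()
            occ[t:t + Pt, y:y + Ps, x:x + Ps] = 0.0  # occlude one patch
            S_o = model(occ)
            snr[t, y, x] = 10.0 * np.log10(
                np.sum(S**2) / (np.sum((S - S_o)**2) + 1e-12))

# Normalize so that high values mark high-impact patches.
heat = 1.0 - snr / snr.max()
```

Averaging `heat` over its temporal axis then yields a per-frame map for visualization.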
In Fig. 8 we show the resulting heat maps for representative frames from several videos (the full heat map videos are available on our project page). As expected, the facial regions that contribute the most are located around the mouth, yet the visualization reveals that other areas such as the eyes and cheeks contribute as well.
We further tested the contribution of visual information to the model by gradual elimination of visual embeddings. Specifically, we start by running the model and evaluating the speech separation quality using visual information for the full 3 second video. We then gradually discard embeddings from both ends of the segment, and re-evaluate the separation quality with visual durations of 2, 1, 0.5 and 0.2 seconds.
The results are shown in Fig. 9. Interestingly, the speech separation quality is reduced by only dB on average when dropping as much as of the visual embeddings in the segments. This shows the robustness of the model to missing visual information, which may occur in real world scenarios due to head motion or occlusions.
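This evaluation protocol can be sketched as below, assuming 3-second segments of embeddings at 25 fps and that discarded embeddings are zeroed out; the frame rate and the zeroing strategy are our assumptions for illustration:

```python
import numpy as np

def keep_center(embeddings, keep_seconds, fps=25.0):
    """Zero out visual embeddings outside a centered temporal window,
    simulating visual input discarded from both ends of a segment."""
    T = len(embeddings)
    keep = int(round(keep_seconds * fps))
    start = (T - keep) // 2
    out = np.zeros_like(embeddings)
    out[start:start + keep] = embeddings[start:start + keep]
    return out

emb = np.ones((75, 1024))  # 3 s of embeddings at 25 fps (assumed)
retained = {}
for secs in (3.0, 2.0, 1.0, 0.5, 0.2):
    trimmed = keep_center(emb, secs)
    # Fraction of timesteps that still carry visual information.
    retained[secs] = trimmed.any(axis=1).mean()
```

Separation quality would then be re-evaluated with each `trimmed` sequence in place of the full embeddings.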
We proposed an audio-visual neural network-based model for single-channel, speaker-independent speech separation. Our model works well in challenging scenarios, including multi-speaker mixtures with background noise. To train the model, we created a new audio-visual dataset with thousands of hours of video segments containing visible speakers and clean speech we collected from the Web. We showed state-of-the-art results on speech separation as well as a potential application to video captioning and speech recognition. We also conducted extensive experiments to analyze the behavior of our model and its components.
We would like to thank Yossi Matias and Google Research Israel for their support for the project, and John Hershey for his valuable feedback. We also thank Arkady Ziefman for his help with figure design and video editing, and Rachel Soh for helping us procure permissions for video content in our results.
Handbook of Blind Source Separation: Independent component analysis and applications. Academic press.
ICCV 2017 Workshop on Computer Vision for Audio-Visual Media.
Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015).
Audio-visual sound separation via hidden Markov models. In Advances in Neural Information Processing Systems. 1173–1180.
Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks. IEEE Transactions on Emerging Topics in Computational Intelligence 2, 2 (2018), 117–128.
The signal-to-distortion ratio (SDR), introduced by Vincent et al., is one of a family of measures designed to evaluate Blind Audio Source Separation (BASS) algorithms, where the original source signals are available as ground truth. The measures are based on the decomposition of each estimated source signal into a true source part (s_target) plus error terms corresponding to interferences (e_interf), additive noise (e_noise) and algorithmic artifacts (e_artif).
SDR is the most general score, commonly reported for speech separation algorithms. It is measured in dB, and is defined as: SDR = 10 log10(||s_target||^2 / ||e_interf + e_noise + e_artif||^2).
We refer the reader to the original paper for details on signal decomposition into its components. We found this measure to correlate well with the amount of noise left in the separated audio.
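Given the decomposition above, SDR can be computed directly once the error terms are known; the synthetic signal and noise levels below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
n, fs = 16000, 16000  # 1 second at 16 kHz (illustrative)

# Toy decomposition of an estimated source, following Vincent et al. [2006]:
# s_hat = s_target + e_interf + e_noise + e_artif.
s_target = np.sin(2 * np.pi * 440 * np.arange(n) / fs)
e_interf = 0.10 * rng.standard_normal(n)  # leakage from other sources
e_noise = 0.05 * rng.standard_normal(n)   # sensor/background noise
e_artif = 0.02 * rng.standard_normal(n)   # algorithmic artifacts

def sdr_db(s_target, e_interf, e_noise, e_artif):
    """SDR = 10 log10(||s_target||^2 / ||e_interf + e_noise + e_artif||^2)."""
    err = e_interf + e_noise + e_artif
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum(err**2))

sdr = sdr_db(s_target, e_interf, e_noise, e_artif)  # roughly 16 dB here
```

In practice the BSS Eval toolbox also performs the decomposition itself via projections onto the reference sources; the sketch above assumes the decomposition is already given.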
The Virtual Speech Quality Objective Listener (ViSQOL) is an objective speech quality model, introduced by Hines et al. The metric models human speech quality perception using a spectro-temporal measure of similarity between a reference (r) and a degraded (d) speech signal, and is based on the Neurogram Similarity Index Measure (NSIM) [Hines and Harte, 2012]. NSIM is defined as
NSIM(r, d) = ((2 μ_r μ_d + C1) / (μ_r^2 + μ_d^2 + C1)) · ((σ_rd + C2) / (σ_r σ_d + C2)),
where the μ’s and σ’s are mean and (cross-)correlation coefficients, respectively, calculated between reference and degraded spectrograms, and C1, C2 are small stabilizing constants.
In ViSQOL, NSIM is calculated on spectrogram patches of the reference signal and their corresponding patches from the degraded signal. The algorithm subsequently aggregates and translates the NSIM scores into a mean opinion score (MOS) between 1 and 5.
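A sketch of the NSIM computation on a single pair of spectrogram patches follows; the stabilizing constants and the patch contents are illustrative assumptions:

```python
import numpy as np

def nsim(r, d, c1=0.01, c2=0.03):
    """NSIM between a reference and a degraded spectrogram patch:
    a mean-intensity term times a structure (correlation) term.
    The constants c1, c2 stabilize near-zero denominators (assumed values)."""
    mu_r, mu_d = r.mean(), d.mean()
    sig_r, sig_d = r.std(), d.std()
    sig_rd = ((r - mu_r) * (d - mu_d)).mean()  # cross-covariance
    intensity = (2 * mu_r * mu_d + c1) / (mu_r**2 + mu_d**2 + c1)
    structure = (sig_rd + c2) / (sig_r * sig_d + c2)
    return intensity * structure

rng = np.random.default_rng(4)
ref = rng.random((30, 30))               # toy reference spectrogram patch
deg = ref + 0.3 * rng.random((30, 30))   # degraded version of the patch
score_same = nsim(ref, ref)              # identical patches score 1.0
score_deg = nsim(ref, deg)               # degradation lowers the score
```

ViSQOL would compute such scores over many aligned patches and map the aggregate to a 1–5 MOS scale.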