1 Introduction
* This work was mainly performed when Singh and Basak were interns at Sony.
The problem of separating a voice from other sources, such as speech separation (separating multiple overlapping speech signals) [1, 2, 3, 4] and singing voice separation (separating vocals from other instrumental sounds) [5, 6], has been actively investigated for decades. Various approaches, including spectral clustering, computational auditory scene analysis, and non-negative matrix factorization (NMF) [9, 10, 11, 12], have been proposed to tackle this problem. Recent advances in deep-learning-based methods have dramatically improved the accuracy of separation, and in some constrained scenarios, some methods perform nearly as well as or even better than ideal mask methods, which are used as theoretical upper baselines. However, the performance drops significantly in more challenging and realistic scenarios, such as separation in a noisy and reverberant environment or with limited training data availability.
There have been several attempts to improve voice separation in challenging scenarios by incorporating auxiliary information. In [15, 16], visual cues were used to incorporate a speaker’s lip movement, which correlates with speech, and an improvement over an audio-only approach was observed in a noisy environment. However, such visual information is not always available owing to occlusion, low light levels, or the unavailability of a camera. In [17, 18], guided (or informed) source separation was proposed, where voice activity information was used to indicate to a voice separation model when the target speech is active. However, such binary information is only marginally informative, and the phonetic aspect of speech is largely neglected.
In this work, we propose to explicitly incorporate the phonetic nature of a voice using a transfer learning approach. Transfer learning, in which a deep neural network model is trained on a task for which a large dataset is available and its hidden layer representation is transferred to another task that might not have enough data, has been shown to be effective in an array of tasks, including those in the audio domain [19, 20, 21]. Here, we propose to utilize an end-to-end automatic speech recognition (E2EASR) system trained on a large speech corpus, and transfer deep features extracted from E2EASR (E2EASR features) to a voice separation task. E2EASR aims to combine all submodules of an ASR system, such as acoustic models, pronunciation models, and language models. Therefore, E2EASR is designed to model longer dependences, and hence E2EASR features are expected to contain longer-range linguistic information than the outputs of conventional acoustic models, which typically accept a few frames as input. If the phonetic and linguistic information is known, voice separation can more robustly estimate the target voice spectrogram under a noisy or unseen condition by leveraging prior knowledge of the phone-dependent spectrogram shape. For example, knowing that the current spectrum corresponds to a fricative, one can expect the target spectrum to contain high-frequency components even when this is unclear from the noisy spectrogram. There are prior works that used a conventional ASR for a speech enhancement task (separating speech from noise). Raj et al. proposed phone-dependent NMF, where the bases of NMF were pre-trained on each phoneme and noise, and ASR was used to choose which bases to use for speech reconstruction. Wang et al. proposed a DNN-based phoneme-specific voice separation approach, where DNN models were trained for each phoneme and ASR was used to choose the model. However, these NMF basis and DNN model selection approaches treat each phoneme independently and largely ignore the context of the sequence of phonemes, making it difficult to incorporate long-term dependence. Moreover, such a hard model-switching mechanism relies heavily on the accuracy of ASR and may produce artifacts due to misclassification or discontinuity. In contrast, our method utilizes a single model to separate entire utterances and continuously incorporates phonetic information. This allows us to model longer dependences.
The contributions of this work are summarized as follows:
-  We propose a transfer-learning-based approach to incorporate the phonetic and linguistic nature of speech for voice separation. To this end, we propose the E2EASR features.
-  We evaluate the proposed method on a simultaneous speech separation and enhancement task using the AVSpeech and AudioSet datasets, whose audio is recorded in non-controlled environments, and show that the proposed method significantly improves the separation accuracy over a model trained without E2EASR features and a model trained with visual features.
-  We further show that even though E2EASR is trained on standard speech, it transfers robustly to a singing voice separation task with a limited amount of data.
2 End-to-end ASR feature
To capture phonetic and linguistic information, it is important to model the long-term dependences of an utterance. Conventional ASRs typically decompose the problem into acoustic modeling, pronunciation modeling, and linguistic modeling by assuming conditional independence between observations. The acoustic models typically accept only a few adjacent frames to estimate the posterior probabilities of phonemes (or alternative subword units) for each frame, and then a hidden Markov model (HMM) is used to model the sequence of phonemes. However, this mechanism limits the modeling capabilities and requires expensive handcrafted linguistic resources such as a pronunciation dictionary, tokenization, or phonetic context-dependency trees. E2EASR attempts to overcome these problems by combining all modules and learning a unified model from only orthographic transcriptions, which are easy to obtain. Watanabe et al. alleviated the conditional independence assumption by introducing a hybrid CTC/attention architecture that allows long-term dependences to be modeled. This motivated us to use this model for transfer learning in the voice separation task, since we expect that the single model can provide phonetic and linguistic information. Furthermore, since the model is fully DNN-based, the voice separation and E2EASR models could potentially be trained jointly by standard back-propagation. However, in this work, we adopt the deep feature approach since it is easy to use for many different tasks such as speech separation, speech enhancement, and singing voice separation.
Deep features are a convenient yet powerful means of transfer learning for a DNN-based model. Many deep features in the image, video, or audio domains are extracted from the activations of the last few fully connected (fc) layers of DNN models trained on a classification task in which the input size is fixed. Here, we want the E2EASR features to maintain their time resolution while capturing long-term dependences. Therefore, deep features are extracted from the encoder output to preserve the time alignment, as shown in Figure 1. Another possibility would be to use attention weights as deep features, which we leave as future work.
3 Voice Separation with E2EASR feature
The voice separation process using E2EASR features is illustrated in Figure 2. The proposed E2EASR features are passed to the voice separation model along with the input mixture to incorporate phonetic and linguistic information. The source separation model is equipped with a domain translation network that converts the E2EASR features to a representation suitable for voice separation, and the output of the domain translation network is concatenated with the audio along the channel dimension. The domain translation network is trained jointly with the voice separation model on the voice separation task. During training, we extract the E2EASR features from the target (oracle) speech. At inference time, since the target speech is unknown, we first separate the voice without using E2EASR features. For this, we can use another voice separation model that does not use E2EASR features. However, to avoid the need for an additional model, we can instead use the model that uses E2EASR features but feed zeros as the E2EASR features in the initial stage. Since deep features tend to have a sparse representation, feeding zeros as E2EASR features does not disrupt the results. We found that this approach provides results comparable to those of the model trained with audio only. After the initial voice separation, we extract the E2EASR features from the separated voices and feed the features to the voice separation model again together with the input mixture. Note that we use the oracle target speech during training so that the voice separation model does not recursively depend on its own separation quality through E2EASR features extracted from separated sources. This could cause a mismatch in the E2EASR feature quality between training and inference, since the estimated source is used for inference. To close the gap, we can iterate the voice separation and E2EASR feature extraction to progressively enhance the feature quality.
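The inference procedure described above can be sketched as a short loop. This is only an illustrative sketch under stated assumptions: `separate` and `extract_features` are hypothetical callables standing in for the voice separation model and the E2EASR encoder, a single target voice is assumed for brevity, and the number of iterations is a free parameter that the text does not specify.

```python
import numpy as np


def iterative_separation(mixture, separate, extract_features,
                         feature_shape, num_iterations=1):
    """Separate a voice by alternating separation and feature extraction.

    mixture          : 1-D array holding the input mixture waveform.
    separate         : hypothetical callable (mixture, features) -> estimate.
    extract_features : hypothetical callable (estimate) -> features.
    feature_shape    : shape of the feature array expected by `separate`.
    num_iterations   : number of refinement passes after the initial pass.
    """
    # Initial stage: the target speech is unknown, so feed all-zero features.
    features = np.zeros(feature_shape, dtype=mixture.dtype)
    estimate = separate(mixture, features)

    # Refinement: extract features from the current estimate and separate again.
    for _ in range(num_iterations):
        features = extract_features(estimate)
        estimate = separate(mixture, features)
    return estimate
```

With `num_iterations=1` the loop corresponds to the two-stage procedure (zero features, then features from the first estimate); larger values correspond to iterating the two steps to reduce the mismatch between training and inference.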
We evaluated the proposed approach on a speech separation task and a singing voice separation (SVS) task. The experiments on speech separation were designed to evaluate the noisy speech scenario, while the experiments on SVS examined the case of limited data availability, which is the situation the community actually faces.
4.1 Speech separation
To evaluate the proposed method under a realistic and challenging environment, we first conducted experiments on a single-channel speaker-independent simultaneous speech separation and enhancement task, where the goal is to separate multiple overlapping speech signals while removing the noise. For the speech dataset, we used the AVSpeech dataset, which consists of 4700 hours of YouTube videos of lectures and how-to videos. We used a subset of the dataset: 100 hours for training and 15 hours for testing. The audio in the AVSpeech dataset was recorded in less controlled environments than that in the WSJ0 dataset, which is often used to evaluate speech separation methods, making the task more challenging. Indeed, we found that a speech separation model that performs well on the WSJ0 dataset performs poorly on the AVSpeech dataset. To make the problem more challenging, we also added noise from the AudioSet dataset, which is a collection of 10-second sound clips drawn from YouTube videos. We omitted the classes likely to contain human voices. After processing, the training and testing sets consisted of 105 and 11 hours of audio, respectively. The data were preprocessed in a manner similar to , where 3-second, sequential, non-overlapping crops of audio from the AVSpeech and AudioSet datasets were extracted and normalized to have a maximum amplitude of 1 for the speech and 0.3 for the noise. Mixtures were created by randomly mixing two different speakers and the noise. The LibriSpeech corpus, which consists of 960 hours of speech, was used for training E2EASR.
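The cropping, peak normalization, and mixing steps described above can be sketched as follows. This is a minimal illustration, not the actual data pipeline: the function names and the 16 kHz sampling rate in the usage example are assumptions, and the random selection of speakers and noise clips is left to the caller.

```python
import numpy as np


def crop_non_overlapping(audio, sample_rate, seconds=3.0):
    """Split a waveform into sequential, non-overlapping fixed-length crops."""
    crop_len = int(round(seconds * sample_rate))
    num_crops = len(audio) // crop_len  # any trailing remainder is dropped
    return [audio[i * crop_len:(i + 1) * crop_len] for i in range(num_crops)]


def peak_normalize(x, peak):
    """Scale a waveform so that its maximum absolute amplitude equals `peak`."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()  # silent clip: nothing to scale
    return x * (peak / max_abs)


def make_mixture(speech_a, speech_b, noise):
    """Mix two speakers (peak 1.0 each) with noise (peak 0.3)."""
    s1 = peak_normalize(speech_a, 1.0)
    s2 = peak_normalize(speech_b, 1.0)
    nz = peak_normalize(noise, 0.3)
    # The normalized speech signals serve as the separation targets.
    return s1 + s2 + nz, (s1, s2)


# Usage with synthetic signals standing in for real recordings.
rng = np.random.default_rng(0)
sample_rate = 16000  # assumed sampling rate, not stated in the text
speech = rng.standard_normal(sample_rate * 7)
crops = crop_non_overlapping(speech, sample_rate)
mixture, targets = make_mixture(crops[0], crops[1],
                                rng.standard_normal(len(crops[0])))
```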
4.1.2 Model Architecture
For the voice separation network architecture, we adopted TASNet, which is a recently proposed time-domain speech separation network that produced state-of-the-art results on the WSJ0-2mix and WSJ0-3mix datasets. Architecture details are illustrated in Figure 3. E2EASR features were interpolated after passing through the domain translation network to have the same frame rate as the encoder outputs of TASNet. The domain translation network comprised six 1-D convolution layers with filters and a filter size of . Then, its output was concatenated with the encoder output and passed to the separation branch for mask prediction. We used the baseline visual feature model, which is described in the next section, as the initial voice separation model. For E2EASR, we used the ESPnet framework, in which the hybrid CTC/attention model is available.
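The frame-rate matching and channel-wise concatenation described above can be illustrated with a small NumPy sketch. The interpolation method is not stated in the text, so linear interpolation is an assumption here, as are the function names and the array sizes in the usage example.

```python
import numpy as np


def interpolate_time(features, num_frames):
    """Linearly resample features of shape (channels, T_in) to num_frames."""
    channels, t_in = features.shape
    src = np.linspace(0.0, 1.0, t_in)        # original frame positions
    dst = np.linspace(0.0, 1.0, num_frames)  # target frame positions
    return np.stack([np.interp(dst, src, features[c])
                     for c in range(channels)])


def concat_with_encoder(encoder_out, translated):
    """Concatenate translated features with encoder output along channels.

    encoder_out : (C_enc, T_enc) output of the separation model's encoder.
    translated  : (C_feat, T_feat) output of the domain translation network.
    """
    resampled = interpolate_time(translated, encoder_out.shape[1])
    return np.concatenate([encoder_out, resampled], axis=0)


# Usage with hypothetical sizes: 256 encoder channels over 100 frames,
# 32 translated feature channels over 10 frames.
encoder_out = np.zeros((256, 100))
translated = np.arange(320, dtype=float).reshape(32, 10)
combined = concat_with_encoder(encoder_out, translated)  # shape (288, 100)
```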
We considered two baseline models. For the lower baseline, we trained TASNet without extra features using permutation invariant training (PIT). To assess the quality of the E2EASR features, we also considered visual features extracted as follows: the lip regions were cropped from the video after the faces were aligned using facial landmarks obtained with the method of Kazemi and Sullivan, and reshaped to dimensions of (96, 96). Then an autoencoder consisting of 3 convolution layers followed by 2 linear layers and 3 transposed convolution layers was trained on lip images at the frame level. The bottleneck layer activations were used as visual deep features. We also tried using the visual features in [16, 15]; however, we found that these were less effective owing to the unavailability of the training data used in [16, 15]. The network architecture was the same as that of the proposed model with E2EASR features.
Table 1 compares the scale-invariant signal-to-distortion ratio improvement (SI-SDRi). As shown in the table, providing additional features to the source separation model generally improves the performance. The visual feature provides dB improvement over the audio-only lower baseline model. This result is similar to , although the metric is slightly different. Our proposed method significantly outperformed the visual feature model, providing a 3.2 dB improvement over the lower baseline. This is somewhat surprising, since the proposed method, which uses a single modality, significantly outperformed the audio-visual model. To further assess the effect of the mismatch in E2EASR feature quality between training and inference, we also evaluated the oracle E2EASR features, where the features were extracted from the target signal, which is not available in a real scenario. The SI-SDRi difference between the oracle and estimated E2EASR features was only 0.2 dB. This indicates that E2EASR features are robustly extracted from degraded voices, and are therefore less sensitive to the quality of the initial voice separation model.
| Method | SI-SDRi [dB] |
|---|---|
| No extra features (PIT) | 9.0 |
| Estimated E2EASR feature | 12.5 |
| Oracle E2EASR feature | 12.7 |
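For reference, the scale-invariant SDR and its improvement over the unprocessed mixture can be computed as in the sketch below. It follows the commonly used definition, in which the estimate is projected onto the target; removing the mean of both signals and the small constant `eps` are conventional choices assumed here, and this is not claimed to be the evaluation code used for the table.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target that best explains the estimate.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (
        np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, target, mixture):
    """SI-SDRi: gain of the estimate over using the mixture as the estimate."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


# Usage with synthetic signals: a sine target, a noisy mixture, and a
# cleaner estimate.
rng = np.random.default_rng(0)
target = np.sin(np.linspace(0.0, 100.0, 8000))
noise = rng.standard_normal(8000)
mixture = target + noise
estimate = target + 0.1 * noise
improvement = si_sdr_improvement(estimate, target, mixture)
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is what makes the metric scale-invariant.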
4.2 Singing Voice Separation
We further evaluated the proposed method on an SVS task using the MUSDB dataset, prepared for SiSEC 2018. MUSDB has 100 and 50 songs in the Dev and Test sets, respectively, recorded in stereo format at a 44.1 kHz sampling rate. Even though MUSDB is one of the largest professionally recorded datasets for the SVS task, it has only about 6.7 hours of data for training, which is very small compared with the AVSpeech dataset, which has about 4700 hours of data. Moreover, this task was much more difficult, since the E2EASR model was trained on standard speech and then transferred to singing voice data, which has a significant domain mismatch as it is more variable in fundamental frequency range, timbre, tempo, and dynamics.
We used multi-scale DenseNet (MDenseNet) for the voice separation network. Since our goal is not to achieve state-of-the-art performance but to show the effectiveness of the E2EASR features, we converted the input magnitude spectrogram to a 128-band mel spectrogram to reduce the dimension. The E2EASR features were passed to the domain translation network and concatenated with the audio after the initial convolution layer of MDenseNet. For the initial voice separation, we used the same model, providing all-zero values as dummy E2EASR features. This allows us to avoid the need for an additional model. We used the same E2EASR model as in the previous experiment; it was not trained on a singing voice dataset.
We calculated the SDR improvement using the museval package and compared the results in Table 2. As a baseline, we consider the MDenseNet model trained without E2EASR features. Even though E2EASR was not trained on singing voice data, the E2EASR features led to an improved SDR over the baseline. Moreover, the difference between the oracle and estimated E2EASR features is marginal, indicating the robustness of the E2EASR features against the initial voice separation quality.
| Method | SDR [dB] |
|---|---|
| Baseline (w/o feature) | 8.65 |
| Estimated E2EASR feature | 8.80 |
| Oracle E2EASR feature | 8.82 |
We proposed a transfer learning approach to leverage the E2EASR model for voice separation. Experimental results on the simultaneous speech separation and enhancement task show that the proposed E2EASR features provide a significant improvement over baselines, including a model using visual cues. We further showed that the E2EASR features improve the performance of SVS, demonstrating their robustness against limited data availability and domain mismatch.
-  J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP, 2016, pp. 31–35.
-  M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, pp. 1901–1913, 2017.
-  Y. Luo and N. Mesgarani, “Tasnet: Surpassing ideal time-frequency masking for speech separation,” CoRR, 2018.
-  N. Takahashi, P. Sudarsanam, N. Goswami, and Y. Mitsufuji, “Recursive speech separation for unknown number of speakers,” in Proc. Interspeech, 2019.
-  N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji, “Phasenet: Discretized phase modeling with deep neural networks for audio source separation,” in Proc. Interspeech, 2018, pp. 3244–3248.
-  N. Takahashi and Y. Mitsufuji, “Multi-scale Multi-band DenseNets for Audio Source Separation,” in Proc. WASPAA, 2017, pp. 261–265.
-  F. R. Bach and M. I. Jordan, “Learning Spectral Clustering, with Application to Speech Separation,” The Journal of Machine Learning Research, vol. 7, pp. 1963–2001, Jan 2006.
-  K. Hu and D. Wang, “An Unsupervised Approach to Cochannel Speech Separation,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 21, pp. 122–131, Jan 2013.
-  M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” in Proc. Interspeech, 2006.
-  T. Virtanen and A. T. Cemgil, “Mixtures of gamma priors for non-negative matrix factorization based speech separation,” in Proc. Independent Component Analysis and Signal Separation (ICA), 2009, pp. 646–653.
-  G. J. Mysore and P. Smaragdis, “A non-negative approach to language informed speech separation,” in Proc. Latent Variable Analysis and Signal Separation (LVA/ICA), 2012, pp. 356–363.
-  Z. Wang and F. Sha, “Discriminative non-negative matrix factorization for single-channel speech separation,” in Proc. ICASSP, 2014, pp. 3749–3753.
-  G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “WHAM!: Extending Speech Separation to Noisy Environments,” in Proc. Interspeech, 2019.
-  N. Takahashi, N. Goswami, and Y. Mitsufuji, “MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation,” in Proc. IWAENC, 2018.
-  T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” in Proc. Interspeech, 2018, pp. 3244–3248.
-  A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” in Proc. SIGGRAPH, 2018.
-  N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, and R. Haeb-Umbach, “Guided source separation meets a strong asr backend: Hitachi/paderborn university joint investigation for dinner party asr,” in Proc. Interspeech, 2019.
-  T. Chan, T. Yeh, Z. Fan, H. Chen, L. Su, Y. Yang, and R. Jang, “Vocal activity informed singing voice separation with the ikala dataset,” in Proc. ICASSP, April 2015, pp. 718–722.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in Proc. ICML, 2014.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatiotemporal Features with 3D Convolutional Networks,” in Proc. ICCV, 2014, pp. 4489–4497.
-  N. Takahashi, M. Gygli, and L. Van Gool, “Aenet: Learning deep audio features for video analysis,” IEEE Trans. on Multimedia, vol. 20, pp. 513–524, 2017.
-  B. Raj, R. Singh, and T. Virtanen, “Phoneme-dependent nmf for speech enhancement in monaural mixtures,” in Proc. Interspeech, 2011.
-  Z.-Q. Wang, Y. Zhao, and D. Wang, “Phoneme-specific speech separation,” in Proc. ICASSP, 2016.
-  N. Takahashi, T. Naghibi, and B. Pfister, “Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks,” in Proc. Interspeech, 2016.
-  S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, Dec 2017.
-  J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in Proc. ICASSP, April 2015, pp. 5206–5210.
-  S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-End Speech Processing Toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
-  V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in Proc. CVPR, 2014, pp. 1867–1874.
-  A. Liutkus, F.-R. Stöter, and N. Ito, “The 2018 signal separation evaluation campaign,” in Proc. LVA/ICA, 2018.