End-to-End Multi-speaker Speech Recognition with Transformer

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence, so as to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, in the anechoic single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction.


1 Introduction

Deep learning techniques have dramatically improved the performance of separation and automatic speech recognition (ASR) tasks related to the cocktail party problem [4], where the speech from multiple speakers overlaps. Two main scenarios are typically considered, single-channel and multi-channel. In single-channel speech separation, various methods have been proposed, among which deep clustering (DPCL) based methods [10] and permutation invariant training (PIT) based methods [31] are the dominant ones. For ASR, methods combining separation with single-speaker ASR as well as methods skipping the explicit separation step and building directly a multi-speaker speech recognition system have been proposed, using either the hybrid ASR framework [30, 1, 17] or the end-to-end ASR framework [22, 21, 2]. In the multi-channel condition, the spatial information derived from the inter-channel differences can help distinguish between speech sources from different directions, which makes the problem easier to solve. Several methods have been proposed for multi-channel speech separation, including DPCL-based methods using integrated beamforming [6] or inter-channel spatial features [25], and a PIT-based method using a multi-speaker mask-based beamformer [27]. For multi-channel multi-speaker speech recognition, an end-to-end system was proposed in [3], called MIMO-Speech because of the multi-channel input (MI) and multi-speaker output (MO). This system consists of a mask-based neural beamformer frontend, which explicitly separates the multi-speaker speech via beamforming, and an end-to-end speech recognition model backend based on the joint CTC/attention-based encoder-decoder [15] to recognize the separated speech streams. This end-to-end architecture is optimized via only the connectionist temporal classification (CTC) and cross-entropy (CE) losses in the backend ASR, but is nonetheless able to learn to develop relatively good separation abilities.

Recently, Transformer models [24] have shown impressive performance in many tasks, such as pretrained language models [20, 5], end-to-end speech recognition [14, 13], and speaker diarization [9], surpassing models based on long short-term memory recurrent neural networks (LSTM-RNNs). One of the key components of the Transformer model is self-attention, which computes the contribution of the whole input sequence to each position and maps the sequence into a new representation at every time step. Even though the Transformer model is powerful, it is usually not computationally practical when the sequence length is very long. It also needs adaptation for specific tasks, such as the subsampling operation in encoder-decoder based end-to-end speech recognition. However, for signal-level processing tasks such as speech separation and enhancement, subsampling is usually not a good option, because these tasks need to maintain the original time resolution.
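As a rough illustration of this cost (the numbers below are illustrative and not taken from the paper), the attention weight matrix alone grows quadratically with the number of frames when no subsampling is applied, which is what motivates the time-restricted attention described later:

```python
# Back-of-the-envelope estimate of the full self-attention cost for one
# utterance at STFT frame resolution (10 ms hop, no subsampling).
# The head count and window size are illustrative assumptions.

def attention_matrix_bytes(num_frames: int, num_heads: int = 4,
                           bytes_per_value: int = 4) -> int:
    """Memory for the (num_frames x num_frames) attention weights of all heads."""
    return num_heads * num_frames * num_frames * bytes_per_value

# A 30 s mixture at a 10 ms hop is ~3000 frames.
frames = 3000
print(f"{attention_matrix_bytes(frames) / 2**20:.1f} MiB per layer")  # ~137 MiB

# Restricting attention to a local window of w frames scales this down
# from O(T^2) to O(T * w).
window = 31
print(f"{attention_matrix_bytes(frames) * window / frames / 2**20:.2f} MiB per layer")
```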

In this paper, we explore the use of Transformer models for end-to-end multi-speaker speech recognition in both the single-channel and multi-channel scenarios. First, we replace the LSTMs in the encoder-decoder network of the speech recognition module with Transformers for both scenarios. Second, in order to also apply Transformers in the masking network of the neural beamforming module in the multi-channel case, we modify the self-attention layers to reduce their memory consumption in a time-restricted (or local) manner, as used in [16, 19, 1]. To the best of our knowledge, this work is the first attempt to use the Transformer model for tasks such as speech enhancement and separation, which involve very long sequences. Another contribution of this paper is to improve the robustness of our model in reverberant environments. To do so, we incorporate an external dereverberation method, the weighted prediction error (WPE) [29], to preprocess the reverberated speech. The experiments show that this straightforward method can lead to a performance boost for reverberant speech.

Figure 1: End-to-end single-channel multi-speaker model in the 2-speaker case. The speaker-differentiating encoders (Encoder$_{\mathrm{SD}}$), recognition encoder (Encoder$_{\mathrm{Rec}}$), and decoder are either RNNs or Transformers.
Figure 2: End-to-end multi-channel multi-speaker model in the 2-speaker case. The masking network and end-to-end ASR network are based on either RNNs or Transformers.

2 End-to-End Multi-speaker ASR

In this section, we review the end-to-end speech recognition models for both the single-channel [21, 2] and multi-channel [3] tasks. For both tasks, we denote by $J$ the number of speakers in the input speech mixture.

2.1 Single-channel Multi-speaker ASR

In this subsection, we briefly introduce the end-to-end single-channel multi-speaker speech recognition model proposed in [21, 2], shown in Fig. 1. The model is an extension of the joint CTC/attention-based encoder-decoder framework [15] to recognize multi-speaker speech. The input is the single-channel mixed speech feature sequence $O$. In the encoder, the input feature is separated and encoded into hidden states for each speaker. The computation of the encoder can be divided into three submodules:

$H = \mathrm{Encoder}_{\mathrm{Mix}}(O)$   (1)
$H^{j} = \mathrm{Encoder}_{\mathrm{SD}}^{j}(H), \quad j = 1, \dots, J$   (2)
$G^{j} = \mathrm{Encoder}_{\mathrm{Rec}}(H^{j}), \quad j = 1, \dots, J$   (3)

Encoder$_{\mathrm{Mix}}$ first maps the input $O$ to a high-dimensional representation $H$. Then the speaker-differentiating encoders Encoder$_{\mathrm{SD}}^{j}$ extract each speaker's speech representation $H^{j}$. Finally, Encoder$_{\mathrm{Rec}}$ transforms each $H^{j}$ into the embedding sequence $G^{j}$, with subsampling. The attention-based decoder then takes these hidden representations to generate the corresponding output token sequences $Y^{j}$. For each embedding sequence $G^{j}$, the recognition process is formalized as follows:

$c_{n}^{j} = \mathrm{Attention}(e_{n-1}^{j}, G^{j})$   (4)
$e_{n}^{j} = \mathrm{Update}(e_{n-1}^{j}, c_{n-1}^{j}, y_{n-1}^{j})$   (5)
$y_{n}^{j} \sim \mathrm{Decoder}(c_{n}^{j}, y_{n-1}^{j})$   (6)

in which $c_{n}^{j}$ denotes the context vector and $e_{n}^{j}$ is the hidden state of the decoder at step $n$. To determine the permutation of the reference sequences $R^{j}$, permutation invariant training (PIT) is performed on the CTC loss right after the encoder [21, 2]:

$\hat{\pi} = \arg\min_{\pi \in \mathcal{P}} \sum_{j=1}^{J} \mathrm{Loss}_{\mathrm{ctc}}(Z^{j}, R^{\pi(j)})$   (7)

where $Z^{j}$ is the sequence obtained from $G^{j}$ by a linear transform to compute the label posterior distribution, $\mathcal{P}$ is the set of all permutations on $\{1, \dots, J\}$, and $\pi(j)$ is the $j$-th element of permutation $\pi$. The model is optimized with both CTC and cross-entropy losses:

$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{ctc}} + (1 - \lambda) \mathcal{L}_{\mathrm{att}}$   (8)

where $\lambda$ is an interpolation factor, and $\mathcal{L}_{\mathrm{att}}$ is the cross-entropy loss of the attention-based decoder.
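As an illustration of the permutation search in Eq. (7), here is a minimal PyTorch sketch that selects, for each utterance, the reference order minimizing the total per-stream CTC loss. The function name, tensor shapes, and the use of torch.nn.functional.ctc_loss are our own assumptions for the sketch, not the ESPnet implementation.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_ctc_permutation(log_probs, input_lens, refs, ref_lens):
    """Permutation invariant training over CTC losses, per utterance.

    log_probs: list of J tensors of shape (T, B, vocab), log-softmax applied.
    input_lens: tensor (B,) of encoder output lengths.
    refs: list of J padded reference tensors of shape (B, L).
    ref_lens: list of J tensors (B,) of reference lengths.
    """
    J = len(log_probs)
    batch = log_probs[0].size(1)
    # Pairwise CTC losses kept per utterance: loss[i][j] has shape (B,)
    loss = [[F.ctc_loss(log_probs[i], refs[j], input_lens, ref_lens[j],
                        blank=0, reduction="none", zero_infinity=True)
             for j in range(J)] for i in range(J)]
    best_loss = torch.full((batch,), float("inf"))
    best_perm = [None] * batch
    for perm in itertools.permutations(range(J)):
        # Total loss when output stream i is assigned to reference perm[i]
        total = sum(loss[i][p] for i, p in enumerate(perm))
        better = total < best_loss
        best_loss = torch.where(better, total, best_loss)
        for b in range(batch):
            if better[b]:
                best_perm[b] = perm
    return best_perm, best_loss
```

The selected permutation would then be used to align the references for both the CTC and the attention (cross-entropy) terms when computing the combined loss in (8).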

2.2 Multi-channel Multi-speaker ASR

In this subsection, we review the model architecture of the MIMO-Speech end-to-end multi-channel multi-speaker speech recognition system [3], shown in Fig. 2. The model takes as input the microphone-array signals from an arbitrary number $C$ of sensors. The model can be roughly divided into two modules, namely the frontend and the backend. The frontend is a mask-based multi-source neural beamformer. For simplicity of notation, we denote the noise as the $(J+1)$-th source in the mixture signals. First, the monaural masking network estimates the masks $M_{t,f,c}^{j}$ for every source $j$ on each channel $c$ from the complex STFT of the multi-channel mixture speech, $X_{t,f,c}$, where $t$ and $f$ represent the time and frequency indices, as follows:

$(M_{c}^{1}, \dots, M_{c}^{J+1}) = \mathrm{MaskNet}(X_{c})$   (9)

where $X_{c} = (X_{t,f,c})_{t,f}$ denotes the STFT of the $c$-th channel and $c = 1, \dots, C$. Second, the multi-source neural beamformer separates each source from the mixture based on the MVDR formulation [23]. The estimated masks of each source are used to compute the corresponding power spectral density (PSD) matrices $\Phi^{j}(f)$ for $j = 1, \dots, J+1$ [28, 11, 8]:

$\Phi^{j}(f) = \dfrac{\sum_{t=1}^{T} M_{t,f}^{j}\, \mathbf{x}_{t,f}\, \mathbf{x}_{t,f}^{\dagger}}{\sum_{t=1}^{T} M_{t,f}^{j}}$   (10)

where $M_{t,f}^{j}$ is the estimated mask of source $j$ averaged over channels, $\mathbf{x}_{t,f} = [X_{t,f,1}, \dots, X_{t,f,C}]^{\top} \in \mathbb{C}^{C}$, and ${}^{\dagger}$ represents the conjugate transpose. The time-invariant filter coefficients $\mathbf{w}^{j}(f)$ for each speaker are then computed from the PSD matrices:

$\mathbf{w}^{j}(f) = \dfrac{\big(\sum_{i \neq j} \Phi^{i}(f)\big)^{-1} \Phi^{j}(f)}{\mathrm{Tr}\big[\big(\sum_{i \neq j} \Phi^{i}(f)\big)^{-1} \Phi^{j}(f)\big]}\, \mathbf{u}^{j}$   (11)

where $\mathbf{w}^{j}(f) \in \mathbb{C}^{C}$, and $\mathbf{u}^{j}$ is a vector representing the reference microphone, derived from an attention mechanism [18]. The beamforming filters can be used to obtain the enhanced STFT signal $\hat{S}_{t,f}^{j}$ for speaker $j$, which is further processed to get the log mel-filterbank feature $O^{j}$ with global mean and variance normalization (GMVN):

$\hat{S}_{t,f}^{j} = \mathbf{w}^{j}(f)^{\dagger}\, \mathbf{x}_{t,f}$   (12)
$O^{j} = \mathrm{GMVN}\big(\mathrm{LogMel}(\hat{S}^{j})\big)$   (13)

where $\hat{S}^{j} = (\hat{S}_{t,f}^{j})_{t,f}$ is the short-time Fourier transform (STFT) of the enhanced signal for speaker $j$.

The backend ASR module maps the speech features $O^{j}$ of each speaker into the output token sequences $Y^{j}$. The speech recognition computation is very similar to the process for the single-channel case described in Sec. 2.1, except that the encoder is a single-path network and does not have to separate the input feature using Encoder$_{\mathrm{SD}}$.

Similar to the single-channel model, the permutation order of the reference sequences $R^{j}$ is determined by (7). The whole MIMO-Speech model is optimized with only the ASR loss in (8).
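To make the beamforming step concrete, the following is a minimal PyTorch sketch of Eqs. (10)-(12), assuming the channel-averaged masks have already been estimated by the masking network and using a fixed reference microphone instead of the attention-derived reference vector $\mathbf{u}^{j}$ of [18]; all function names are ours.

```python
import torch

def psd_matrix(mask, stft):
    """Eq. (10). mask: (T, F) real in [0, 1]; stft: (T, F, C) complex. Returns (F, C, C)."""
    x = stft.permute(1, 0, 2)                         # (F, T, C)
    m = mask.t().unsqueeze(-1)                        # (F, T, 1)
    num = torch.einsum("ftc,ftd->fcd", m * x, x.conj())
    return num / m.sum(dim=1, keepdim=True).clamp(min=1e-8)

def mvdr_filter(phi_target, phi_others, ref=0, eps=1e-8):
    """Eq. (11) with a one-hot reference vector. phi_*: (F, C, C). Returns (F, C)."""
    C = phi_target.size(-1)
    # (sum of interference/noise PSDs)^-1 @ target PSD, per frequency
    numerator = torch.linalg.solve(
        phi_others + eps * torch.eye(C, dtype=phi_others.dtype), phi_target)
    trace = torch.einsum("fcc->f", numerator)          # Tr[.] per frequency
    w = numerator / (trace[:, None, None] + eps)
    u = torch.zeros(C, dtype=phi_target.dtype)
    u[ref] = 1.0                                       # fixed reference microphone
    return torch.einsum("fcd,d->fc", w, u)

def apply_beamformer(w, stft):
    """Eq. (12). w: (F, C); stft: (T, F, C) complex. Returns enhanced STFT (T, F)."""
    return torch.einsum("fc,tfc->tf", w.conj(), stft)
```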

3 Transformer with Time-restricted Self-Attention

In this section, we describe one of the key components in the Transformer architecture, the multi-head self-attention [24], and the time-restricted modification [19] for its application in the masking network of the frontend.

Transformers employ dot-product self-attention for mapping a variable-length input sequence to another sequence of the same length, making them different from RNNs. The input consists of queries $Q$, keys $K$, and values $V$ of dimension $d_k$. The weights of the self-attention are obtained by computing the dot-product between the query and all keys and normalizing with a softmax. A scaling factor $\frac{1}{\sqrt{d_k}}$ is used to smooth the distribution:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\dfrac{Q K^{\top}}{\sqrt{d_k}}\Big) V$   (14)

To capture information from different representation subspaces, multi-head attention (MHA) is used by multiplying the original queries, keys, and values by different weight matrices:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}$   (15)
$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$   (16)

where $h$ is the number of heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable parameters.

In general, speech sequences can be very long, making self-attention computationally demanding. Moreover, for tasks like speech separation and enhancement, subsampling is not as practical as it is in speech recognition. Inspired by [16, 19], we restrict the self-attention of the Transformers in the masking network to a local segment of the speech, since neighboring frames have higher correlation. This time-restricted self-attention for the query $q_t$ at time step $t$ is formalized as:

$\mathrm{Attention}(q_t, \tilde{K}_t, \tilde{V}_t) = \mathrm{softmax}\Big(\dfrac{q_t \tilde{K}_t^{\top}}{\sqrt{d_k}}\Big) \tilde{V}_t$   (17)

where the corresponding restricted keys and values are $\tilde{K}_t = [k_{t-l}, \dots, k_{t+r}]$ and $\tilde{V}_t = [v_{t-l}, \dots, v_{t+r}]$, respectively, with $l$ and $r$ here denoting the left and right context window sizes.
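The restriction in Eq. (17) can be implemented with a band mask on the attention scores. The following is a minimal PyTorch sketch for a single head; a memory-efficient implementation would gather only the windowed keys and values instead of materializing the full $T \times T$ score matrix, so this is meant only to illustrate the restriction.

```python
import torch

def time_restricted_attention(q, k, v, left, right):
    """q, k, v: (T, d) for one head. Returns (T, d)."""
    T, d = q.shape
    scores = q @ k.t() / d ** 0.5                     # (T, T) full score matrix
    t = torch.arange(T)
    offset = t[None, :] - t[:, None]                  # key index minus query index
    outside = (offset < -left) | (offset > right)     # True = outside the local window
    scores = scores.masked_fill(outside, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v          # each query sees only [t-l, t+r]
```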

4 Experiments

The proposed methods were evaluated on the same dataset as in [3], referred to as the spatialized wsj1-2mix dataset, in which each utterance contains 2 speakers. The multi-channel speech signals were generated from the monaural wsj1-2mix speech used in [21, 2] with a spatialization toolkit (available at http://www.merl.com/demos/deep-clustering/spatialize_wsj0-mix.zip). The room impulse responses (RIRs) for the spatialization were randomly generated with an RIR generator script (available at https://github.com/ehabets/RIR-Generator), characterizing the room dimensions, speaker locations, and microphone geometry. The final spatialized dataset contains two different environment conditions, anechoic and reverberant. In the anechoic condition, the room is assumed to be anechoic and only the delays and decays due to the propagation are considered when generating the signals. In the reverberant condition, reverberation is also considered, with randomly drawn T60 reverberation times. In total, the spatialized corpus under each condition contains 98.5 hr, 1.3 hr, and 0.8 hr of speech in the training, development, and evaluation sets, respectively.

In the single-channel multi-speaker speech recognition task, we used the 1st channel of the training, development, and evaluation sets to train, validate, and evaluate our model, respectively. The input features are 80-dimensional log mel-filterbank coefficients with pitch features and their delta and delta-delta features. In the multi-channel multi-speaker speech recognition task, we also followed [3] in including the WSJ train_si284 data in the training set to improve performance. The model takes the raw waveform audio signal as input and converts it to its STFT using a 25 ms-long Hann window with a stride of 10 ms. The spectral feature dimension is 257 due to zero-padding. After the frontend computation, 80-dimensional log mel-filterbank features are extracted for each separated speech signal, and global mean-variance normalization is applied using the statistics of the single-speaker WSJ1 training set. All the multi-channel experiments were performed with 2 channels; however, the model can be extended to an arbitrary number of input channels as described in [18].
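As an illustration of this feature pipeline, here is a minimal sketch using torch/torchaudio. The 512-point FFT size is our assumption based on the 25 ms window and zero-padding at 16 kHz, and the normalization statistics are placeholders for the single-speaker WSJ statistics mentioned above.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000
N_FFT, WIN, HOP = 512, 400, 160          # assumed FFT size; 25 ms window, 10 ms hop

def mixture_stft(wav: torch.Tensor) -> torch.Tensor:
    """wav: (channels, samples). Returns complex STFT (channels, 257, frames)."""
    return torch.stft(wav, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                      window=torch.hann_window(WIN), return_complex=True)

# Maps a 257-bin power spectrogram onto 80 mel bands.
mel_scale = torchaudio.transforms.MelScale(
    n_mels=80, sample_rate=SAMPLE_RATE, n_stft=N_FFT // 2 + 1)

def log_mel_features(enhanced_stft: torch.Tensor,
                     mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    """enhanced_stft: complex (257, frames) for one separated source.
    mean, std: (80,) global statistics (placeholders here)."""
    power = enhanced_stft.abs() ** 2                   # (257, frames)
    feats = torch.log(mel_scale(power) + 1e-10).t()    # (frames, 80) log-mel
    return (feats - mean) / std                        # global mean-variance norm
```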

4.1 Experimental Setup

All the proposed end-to-end multi-speaker speech recognition models are implemented with the ESPnet framework [26] using the PyTorch backend. Some basic settings are shared by all the models. The interpolation factor $\lambda$ of the loss function in (8) is set to 0.2. The word-level language model [12] used during decoding was trained on the official text data included in the WSJ corpus. The configurations of the RNN-based models are the same as in [2] and [3] for the single-channel and multi-channel experiments, respectively.

In the Transformer-based multi-speaker encoder-decoder ASR model, there is a total of 12 layers in the encoder and 6 layers in the decoder, as in [14]. Before the Transformer encoder, the log mel-filterbank features are encoded by two CNN blocks, with 64 feature maps in the first block and 128 in the second. For the single-channel multi-speaker model in Sec. 2.1, Encoder$_{\mathrm{Mix}}$ corresponds to this CNN embedding layer, and the remaining Transformer layers are split between Encoder$_{\mathrm{SD}}$ and Encoder$_{\mathrm{Rec}}$. For all the tasks, the encoder-decoder layers share the same configuration of attention dimension, feed-forward inner dimension, and number of attention heads. The masking network in the frontend has 3 Transformer layers with a configuration similar to that of the encoder-decoder layers. Training of the Transformer models uses the Adam optimizer with the Noam learning rate schedule as in [24]. Note that the backend ASR module is initialized with a model pretrained using the ESPnet WSJ recipe and kept frozen for the first 15 epochs, for training stability.
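A minimal sketch of this optimizer setup is shown below; the model dimension, warmup length, and scale are illustrative, while the Adam hyperparameters follow the conventions of [24] and are not necessarily the exact values used in this work.

```python
import torch

def noam_lr(step: int, d_model: int = 256, warmup: int = 25000,
            scale: float = 1.0) -> float:
    """Noam schedule: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(80, 256)                       # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
# Training loop outline: call optimizer.step() then scheduler.step() each iteration.
```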

4.2 Performance in Anechoic Condition

We first provide in Table 1 the performance in the anechoic condition of the single-channel multi-speaker end-to-end ASR models, trained and evaluated on the original single-channel wsj1-2mix corpus used in [12, 2]. All the layers are randomly initialized. The results show that using the Transformer model leads to a 40.9% relative word error rate (WER) improvement on the evaluation set, decreasing from 20.43% to 12.08% compared with the RNN-based model in [2].

The multi-channel multi-speaker speech recognition performance on the spatialized anechoic wsj1-2mix dataset is shown in Table 2. The baseline multi-channel system is the RNN-based model from our previous study [3]. Before moving to the fully Transformer-based MIMO-Speech model, we first replace the RNNs with Transformers in the backend ASR only. Using Transformers for the ASR backend achieves a 20.5% relative improvement over the RNN-based model in the anechoic condition, reducing the evaluation WER from 8.62% to 6.85%.

We then also apply Transformers in the masking network of the frontend. Considering computational feasibility, in this preliminary study, the left and right context window sizes of the self-attention are kept small. The parameters of the frontend are randomly initialized. Compared with using a Transformer-based model only for the backend, the fully Transformer-based model leads to a further improvement, achieving a WER of 6.41% on the evaluation set. Compared against the whole-sequence information available to the RNN-based model, such a small context window greatly limits the power of our model, but these results show its potential. Overall, the proposed fully Transformer-based model achieves a 25.6% relative WER improvement against the RNN-based model in the multi-channel case. We also see that the multi-channel system is better than the single-channel system, thanks to the availability of spatial information.

Model dev eval
RNN-based 1-channel Model [2] 24.90 20.43
Transformer-based 1-channel Model 17.11 12.08
Table 1: Performance in terms of average WER [%] on the single-channel anechoic wsj1-2mix corpus.
Model dev eval
RNN-based MIMO-Speech [3] 13.54 8.62
  + Transformer backend 10.73 6.85
      ++ Transformer frontend 11.75 6.41
Table 2: Performance in terms of average WER [%] on the spatialized two-channel anechoic wsj1-2mix corpus.

4.3 Performance in Reverberant Condition

Even though our model performs very well in the anechoic condition, such ideal environments are rarely encountered in practice. It is thus crucial to investigate whether the model can be applied in more realistic environments. In this subsection, we describe preliminary efforts to process reverberant signals.

We first used straightforward multi-condition training by adding reverberant utterances to the training set. The results of multi-speaker speech recognition on the multi-channel reverberant dataset are shown in Table 3. It can be observed that using Transformers only for the backend already outperforms the RNN-based model. In addition, the fully Transformer-based model achieves a 13.2% relative WER improvement on the evaluation set, which is consistent with the anechoic case. However, compared with the numbers for the anechoic condition in Table 2, a large performance degradation can be observed.

To alleviate this, we turned to an existing external dereverberation method to preprocess the input signals as a simple yet effective solution. Nara-WPE [7] is a widely used open-source implementation of blind dereverberation of acoustic signals. The dereverberation is performed on the reverberant speech before it is added to the training dataset together with the anechoic data. Similarly, the reverberant test set is also preprocessed. Speech recognition performance on the multi-channel reverberant speech after Nara-WPE is shown in Table 4. In general, the WERs decrease dramatically with the dereverberation method. For the RNN-based model, the WER on the evaluation set decreased by 41.1% relative, from 29.99% to 17.67%. Similar to the experiments under the other conditions, the model with a Transformer backend only is better than the RNN-based baseline model on the reverberant evaluation set, by 13.8% relative WER. However, the Transformer-based frontend slightly degraded the performance. This may be due to the attention window size being too small, as it only covers a fraction of a second of speech. Note that our systems are not trained through Nara-WPE, which is left for future work.
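A minimal sketch of such offline WPE preprocessing with the nara_wpe package is shown below, following the package's documented usage; the STFT size and the taps/delay/iteration values are illustrative and not necessarily those used for the experiments in this paper.

```python
import numpy as np
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

def dereverberate(audio: np.ndarray, size: int = 512, shift: int = 128) -> np.ndarray:
    """audio: (channels, samples) reverberant multi-channel signal.
    Returns the dereverberated signal with the same shape."""
    Y = stft(audio, size=size, shift=shift)            # (channels, frames, freqs)
    Y = Y.transpose(2, 0, 1)                           # WPE expects (freqs, channels, frames)
    Z = wpe(Y, taps=10, delay=3, iterations=3)         # illustrative WPE parameters
    return istft(Z.transpose(1, 2, 0), size=size, shift=shift)
```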

Finally, Table 5 shows results on the single-channel task, using the 1st channel of the reverberant speech after Nara-WPE dereverberation. With the RNN-based model, the WER on the evaluation set remains high, at 28.21%, showing that reverberation still strongly affects performance even with dereverberation preprocessing. The Transformer-based model, however, reaches a final WER of 16.50%, a 41.5% relative reduction, suggesting that the Transformer-based model is more robust than the RNN-based model.

Model dev eval
RNN-based MIMO-Speech [3] 34.98 29.99
  + Transformer backend 32.95 28.01
      ++ Transformer frontend 31.93 26.02
Table 3: Performance in terms of average WER [%] on the spatialized two-channel reverberant wsj1-2mix corpus.
Model dev eval
RNN-based MIMO-Speech 24.45 17.67
  + Transformer backend 19.17 15.24
      ++ Transformer frontend 20.55 15.46
Table 4: Performance in terms of average WER [%] on the spatialized two-channel reverberant wsj1-2mix corpus after Nara-WPE.
Model dev eval
RNN-based 1-channel Model 31.21 28.21
Transformer-based 1-channel Model 20.44 16.50
Table 5: Performance in terms of average WER [%] on the 1st channel of the spatialized reverberant wsj1-2mix corpus after Nara-WPE.

5 Conclusion

In this paper, we applied Transformer models to end-to-end multi-speaker ASR in both the single-channel and multi-channel scenarios, and observed consistent improvements. The RNN-based ASR module is replaced with Transformers. To alleviate the prohibitive memory consumption of applying Transformers in the frontend, where the sequences are considerably long, we modified the self-attention in the Transformers of the masking network to use a local context window. Furthermore, by incorporating an external dereverberation method, we largely reduced the performance gap between the reverberant and anechoic conditions, and we hope to further reduce it in the future through tighter integration of the dereverberation within our model.

References

  • [1] X. Chang, Y. Qian, and D. Yu (2018) Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks. In Proc. ISCA Interspeech, Cited by: §1, §1.
  • [2] X. Chang, Y. Qian, K. Yu, and S. Watanabe (2019) End-to-end monaural multi-speaker ASR system without pretraining. In Proc. IEEE ICASSP, pp. 6256–6260. Cited by: §1, §2.1, §2, §4.1, §4.2, Table 1, §4.
  • [3] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe (2019) MIMO-Speech: end-to-end multi-channel multi-speaker speech recognition. arXiv preprint arXiv:1910.06522. Cited by: §1, §2.2, §2, §4.1, §4.2, Table 2, Table 3, §4, §4.
  • [4] E. C. Cherry (1953) Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America 25 (5), pp. 975–979. Cited by: §1.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, Cited by: §1.
  • [6] L. Drude and R. Haeb-Umbach (2017) Tight integration of spatial and spectral features for BSS with deep clustering embeddings. In Proc. ISCA Interspeech, pp. 2650–2654. Cited by: §1.
  • [7] L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach (2018) NARA-WPE: a Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing. In ITG Fachtagung Sprachkommunikation (ITG), Cited by: §4.3.
  • [8] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux (2016-09) Improved MVDR beamforming using single-channel mask prediction networks. In Proc. ISCA Interspeech, pp. 1981–1985. Cited by: §2.2.
  • [9] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe (2019) End-to-end neural speaker diarization with self-attention. arXiv preprint arXiv:1909.06247. Cited by: §1.
  • [10] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe (2016) Deep clustering: Discriminative embeddings for segmentation and separation. In Proc. IEEE ICASSP, pp. 31–35. Cited by: §1.
  • [11] J. Heymann, L. Drude, and R. Haeb-Umbach (2016-03) Neural network based spectral mask estimation for acoustic beamforming. In Proc. IEEE ICASSP, Cited by: §2.2.
  • [12] T. Hori, J. Cho, and S. Watanabe (2018-12) End-to-end speech recognition with word-based RNN language models. In Proc. IEEE SLT, pp. 389–396. Cited by: §4.1, §4.2.
  • [13] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al. (2019) A comparative study on Transformer vs RNN in speech applications. arXiv preprint arXiv:1909.06317. Cited by: §1.
  • [14] S. Karita, N. Enrique Yalta Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani (2019) Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proc. ISCA Interspeech, pp. 1408–1412. Cited by: §1, §4.1.
  • [15] S. Kim, T. Hori, and S. Watanabe (2017-03) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. IEEE ICASSP, pp. 4835–4839. Cited by: §1, §2.1.
  • [16] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proc. EMNLP, pp. 1412–1421. Cited by: §1, §3.
  • [17] T. Menne, I. Sklyar, R. Schlüter, and H. Ney (2019) Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech. In Proc. ISCA Interspeech, pp. 2638–2642. Cited by: §1.
  • [18] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey (2017) Multichannel end-to-end speech recognition. In Proc. ICML, pp. 2632–2641. Cited by: §2.2, §4.
  • [19] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur (2018) A time-restricted self-attention layer for ASR. In Proc. IEEE ICASSP, pp. 5874–5878. Cited by: §1, §3, §3.
  • [20] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • [21] H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey (2018-07) A purely end-to-end system for multi-speaker speech recognition. In Proc. ACL, Cited by: §1, §2.1, §2, §4.
  • [22] S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey (2018) End-to-end multi-speaker speech recognition. In Proc. IEEE ICASSP, pp. 4819–4823. Cited by: §1.
  • [23] M. Souden, J. Benesty, and S. Affes (2009) On optimal frequency-domain multichannel linear filtering for noise reduction. IEEE/ACM Trans. Audio, Speech, Language Process. 18 (2), pp. 260–276. Cited by: §2.2.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, Cited by: §1, §3, §4.1.
  • [25] Z. Wang, J. Le Roux, and J. R. Hershey (2018) Multi-Channel Deep Clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation. In Proc. IEEE ICASSP, pp. 1–5. Cited by: §1.
  • [26] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al. (2018) ESPnet: End-to-End Speech Processing Toolkit. In Proc. ISCA Interspeech, pp. 2207–2211. Cited by: §4.1.
  • [27] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva (2018) Recognizing overlapped speech in meetings: a multichannel separation approach using neural networks. In Proc. ISCA Interspeech, Cited by: §1.
  • [28] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi, et al. (2015-12) The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices. In Proc. IEEE ASRU, pp. 436–443. Cited by: §2.2.
  • [29] T. Yoshioka and T. Nakatani (2012) Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening. IEEE/ACM Trans. Audio, Speech, Language Process. 20 (10), pp. 2707–2720. Cited by: §1.
  • [30] D. Yu, X. Chang, and Y. Qian (2017) Recognizing multi-talker speech with permutation invariant training. In Proc. ISCA Interspeech, pp. 2456–2460. Cited by: §1.
  • [31] D. Yu, M. Kolbæk, Z. Tan, and J. Jensen (2017-03) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. IEEE ICASSP, Cited by: §1.