Robust multi-channel speech recognition is a challenging task because acoustic interference such as background noise and reverberation degrades the quality of the input speech. It is known that an automatic speech recognition (ASR) system trained on clean speech works poorly in noisy environments due to the mismatch in acoustic characteristics. For robust multi-channel ASR, recent studies usually employ a front-end component that applies a denoising algorithm such as the minimum variance distortionless response (MVDR) beamformer or a dereverberation algorithm such as the weighted prediction error (WPE) method [1, 9]. Even though these denoising and dereverberation methods have brought substantial improvements to ASR systems, they are usually designed for enhancing speech in stationary environments.
For handling more realistic acoustic environments, speech enhancement techniques based on deep neural networks (DNNs) have been developed; these require time-aligned parallel clean and noisy speech data for training. These techniques usually train the model to optimize signal-level criteria such as the signal-to-noise ratio (SNR), independently of speech recognition accuracy. To make the speech enhancement algorithm a more effective front-end for ASR, recent studies have proposed optimizing the speech enhancement model with the ASR objective within an end-to-end ASR scheme [6, 12, 10]. This training method allows non-parallel clean and noisy data to be used for training the front-end.
However, since the ASR objective only focuses on preserving the phonetic information of the input speech, it is insufficient for fully training the speech enhancement model and guaranteeing a generalized performance improvement. Moreover, conventional approaches do not take into account the distribution of the target clean speech signal on which the original ASR system is trained. To overcome these limitations, we propose a novel method that applies flow-based density estimation to the robust front-end using non-parallel clean and noisy speech. In the proposed method, a flow-based density estimator is trained with clean speech, and the front-end receives an additional generative loss from the density estimator. In other words, the front-end performs multi-task learning. The auxiliary objective induces the front-end to learn more regularized representations, which in turn improves the performance of the ASR module on the noisy CHiME-4 evaluation set.
Our main contributions are as follows:
We propose a novel approach that combines density estimation with multi-channel ASR to exploit the probability distribution of the speech signal for a robust front-end.
We present a new flow-based model, MelFlow, for estimating the probability distribution of mel-spectrograms.
We demonstrate that our multi-task learning strategy yields lower WERs on noisy speech than a scheme that depends only on the ASR objective.
2 Related Work
Denoising and dereverberation methods were originally designed to estimate the clean signal and can be applied without any training. These traditional methods, however, require a time-consuming iterative process and work well only in stationary environmental conditions. To handle more realistic acoustic environments, modern speech enhancement techniques usually employ DNNs to remove background noise and directly estimate the desired signal.
For multi-channel ASR, denoising and dereverberation techniques have been employed as front-ends and reported to produce some improvements in noisy speech recognition. However, the direct application of enhancement-based algorithms to ASR has some problems. One of the main problems is that ASR accuracy is not taken into consideration when training the front-end, so the resulting features may lack phonetic information. Another critical problem is that training conventional enhancement modules requires a parallel dataset (i.e., pairs of aligned clean and noisy speech signals). To alleviate these problems, recent approaches optimize the front-end and ASR models jointly using the ASR objective [6, 12].
3.1 Neural Beamforming Method
Filter-and-sum beamforming is a typical denoising technique for enhancing a multi-channel signal. In filter-and-sum beamforming, a speech image at the reference microphone is estimated by a linear filter operating as follows:

$$\hat{x}_{t,f} = \sum_{c=1}^{C} g_{t,f,c}\, x_{t,f,c} \qquad (1)$$

where $x_{t,f,c}$ is a short-time Fourier transform (STFT) coefficient, $g_{t,f,c}$ is a beamforming filter coefficient, and $\hat{x}_{t,f}$ is an estimated speech image. The subscripts denote the $c$-th channel of a signal at time-frequency bin $(t, f)$. While conventional methods optimize $g_{t,f,c}$ based on a signal-level objective, recent studies train it jointly within an ASR architecture [11, 12]. This kind of data-driven approach is called the neural beamforming method and can be classified into two categories: (i) the filter estimation approach and (ii) the mask estimation approach. The filter estimation approach estimates the time-variant filter coefficients $g_{t,f,c}$ directly but suffers from unstable training due to its high flexibility. On the other hand, the mask estimation approach optimizes time-invariant filter coefficients $g_{f,c}$ and has been reported to achieve improved performance in multi-channel speech recognition. Also, the mask estimation approach can be applied to any microphone configuration. Given these advantages, this paper focuses on the mask estimation approach.
Mask estimation approach.
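To obtain the time-invariant coefficients $g_{f,c}$, we first calculate a speech mask $m^{S}_{t,c}$. The input feature $\bar{x}_{t,c}$ is the aggregation of the amplitudes of the $c$-th channel's time-frequency bins along the frequency axis at time $t$:

$$\bar{x}_{t,c} = \{ |x_{t,f,c}| \}_{f=1}^{F} \qquad (2)$$

The speech mask is acquired from the input feature as follows:

$$\{ o^{S}_{t,c} \}_{t=1}^{T} = \mathrm{BiLSTM}^{S}\left( \{ \bar{x}_{t,c} \}_{t=1}^{T} \right) \qquad (3)$$

$$m^{S}_{t,c} = \mathrm{FClayer}^{S}\left( o^{S}_{t,c} \right) \qquad (4)$$

where BiLSTM is a real-valued bidirectional LSTM network, $o^{S}_{t,c}$ is the output of BiLSTM, and FClayer is a fully connected network that maps $o^{S}_{t,c}$ to an $F$-dimensional mask with elements $m^{S}_{t,f,c}$. A cross-channel power spectral density (PSD) matrix $\Phi^{S}_{f}$ for the speech signal can be obtained with the channel-averaged mask $m^{S}_{t,f} = \frac{1}{C} \sum_{c=1}^{C} m^{S}_{t,f,c}$ as follows:

$$\Phi^{S}_{f} = \frac{1}{\sum_{t=1}^{T} m^{S}_{t,f}} \sum_{t=1}^{T} m^{S}_{t,f}\, \mathbf{x}_{t,f} \mathbf{x}_{t,f}^{\dagger} \qquad (5)$$

where $\mathbf{x}_{t,f} = \{ x_{t,f,c} \}_{c=1}^{C}$ is the channel-wise concatenated vector of the STFT coefficients and $\dagger$ represents the Hermitian transpose. Using the same architecture with different parameters, another PSD matrix $\Phi^{N}_{f}$ for the noise signal is derived in the same way. Finally, the time-invariant linear filter coefficient $\mathbf{g}_{f} = \{ g_{f,c} \}_{c=1}^{C}$ is computed with the MVDR formulation as follows:

$$\mathbf{g}_{f} = \frac{ (\Phi^{N}_{f})^{-1} \Phi^{S}_{f} }{ \mathrm{Tr}\left( (\Phi^{N}_{f})^{-1} \Phi^{S}_{f} \right) }\, \mathbf{u} \qquad (6)$$

where $\mathrm{Tr}(\cdot)$ is the trace operator and $\mathbf{u}$ is the one-hot vector indicating the index of the reference microphone. Another network can be integrated to estimate the reference microphone when its index is not specified or not available.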
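As a rough illustration of the mask-based MVDR computation described above, the following NumPy sketch builds mask-weighted PSD matrices, forms the filter as the noise-inverse times speech PSD, normalized by its trace and multiplied by a one-hot reference vector, and applies it to a multi-channel STFT. The function names (`psd_matrix`, `mvdr_filter`, `apply_filter`), the diagonal loading term `eps`, and the choice to apply the conjugated filter are assumptions made for this example; the masks are taken as given rather than predicted by a BiLSTM.

```python
import numpy as np

def psd_matrix(stft, mask):
    """Mask-weighted cross-channel PSD.

    stft: complex array of shape (T, F, C).
    mask: real array of shape (T, F), assumed already averaged over channels.
    Returns a complex array of shape (F, C, C).
    """
    weighted = np.einsum("tf,tfc,tfd->fcd", mask, stft, stft.conj())
    norm = np.maximum(mask.sum(axis=0), 1e-10)  # guard against all-zero masks
    return weighted / norm[:, None, None]

def mvdr_filter(psd_speech, psd_noise, ref_channel=0, eps=1e-8):
    """Time-invariant MVDR filter of shape (F, C)."""
    num_channels = psd_speech.shape[-1]
    # Diagonal loading (an assumption of this sketch) keeps the solve well-conditioned.
    loaded = psd_noise + eps * np.eye(num_channels)
    numerator = np.linalg.solve(loaded, psd_speech)      # (F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)      # (F,)
    u = np.zeros(num_channels)
    u[ref_channel] = 1.0                                 # one-hot reference vector
    return (numerator @ u) / trace[:, None]

def apply_filter(stft, filt):
    """Filter-and-sum with the conjugated filter (convention assumed here)."""
    return np.einsum("fc,tfc->tf", filt.conj(), stft)

# Toy example: a single source observed through a fixed steering vector, no noise.
rng = np.random.default_rng(0)
T, F, C = 50, 4, 5
steering = rng.standard_normal((F, C)) + 1j * rng.standard_normal((F, C))
source = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
observed = source[:, :, None] * steering[None, :, :]

psd_s = psd_matrix(observed, np.ones((T, F)))
psd_n = np.tile(np.eye(C, dtype=complex), (F, 1, 1))   # white-noise PSD
filt = mvdr_filter(psd_s, psd_n, ref_channel=0)
enhanced = apply_filter(observed, filt)
```

In this noise-free toy case the output should match the speech image at the reference microphone, which is the distortionless property the MVDR formulation is designed to satisfy.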
ESPnet [20] is an end-to-end ASR system based on connectionist temporal classification (CTC) and an attention mechanism. ESPnet has an attention-based encoder-decoder structure and shares encoder representations to optimize both the CTC and attention-based cross-entropy objectives jointly. This joint multi-task learning framework is known to improve performance and achieve fast convergence [8]. ESPnet also incorporates the neural beamforming method as a pre-processor and optimizes the front-end within the end-to-end ASR scheme. For decoding, a weighted average of the attention-based and CTC scores is used to eliminate irregular alignments. We use ESPnet as the base ASR architecture and integrate a density estimator into the multi-channel ASR in the next section.
4 Proposed Model
Here, we propose to incorporate a flow-based density estimation task within the multi-channel end-to-end ASR. Our insight is that the ASR objective is insufficient for fully training the front-end since it only focuses on preserving the phonetic information. For the robust front-end, we now exploit the distribution of the target clean speech on which the original ASR system is trained.
We also present a novel flow-based generative model, MelFlow, for estimating the likelihood of mel-spectrograms. Many flow-based models for estimating the distribution of raw audio have been studied [15, 14], but they have not been applied to mel-spectrograms. We introduce a new flow-based model for mel-spectrograms and explain the architecture of MelFlow.
4.1 Flow-based Generative Model
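A flow-based model is a generative model that consists of a stack of invertible mappings from a simple distribution $p(z)$ to a complex distribution $p(x)$. Let $f_i$ be a mapping from $h_{i-1}$ to $h_i$, with $h_0 = x$ and $h_K = z$ ($h_i = f_i(h_{i-1})$ for $i = 1, \dots, K$). Then $x$ is transformed into $z$ through a chain of invertible mappings:

$$z = f_K \circ f_{K-1} \circ \cdots \circ f_1 (x) \qquad (8)$$

By the change of variables theorem, the log-likelihood of data $x$ is expressed as follows:

$$\log p(x) = \log p(z) + \sum_{i=1}^{K} \log \left| \det \frac{\partial h_i}{\partial h_{i-1}} \right| \qquad (9)$$

By maximizing $\log p(x)$, we obtain a density estimator of data $x$. A typical choice for the prior distribution $p(z)$ is a zero-mean isotropic multivariate Gaussian $\mathcal{N}(z; 0, I)$. If $z$ is obtained by Eq. (8), the first term in Eq. (9) can be calculated easily. However, it is too expensive to compute the second term directly. To reduce the computational complexity, each $f_i$ is required to have a tractable Jacobian. The affine coupling layer [2] satisfies such a requirement and is defined as follows:

$$y_a = x_a \qquad (10)$$

$$y_b = s(x_a) \odot x_b + t(x_a) \qquad (11)$$

where $x_a$ is the first half and $x_b$ is the last half of the layer input $x$, $s(\cdot)$ and $t(\cdot)$ are functions of $x_a$, and $\odot$ stands for the element-wise product. The Jacobian matrix of the affine coupling layer is a lower triangular matrix, and its log-determinant can be computed efficiently:

$$\log \left| \det \frac{\partial y}{\partial x} \right| = \sum_{j} \log \left| s_j \right| \qquad (12)$$

where $s_j$ is the $j$-th element of $s(x_a)$.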
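To make the affine coupling layer concrete, here is a minimal NumPy sketch of the forward pass, its inverse, and the resulting log-likelihood under a standard Gaussian prior for a single layer. The functions `scale_fn` and `shift_fn` are hypothetical stand-ins for $s(\cdot)$ and $t(\cdot)$ chosen only for illustration (in practice they would be neural networks), and the multiplicative parameterization with a log-absolute-scale determinant is an assumption of this example.

```python
import numpy as np

def scale_fn(xa):
    """Hypothetical s(.): strictly positive so the layer stays invertible."""
    return np.exp(np.tanh(xa))

def shift_fn(xa):
    """Hypothetical t(.)."""
    return 0.5 * xa

def coupling_forward(x, s_fn=scale_fn, t_fn=shift_fn):
    """Affine coupling on the last axis (its length must be even).

    Returns the transformed vector and log|det J|.
    """
    d = x.shape[-1] // 2
    xa, xb = x[..., :d], x[..., d:]
    s = s_fn(xa)
    yb = s * xb + t_fn(xa)                      # only the second half changes
    log_det = np.sum(np.log(np.abs(s)), axis=-1)
    return np.concatenate([xa, yb], axis=-1), log_det

def coupling_inverse(y, s_fn=scale_fn, t_fn=shift_fn):
    """Exact inverse: the first half is unchanged, so s and t can be recomputed."""
    d = y.shape[-1] // 2
    ya, yb = y[..., :d], y[..., d:]
    xb = (yb - t_fn(ya)) / s_fn(ya)
    return np.concatenate([ya, xb], axis=-1)

def log_likelihood(x, s_fn=scale_fn, t_fn=shift_fn):
    """log p(x) = log N(z; 0, I) + log|det J| for a single coupling layer."""
    z, log_det = coupling_forward(x, s_fn, t_fn)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=-1)
    return log_prior + log_det

x = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = coupling_forward(x)
x_reconstructed = coupling_inverse(y)
ll = log_likelihood(x)
```

Because the first half passes through unchanged, the determinant never requires differentiating through `scale_fn` or `shift_fn`, which is why these functions can be arbitrarily complex.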
We now turn to building a density estimator for mel-spectrograms. The input is a mel-spectrogram $x \in \mathbb{R}^{M \times T}$, where $M$ is the fixed number of mel-frequency bins and $T$ is a variable time length that depends on the utterance. It is split into $x_a$ and $x_b$ along the frequency axis, as depicted in Figure 2. Within each flow, $x_a$ and $x_b$ are treated differently: $x_a$ remains unchanged and $x_b$ is transformed into $x_b'$ as follows:

$$(s, t) = \mathrm{WaveNet2D}(x_a) \qquad (13)$$

$$x_a' = x_a \qquad (14)$$

$$x_b' = s \odot x_b + t \qquad (15)$$

where WaveNet2D can be any function, and the multiplicative term $s$ and the additive term $t$ depend only on $x_a$. In this work, we use multiple layers of dilated 2D convolutions with gated-tanh nonlinearities, residual connections and skip connections for WaveNet2D. WaveNet2D is similar to WaveNet [13], but differs in that it is composed of non-causal 2D convolutions. The Jacobian determinant in Eq. (12) is computed as follows:

$$\log \left| \det \frac{\partial (x_a', x_b')}{\partial (x_a, x_b)} \right| = \sum_{m, \tau} \log \left| s_{m,\tau} \right| \qquad (16)$$

MelFlow obtains a more flexible and expressive model by stacking multiple flow operations, as illustrated in Figure 3. Also, we swap the roles of $x_a$ and $x_b$ before each flow operation (i.e., $x_a$ and $x_b$ are transformed by Eq. (15) alternately).
We can further improve the model by exploiting target labels. Note that we use the density estimator to train the front-end and target labels are available during training. As shown in Figure 1, a target label is first embedded into a sequence of vectors through Character Embedding. A stack of 3 convolutional layers is applied to the sequence of vectors and the output is passed into a bidirectional LSTM. A label embedding is finally obtained by summing the last hidden state of the forward path and the backward path in the bidirectional LSTM:
Label Embedder can be considered a compact version of the encoder in [17]. The attention mechanism is not included because of GPU memory constraints, as most of the memory is occupied by the ASR module. We now reformulate Eq. (13) by adding the label embedding as a global condition to WaveNet2D as follows:
Considering that the time length $T$ varies across utterances, the generative loss (the negative log-likelihood, NLL) is defined as:

$$\mathcal{L}_{\mathrm{NLL}} = -\frac{1}{M T} \log p(x) \qquad (19)$$

where $M$ is the fixed frequency-bin length and $T$ varies with the input data $x$.
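The following NumPy sketch illustrates, under stated assumptions, how a stack of coupling flows over a mel-spectrogram with alternating halves and a length-normalized NLL might fit together. `toy_net` is a hypothetical stand-in for WaveNet2D (any function of the unchanged half is valid), the log-scale parameterization is chosen for numerical convenience, and dividing the NLL by the number of time-frequency bins is an assumption motivated by the variable time length; none of this is the actual MelFlow implementation.

```python
import numpy as np

def toy_net(xa, layer_index):
    """Hypothetical stand-in for WaveNet2D: returns (log_s, t) from xa only."""
    log_s = 0.1 * np.tanh(xa + layer_index)
    t = 0.5 * np.sin(xa)
    return log_s, t

def flow_forward(x, n_flows=4):
    """x: (M, T) array with even M. Returns latent z and total log|det J|."""
    half = x.shape[0] // 2
    xa, xb = x[:half], x[half:]          # split along the frequency axis
    log_det = 0.0
    for k in range(n_flows):
        log_s, t = toy_net(xa, k)
        xb = np.exp(log_s) * xb + t      # affine transform of one half
        log_det += log_s.sum()
        xa, xb = xb, xa                  # swap roles before the next flow
    return np.concatenate([xa, xb], axis=0), log_det

def flow_inverse(z, n_flows=4):
    """Undo flow_forward by reversing the swaps and affine transforms."""
    half = z.shape[0] // 2
    xa, xb = z[:half], z[half:]
    for k in reversed(range(n_flows)):
        xa, xb = xb, xa                  # undo the swap
        log_s, t = toy_net(xa, k)
        xb = (xb - t) * np.exp(-log_s)
    return np.concatenate([xa, xb], axis=0)

def normalized_nll(x, n_flows=4):
    """NLL under a standard Gaussian prior, divided by the number of bins."""
    z, log_det = flow_forward(x, n_flows)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    return -(log_prior + log_det) / x.size

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 7))       # 80 mel bins, 7 frames (toy length)
z, total_log_det = flow_forward(mel)
mel_reconstructed = flow_inverse(z)
loss = normalized_nll(mel)
```

Swapping the halves between flows means every bin is transformed at some point, and normalizing by `x.size` keeps losses comparable across utterances of different lengths.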
4.3 Joint Training with Density Estimation
The proposed model incorporates the density estimator to the base ASR architecture as shown in Figure 1.
We take advantage of non-parallel clean and noisy speech data by employing the density estimation task. Specifically, when a mini-batch contains clean speech, the ASR module and the density estimator are trained as usual. When a mini-batch contains noisy speech, one of two cases applies. In the first case, which occurs with a fixed probability, the ASR module receives a randomly chosen channel of the noisy speech. In the other case, the ASR module receives enhanced speech from the neural beamformer. The ASR module and the neural beamformer are trained in both cases, while the density estimator is used only for computing the generative loss. Algorithm 1 describes a joint training step with density estimation.
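The branching in this training step can be sketched as a small routing function. This is only an illustration of the control flow described above, not Algorithm 1 itself: the function name `plan_training_step`, the dictionary keys, and the parameter name `p_single` (the probability of the single-channel case) are all hypothetical.

```python
import random

def plan_training_step(is_clean, num_channels, p_single, rng):
    """Decide what the ASR module receives and which modules are updated.

    is_clean: True for a clean (single-channel) mini-batch.
    p_single: probability of feeding one randomly chosen noisy channel.
    rng: a random.Random instance, so the choice is reproducible.
    """
    if is_clean:
        # Clean speech trains the ASR module and the density estimator.
        return {"asr_input": "clean", "channel": None,
                "updated": ("asr", "density_estimator")}
    if rng.random() < p_single:
        # Case 1: a randomly chosen channel of the noisy speech.
        return {"asr_input": "single_channel",
                "channel": rng.randrange(num_channels),
                "updated": ("asr", "beamformer")}
    # Case 2: enhanced speech from the neural beamformer.
    return {"asr_input": "beamformed", "channel": None,
            "updated": ("asr", "beamformer")}

rng = random.Random(0)
clean_plan = plan_training_step(True, 5, 0.5, rng)
single_plan = plan_training_step(False, 5, 1.0, rng)
beam_plan = plan_training_step(False, 5, 0.0, rng)
```

For noisy batches the density estimator is absent from `updated`, reflecting that it only supplies the generative loss there.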
In order to evaluate the proposed method in a noisy speech scenario, we conducted a set of experiments using the CHiME-4 dataset.
CHiME-4 is a speech recognition dataset recorded with a multi-microphone tablet device in everyday noisy environments. The tablet is equipped with six microphones, five facing forward and one facing backward. In this work, we excluded the speech recorded by the backward-facing microphone; hence the number of channels was 5. CHiME-4 contains two types of data: (i) speech recorded in real noisy environments (i.e., on a bus, in a cafe, in a pedestrian area, and at a street junction), and (ii) simulated speech generated by artificially mixing clean speech with background noise. The dataset is divided into training, development and evaluation sets. The training set consists of 3 hours of real noisy utterances from 4 speakers and 15 hours of simulated noisy utterances from 83 speakers. The development set consists of 2.9 hours each of real and simulated noisy utterances from 4 speakers. The evaluation set consists of 2.2 hours of utterances each for the real and simulated noisy data.
We also employed Wall Street Journal (WSJ) read speech as a single-channel clean speech dataset. WSJ's si-284 set contains 82 hours of clean utterances and was used only for training the model.
5.2 Model Configurations
To compute 201 STFT coefficients (i.e., $F$ = 201), a 25 ms Hanning window with a 10 ms shift was used. We used a 3-layer bidirectional LSTM with 300 cells for BiLSTM in Eq. (3). Also, a linear projection layer with 300 units was inserted after every layer of the bidirectional LSTM. For FClayer in Eq. (4), a 1-layer linear transformation was used. To estimate the reference microphone, a 2-layer linear transformation was used with tanh activation. The reference microphone vector was finally estimated using the softmax function.
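The framing arithmetic behind these STFT settings can be checked with a short NumPy sketch. It assumes a 16 kHz sampling rate (the rate of the CHiME-4 recordings), under which a 25 ms window is 400 samples, a 10 ms shift is 160 samples, and the one-sided FFT of a 400-sample frame has 201 bins. The helper `stft` is a bare-bones illustration without padding, not the implementation used in the experiments.

```python
import numpy as np

SAMPLE_RATE = 16000                       # assumption: 16 kHz audio
WIN_LENGTH = SAMPLE_RATE * 25 // 1000     # 25 ms window -> 400 samples
HOP_LENGTH = SAMPLE_RATE * 10 // 1000     # 10 ms shift  -> 160 samples

def stft(signal, win_length=WIN_LENGTH, hop_length=HOP_LENGTH):
    """Minimal STFT without padding: returns (frames, win_length // 2 + 1)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=-1)   # one-sided spectrum

signal = np.random.default_rng(0).standard_normal(SAMPLE_RATE)  # 1 s of noise
spec = stft(signal)
amplitude = np.abs(spec)                  # amplitudes used as the mask-network input
```

The amplitudes of each channel computed this way correspond to the per-channel input feature of the mask estimation network.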
STFT coefficients were converted to mel-spectrograms by LogMel. The mel scale is primarily used to mimic the non-linear human perception of sound. In our experiments, the number of mel-frequency bins was 80.
We used a 16-dimensional character embedding. The kernel size of each 1D convolutional layer was set to 3, and both the input and output sizes were 16. ReLU activation and batch normalization were applied at the end of each convolutional layer. We stacked 3 convolutional layers. The hidden state size of the bidirectional LSTM was 256, and the last hidden states of the forward and backward paths were summed to obtain the label embedding, whose dimension was therefore 256.
We used MelFlow consisting of 8 affine coupling layers. For each WaveNet2D, the kernel sizes of the first and last convolutional layers were set to 1. The remaining layers (i.e., the middle 4 layers) had 20 channels and a kernel size of 3, and were used for the residual connections, skip connections and gated-tanh units. To condition on the label embedding globally, a fully connected layer was included in WaveNet2D. All the weights of the last convolutional layers in WaveNet2D were initialized to zero. This initialization is known to stabilize training.
For Encoder, a 4-layer 2D convolutional network and a 3-layer bidirectional LSTM with 1024 cells were used. The kernel sizes were set to be (3,3) for all layers in the convolutional network and channels were set to be (1, 64), (64, 64), (64, 128) and (128, 128), respectively. A linear projection layer with 1024 units was inserted after every layer of bidirectional LSTM in Encoder. To boost the ASR optimization, we adopted a joint CTC-attention loss function. For CTCnet, we used a 1-layer linear transformation with output dimension 52 indicating characters. For Decoder, a unidirectional LSTM with 1024 cells and a 1-layer linear transformation were used. To connect Encoder and Decoder, we leveraged the attention mechanism.
When the ASR module is trained with only the attention loss, it usually suffers from misalignment because the attention mechanism is too flexible to predict the right alignments. It has been reported that the CTC loss enforces monotonic alignments between speech and label sequences due to its left-to-right constraint [8]. Thus, the auxiliary CTC loss helps the attention model to find proper alignments and speeds up the whole training procedure. The CTC loss $\mathcal{L}_{\mathrm{CTC}}$ can be calculated efficiently with the forward-backward algorithm, and the attention loss $\mathcal{L}_{\mathrm{ATT}}$ is easily obtained with teacher forcing at Decoder. The joint CTC-attention objective is expressed as follows with a tuning parameter $\alpha$:

$$\mathcal{L}_{\mathrm{ASR}} = \alpha\, \mathcal{L}_{\mathrm{CTC}} + (1 - \alpha)\, \mathcal{L}_{\mathrm{ATT}} \qquad (20)$$

where we set $\alpha$ to 0.5 for the experiments.
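The total loss is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{ASR}} + \lambda\, \mathcal{L}_{\mathrm{NLL}} \qquad (21)$$

where $\mathcal{L}_{\mathrm{ASR}}$ is the joint CTC-attention objective, $\mathcal{L}_{\mathrm{NLL}}$ is the generative loss, and $\lambda$ is a hyperparameter. We experimented with different values of $\lambda$.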
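A small sketch of how the loss terms might be combined is given below. The joint CTC-attention interpolation follows the description above; placing the hyperparameter as a weight on the generative loss is an assumption of this example, and the function and parameter names (`joint_ctc_attention_loss`, `total_loss`, `gen_weight`) as well as the numeric loss values are hypothetical.

```python
def joint_ctc_attention_loss(loss_ctc, loss_att, alpha=0.5):
    """Interpolate the CTC and attention losses with a tuning parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_ctc + (1.0 - alpha) * loss_att

def total_loss(loss_asr, loss_nll, gen_weight):
    """ASR loss plus a weighted generative loss (weight placement is assumed)."""
    return loss_asr + gen_weight * loss_nll

# Made-up loss values, evaluated at the four weights used in the experiments.
loss_asr = joint_ctc_attention_loss(loss_ctc=2.0, loss_att=4.0)
totals = {w: total_loss(loss_asr, loss_nll=10.0, gen_weight=w)
          for w in (1.0, 0.25, 0.1, 0.01)}
```

With the tuning parameter at 0.5 the two ASR terms contribute equally, so the only remaining knob is how strongly the density estimator's loss influences the front-end.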
We used ESPnet as the baseline. The baseline does not have the Label Embedder and MelFlow components shown in Figure 1. All other configurations were the same as in the proposed model.
Table 1: WERs (%) on the CHiME-4 development and evaluation sets.
| Label condition | Model | Dev, simulated | Dev, real | Eval, simulated | Eval, real |
|---|---|---|---|---|---|
| without | Proposed Model (weight = 1) | 8.9 | 9.5 | 13.2 | 17.3 |
| without | Proposed Model (weight = 0.25) | 8.8 | 9.1 | 12.7 | 17.0 |
| without | Proposed Model (weight = 0.1) | 8.7 | 9.1 | 13.2 | 17.4 |
| without | Proposed Model (weight = 0.01) | 9.1 | 8.9 | 13.2 | 17.2 |
| with | Proposed Model (weight = 1) | 8.6 | 9.1 | 12.9 | 16.7 |
| with | Proposed Model (weight = 0.25) | 8.1 | 9.0 | 13.1 | 16.8 |
| with | Proposed Model (weight = 0.1) | 8.5 | 9.1 | 13.2 | 16.7 |
| with | Proposed Model (weight = 0.01) | 8.4 | 8.9 | 13.3 | 16.3 |
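Table 2: Speech enhancement scores on the evaluation set.
| Model | SDR (dB) | ESTOI | PESQ |
|---|---|---|---|
| Proposed Model (weight = 1) | 14.44 | 0.82 | 1.83 |
| Proposed Model (weight = 0.25) | 15.78 | 0.83 | 1.87 |
| Proposed Model (weight = 0.1) | 15.85 | 0.83 | 1.88 |
| Proposed Model (weight = 0.01) | 15.87 | 0.83 | 1.88 |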
We compared the noisy speech recognition performance of the baseline and the proposed model on the CHiME-4 dataset. The baseline was trained with only the ASR objective. We used two variants of the proposed model in the experiment: one with both Label Embedder and MelFlow, and the other with MelFlow only. Also, various values of the hyperparameter in Eq. (21) were used in the experiments: 1, 0.25, 0.1 and 0.01. Attention scores and CTC scores were averaged at a ratio of 7:3, and a beam search algorithm with a beam size of 20 was used for decoding. An RNN-based language model was also used to enhance the quality of speech recognition.
Word error rates (WERs) of the outputs of the different models are shown in Table 1. Overall, the proposed model without the label condition performed better than the baseline. At one weight setting, the proposed model outperformed the baseline with an absolute WER reduction of 0.9% on the simulated noisy data in the evaluation set. However, the improvement was not obvious on the real noisy data in either the development or the evaluation set. When the label condition was incorporated into the model, the overall performance improved significantly and, surprisingly, the WERs of the proposed model improved in all cases. With a weight of 0.01, the average WER on the real noisy data in the evaluation set reached 16.3%. The experiment demonstrates that the auxiliary objective from the density estimation task leads the front-end to learn more general representations, which in turn improves noisy speech recognition. The difference in performance between the models with and without the label condition suggests that an accurate density estimator is needed for the front-end to benefit more from the generative loss.
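For readers less familiar with the metric, WER is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below is a generic dynamic-programming implementation for a single utterance; the function name `word_error_rate` and the whitespace tokenization are assumptions of this example, and it is not the scoring tool used to produce Table 1.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

substitution_case = word_error_rate("the cat sat", "the cat sit")
deletion_case = word_error_rate("a b c", "a c")
```

An absolute reduction, as quoted above, is the plain difference between two such percentages rather than a relative change.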
One may ask whether the proposed model also improves speech enhancement. Unfortunately, the answer is no. Speech enhancement scores are shown in Table 2. We evaluated the signal-to-distortion ratio (SDR) [19], extended short-time objective intelligibility (ESTOI) [7], and perceptual evaluation of speech quality (PESQ) [16] between the enhanced speech and the reference speech in the evaluation set. The CHiME-4 dataset provides clean data recorded by a close-talk microphone, and we used this data as the reference speech. We used the proposed model with the label condition for the speech enhancement evaluation. The overall scores of the proposed model were almost the same as those of the baseline. This implies that, in the proposed model, the representation after the front-end is generalized and useful for the ASR module, but this does not necessarily translate into better speech enhancement metrics. Multi-task learning of speech enhancement and density estimation could be beneficial for the speech enhancement scores, and we leave it for future work.
In this work, we presented a novel method that employs flow-based density estimation for robust multi-channel ASR. We also proposed MelFlow to estimate the distribution of mel-spectrograms of clean speech. In the experiments, we demonstrated that the proposed model achieves a lower word error rate (WER) than the conventional ASR model on noisy multi-channel speech data. We verified that the auxiliary generative objective helps the front-end to learn more regularized representations, which lead to improvements in noisy speech recognition.
This work was supported by Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-IT1701-04.
[1] (2017) The third 'CHiME' speech separation and recognition challenge: analysis and outcomes. Computer Speech & Language 46, pp. 605–626.
[2] (2016) Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
[3] (2016) Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech, pp. 1981–1985.
[4] (1995) Speech recognition in noisy environments: a survey. Speech Communication 16 (3), pp. 261–291.
[5] (2015) Learning spectral mapping for speech dereverberation and denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (6), pp. 982–992.
[6] (2019) Joint optimization of neural network-based WPE dereverberation and acoustic model for robust online ASR. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6655–6659.
[7] (2016) An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), pp. 2009–2022.
[8] (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839.
[9] (2016) A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing 2016 (1), pp. 7.
[10] (2017) Neural network-based spectrum estimation for online WPE dereverberation. In Interspeech, pp. 384–388.
[11] (2017) Deep long short-term memory adaptive beamforming networks for multichannel robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 271–275.
[12] (2017) Multichannel end-to-end speech recognition. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2632–2641.
[13] (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
[14] (2018) ClariNet: parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.
[15] (2019) WaveGlow: a flow-based generative network for speech synthesis. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617–3621.
[16] (2001) Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), Vol. 2, pp. 749–752.
[17] (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783.
[18] (2013) The second 'CHiME' speech separation and recognition challenge: an overview of challenge systems and outcomes. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 162–167.
[19] (2006) Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.
[20] (2018) ESPnet: end-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
[21] (2009) Distant speech recognition. Wiley Online Library.