1 Introduction

Automatic speech recognition (ASR) systems find widespread use in applications such as human-machine interfaces, virtual assistants, and smart speakers, where the input speech is often reverberant and noisy. ASR performance has improved dramatically over the last decade with the help of deep learning models [yu2016automatic]. However, the degradation of these systems in the presence of noise and reverberation remains a challenging problem due to the low signal-to-noise ratio [hain2012transcribing]. For example, Peddinti et al. [peddinti2017low] report a relative increase in word error rate (WER) when signals from a far-field array microphone are used in place of those from headset microphones, both during training and testing. This degradation can be attributed primarily to reverberation artifacts [yoshioka2012making, kinoshita2013reverb]. Since most real-life far-field speech recordings are captured by a microphone array, the availability of multi-channel signals can be leveraged to alleviate these issues.
Previously, many works have focused on far-field speech recognition using multiple microphones [far1, far2, far3, far4]. The traditional approach to multi-channel far-field ASR combines all the available channels by beamforming [anguera2007acoustic] and then processes the resulting single-channel signal. Beamforming estimates the time delays between channels and boosts the signal by a weighted, delayed summation of the individual channels [wolfel2009distant, delcroix2015strategies]. This approach remains the most successful system for ASR in multi-channel reverberant environments [barker2018fifth].
In this paper, we propose an approach that avoids the beamforming step by directly processing the multi-channel features within the ASR framework. We propose a feature extraction step based on multivariate autoregressive (MAR) modeling, which exploits the joint correlation among the three dimensions of the signal: time, frequency, and channel. We also propose a novel neural network architecture for multi-channel ASR that embeds a network-in-network (NIN) within a 3-D convolutional neural network (CNN). With several ASR experiments conducted on the CHiME-3 [chime3] and REVERB Challenge [rev1, rev2] datasets, we show that the proposed approach to multi-channel feature extraction and acoustic modeling improves significantly over a baseline system using conventional beamformed audio with mel filter bank energy features.
The rest of the paper is organized as follows. The related prior works are discussed in Section 2. The details about the proposed 3-D features are provided in Section 3. Section 4 elaborates the proposed model architecture for multi-channel ASR. The ASR experiments and results are reported in Section 5, which is followed by a summary in Section 6.
2 Related Prior Work
While the original goal of beamforming [anguera2007acoustic] is signal enhancement, the beamforming cost can be modified to maximize the likelihood [seltzer2004likelihood]. With the advent of neural network based acoustic models, multi-channel acoustic models have also been explored. Recently, Swietojanski et al. [swietojanski2014convolutional] proposed feeding features from each channel of the multi-channel speech directly into a convolutional neural network based acoustic model; here, the neural network is seen as a replacement for the conventional beamformer. Joint training of a more explicit beamformer with the neural network acoustic model was proposed by Xiao et al. [xiao2016deep]. Neural networks that operate on the raw signals and are optimized for the discriminative cost function of the acoustic model have also been explored recently. These approaches are termed neural beamforming, as the neural network acoustic model subsumes the functionality of the beamformer [sainath2017multichannel, ochiai2017unified].
Previously, we explored the use of 3-D CNN models in [ganapathy], where the network was fed with the spectrogram features of all channels. Separately, a multi-band feature extraction using autoregressive modeling was proposed for deriving noise-robust features from single-channel speech [marsri, ganapathy2018far].

In this paper, we use multivariate autoregressive (MAR) modeling of the microphone array signals to derive 3-D features. We also extend the previous work on 3-D CNN models [ganapathy] with a new architecture that combines the multi-channel features in an NIN framework.
3 3-D MAR Features

Multivariate autoregressive (MAR) modeling was proposed to derive robust features in the joint time-frequency domain [marsri]. In this model, the discrete cosine transform (DCT) components of different frequency sub-bands are jointly expressed as a vector linear prediction process. The framework relies on frequency domain linear prediction (FDLP), in which linear prediction applied in the frequency domain estimates the temporal envelope of the signal [athineos2003frequency, sriphd]. We first review the mathematical model of MAR feature extraction and then propose its extension to multi-channel feature extraction.
3.1 Discrete Cosine Transform (DCT)
Let $x[n]$, $n = 0, 1, \ldots, N-1$, denote a discrete sequence. The DCT is given by
$$ y[k] = c[k] \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi (2n+1) k}{2N}\right), \qquad k = 0, 1, \ldots, N-1, $$
where $c[k] = \sqrt{1/N}$ for $k = 0$ and $c[k] = \sqrt{2/N}$ otherwise.
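For reference, the orthonormal DCT defined here is implemented by `scipy.fft.dct` with `norm="ortho"`; a quick numerical check of its unitarity (the test sequence is arbitrary):

```python
import numpy as np
from scipy.fft import dct, idct

# Orthonormal DCT-II of a short test sequence; norm="ortho" applies the
# scaling c[k] from the definition above, making the transform unitary.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = dct(x, type=2, norm="ortho")

# Unitarity: the inverse DCT recovers the sequence exactly.
x_rec = idct(y, type=2, norm="ortho")
assert np.allclose(x, x_rec)

# Parseval: energy is preserved by the orthonormal transform.
assert np.isclose(np.sum(x**2), np.sum(y**2))
```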
3.2 Frequency Domain Linear Prediction
FDLP is the frequency domain dual of time domain linear prediction (TDLP). Just as TDLP estimates the spectral envelope of a signal, FDLP estimates its temporal envelope, i.e., the square of its Hilbert envelope [analytic]. The temporal envelope is given by the inverse Fourier transform of the autocorrelation function of the DCT. We therefore use the autocorrelation of the DCT coefficients to predict the temporal envelope of the signal. An inherent property of linear prediction is that it approximates peaks well; the FDLP model thus preserves the peaks of the temporal envelope.
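The single-channel FDLP recipe can be sketched numerically: take the DCT, compute its autocorrelation, solve the Toeplitz normal equations for an all-pole model, and read the temporal envelope off the model's "spectrum". This is a simplified illustration without sub-band windowing; the function name, model order, and test signal are our own choices.

```python
import numpy as np
from scipy.fft import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def fdlp_envelope(x, order=20):
    """Estimate the temporal envelope of x by linear prediction on its
    DCT coefficients (FDLP): an all-pole fit in the DCT domain traces
    the squared Hilbert envelope along time."""
    n = len(x)
    c = dct(x, type=2, norm="ortho")
    # Autocorrelation of the DCT sequence up to the model order.
    r = np.array([c[: n - k] @ c[k:] for k in range(order + 1)]) / n
    # Normal equations (symmetric Toeplitz system) for the predictor.
    a = solve_toeplitz(r[:order], -r[1 : order + 1])
    a = np.concatenate(([1.0], a))
    g = r[0] + a[1:] @ r[1 : order + 1]  # prediction error power
    # All-pole model g / |A(e^{jw})|^2; the frequency axis of the
    # DCT-domain predictor corresponds to the time axis of the signal.
    _, h = freqz([1.0], a, worN=n)
    return g * np.abs(h) ** 2

# A tapered tone burst in the middle of a silent segment: the FDLP
# envelope should peak inside the burst.
x = np.zeros(1000)
x[400:600] = np.hanning(200) * np.sin(2 * np.pi * 0.05 * np.arange(200))
env = fdlp_envelope(x, order=20)
peak = int(np.argmax(env))
assert 400 <= peak < 600
```

Note the duality at work: linear prediction normally places spectral peaks at frequencies with high energy; applied to the DCT sequence, it places peaks at time instants with high energy.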
3.3 Multi-channel Feature Extraction

For each channel, we compute the DCT over a non-overlapping two-second window. The DCT signal is partitioned in the frequency domain by multiplying with window functions that are uniformly spaced on the mel scale and have a Gaussian shape [o1987speech]. Let $x_k[n]$ denote the $k$-th sub-band DCT of a channel. The corresponding sub-bands of all the channels are appended to form a vector
$$ \mathbf{y}_k[n] = \big[ x_k^{(1)}[n], \; x_k^{(2)}[n], \; \ldots, \; x_k^{(C)}[n] \big]^{T}, $$
where $x_k^{(1)}[n]$ denotes the windowed sub-band DCT from the first channel and $C$ is the number of available channels. We perform vector linear prediction on the signal $\mathbf{y}_k[n]$, which yields the multivariate autoregressive model of the signal.
3.4 Multivariate Autoregressive Modeling

The $C$-dimensional wide-sense stationary (WSS) vector process $\mathbf{y}[n]$ is said to be autoregressive of order $p$ [vaidyanathan2007theory] if it is generated by a recursive difference equation of the form
$$ \mathbf{y}[n] = -\sum_{i=1}^{p} \mathbf{A}_i \, \mathbf{y}[n-i] + \mathbf{e}[n], $$
where $\mathbf{e}[n]$ is a $C$-dimensional white noise random process with covariance matrix $\boldsymbol{\Lambda}$, and the MAR coefficients $\mathbf{A}_i$ are square matrices of size $C \times C$ which characterize the model [lutp].
We use the autocorrelation method to solve the normal equations for the model parameters [vaidyanathan2007theory]. The forward prediction polynomial is given by
$$ \mathbf{A}(z) = \mathbf{I} + \sum_{i=1}^{p} \mathbf{A}_i \, z^{-i}, $$
where $z$ denotes the complex (time domain) transform variable [kumaresan1999model]. The optimal predictor is obtained by minimizing the mean square prediction error, which yields the normal equations
$$ \sum_{i=1}^{p} \mathbf{A}_i \, \mathbf{R}[l-i] = -\mathbf{R}[l], \qquad l = 1, \ldots, p, $$
where $\mathbf{R}[l]$ is the autocorrelation matrix of the WSS process at lag $l$, given by
$$ \mathbf{R}[l] = \mathcal{E}\left[ \mathbf{y}[n] \, \mathbf{y}^{T}[n-l] \right]. $$
Here, $\mathcal{E}$ denotes the expectation operator and $\mathbf{y}^{T}$ represents the transpose of $\mathbf{y}$. The estimate of the error covariance matrix, which is Hermitian, is given by
$$ \boldsymbol{\Lambda} = \mathbf{R}[0] + \sum_{i=1}^{p} \mathbf{A}_i \, \mathbf{R}[-i]. $$
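The autocorrelation method for the vector case can be sketched in numpy, following the sign convention of the difference equation above: estimate the lagged autocorrelation matrices, solve the block-Toeplitz normal equations for the coefficient matrices, and form the error covariance. The function `mar_fit` and the synthetic order-1 recovery test are our own illustration, not from the paper.

```python
import numpy as np

def mar_fit(y, p):
    """Fit an order-p multivariate AR model y[n] = -sum_i A_i y[n-i] + e[n]
    by the autocorrelation method: solve sum_i A_i R[l-i] = -R[l] for
    l = 1..p, then form the error covariance Lambda."""
    N, M = y.shape
    # Biased autocorrelation matrices R[l] = E[y[n] y[n-l]^T], l = 0..p.
    R = [np.einsum('ni,nj->ij', y[l:], y[: N - l]) / N for l in range(p + 1)]
    Rm = lambda l: R[l] if l >= 0 else R[-l].T
    # Block-Toeplitz system: block (row i, column l) holds R[l-i].
    G = np.block([[Rm(l - i) for l in range(1, p + 1)] for i in range(1, p + 1)])
    Rhs = np.hstack([R[l] for l in range(1, p + 1)])
    A = -Rhs @ np.linalg.inv(G)              # stacked [A_1 ... A_p]
    As = [A[:, (i - 1) * M : i * M] for i in range(1, p + 1)]
    # Lambda = R[0] + sum_i A_i R[-i], with R[-i] = R[i]^T for a real process.
    Lam = R[0] + sum(As[i - 1] @ R[i].T for i in range(1, p + 1))
    return As, Lam

# Recover a known 2-dimensional order-1 model from simulated data.
rng = np.random.default_rng(0)
A1 = np.array([[-0.5, 0.1], [0.0, -0.3]])
y = np.zeros((20000, 2))
for n in range(1, 20000):
    y[n] = -A1 @ y[n - 1] + rng.standard_normal(2)
As, Lam = mar_fit(y, p=1)
assert np.allclose(As[0], A1, atol=0.05)
assert np.allclose(Lam, np.eye(2), atol=0.05)
```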
3.5 Envelope Estimation
The goal of performing linear prediction in our case is to estimate the temporal envelopes. Here, the input $\mathbf{y}_k[n]$ denotes the DCT coefficients, indexed by $n$, of the $k$-th sub-band from all channels. The corresponding Hilbert envelopes are estimated using MAR modeling. If $E_k[n]$ denotes the multi-dimensional Riesz envelope (the extension of the Hilbert envelope to 2-D signals) [riesz] of the multi-channel speech for sub-band $k$, then the MAR estimate of the Riesz envelope is given by
$$ \hat{E}_k[n] = \mathrm{diag}\big( \mathbf{A}^{-1}(z_n) \, \boldsymbol{\Lambda} \, \mathbf{A}^{-H}(z_n) \big), $$
where $z_n = e^{j \pi n / N}$, with $\mathbf{A}(z)$ the forward prediction polynomial and $\boldsymbol{\Lambda}$ the error covariance estimate. By estimating $\hat{E}_k[n]$ for each sub-band $k$, we reconstruct the temporal envelopes of all channels and all sub-bands. Re-arranging the sub-band envelopes gives the 3-D feature representation.
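Given coefficient matrices and an error covariance, the envelope estimate is a pointwise evaluation on the upper half of the unit circle; a minimal numpy sketch (the function name and the trivial order-0 check are ours):

```python
import numpy as np

def mar_envelopes(As, Lam, npts):
    """Per-channel temporal envelopes from MAR parameters: the diagonal of
    A(z)^{-1} Lam A(z)^{-H}, evaluated at z_n = exp(j*pi*n/npts)."""
    M = Lam.shape[0]
    env = np.empty((npts, M))
    for n in range(npts):
        z = np.exp(1j * np.pi * n / npts)
        # Forward prediction polynomial A(z) = I + sum_i A_i z^{-i}.
        Az = np.eye(M, dtype=complex)
        for i, Ai in enumerate(As, start=1):
            Az = Az + Ai * z ** (-i)
        H = np.linalg.inv(Az)
        env[n] = np.real(np.diag(H @ Lam @ H.conj().T))
    return env

# Sanity check: with no AR coefficients, A(z) = I, so each channel's
# envelope is flat at its noise variance.
env = mar_envelopes([], np.diag([2.0, 3.0]), npts=8)
assert env.shape == (8, 2)
assert np.allclose(env, np.tile([2.0, 3.0], (8, 1)))
```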
3.6 Gain normalization
In order to reduce the dynamic range of the envelopes, we normalize the magnitude of each envelope over the two-second computation window. This has the effect of suppressing additive noise artifacts [sriphd]. Note that gain normalization of the band energies is applied for the CHiME-3 dataset (which has additive noise), but not for the REVERB Challenge dataset.
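Gain normalization amounts to dividing the envelope by its overall level within the analysis window; a one-line sketch (the function name and the epsilon guard are ours):

```python
import numpy as np

def gain_normalize(env, eps=1e-12):
    """Divide a sub-band envelope by its mean over the two-second analysis
    window, removing the overall gain while keeping the envelope shape."""
    return env / (np.mean(env) + eps)

env = np.array([2.0, 4.0, 6.0])
out = gain_normalize(env)
assert np.allclose(out, [0.5, 1.0, 1.5])
```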
3.7 Multi-channel Feature Extraction for ASR using MAR

The block schematic of the proposed multi-channel feature extraction is shown in Figure 1. Long segments of speech from each channel (non-overlapping, 2 s duration) are transformed by the DCT. The full-band DCT is windowed into 40 overlapping sub-bands, and this data is fed to the MAR estimation block, where the model parameters are estimated for a fixed model order. The sub-band MAR envelopes are then integrated with a Hamming window of 25 ms duration at a 10 ms shift. This integration in time of the sub-band envelopes yields an estimate of the MAR spectrogram of the input speech signal.
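The final integration step alone can be sketched as follows; the 16 kHz sampling rate, function name, and default window parameters are assumptions for illustration:

```python
import numpy as np

def integrate_envelope(env, fs=16000, win_ms=25, shift_ms=10):
    """Integrate a sub-band temporal envelope (one value per sample) into
    frame-level band energies with a Hamming window (25 ms / 10 ms shift)."""
    wlen = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    win = np.hamming(wlen)
    frames = [env[s : s + wlen] @ win
              for s in range(0, len(env) - wlen + 1, shift)]
    return np.asarray(frames)

# A flat envelope over a 2 s window yields identical frame energies.
env = np.ones(2 * 16000)
f = integrate_envelope(env)
assert len(f) == 198
assert np.allclose(f, f[0])
```

Applying this to every sub-band envelope of every channel gives the frame-level 3-D (time x frequency x channel) representation used as input to the acoustic model.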
4 Model Architectures
The proposed 3-D CLSTM architecture is shown in Figure 2. The input data consists of 21 frames of 40 bands from all the channels, i.e., a 3-D tensor of size 21x40xC (time x frequency x channel) in the first layer. This is followed by a 2-D CNN layer with 128 kernels of size 1x3x3 in the second layer, then max-pooling and two 2-D CNN layers with 64 filters of kernel size 1x3x3. The output of the convolution layers is fed to an LSTM [lstm] which performs frequency recurrence, followed by a fully connected layer [mlp] that predicts the senone classes. Dropout [dropout] and batch normalization [batch] are used for regularization. Model training is performed using the PyTorch toolkit [pytorch].
In order to enhance the learning of the non-linearity in the filters of the 3-D CNN layer, we use the network-in-network (NIN) [nin] architecture. The NIN is applied in the first layer to learn the non-linearities present in the filters. The first layer thus performs the equivalent of neural beamforming, while the subsequent layers operate only on a 2-D representation.
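The channel-combining role of the first layer can be illustrated in isolation: the linear part of a kernel spanning all C channels reduces, per time-frequency bin, to a learned weighted sum over channels, analogous to filter-and-sum beamforming. A minimal numpy sketch (function name and weights are illustrative, not the trained model):

```python
import numpy as np

def channel_collapse(x, w):
    """First-layer 'neural beamforming': combine C channel feature maps
    (C x T x F) into one T x F map with learned per-channel weights --
    the linear part of a 3-D kernel that spans all channels."""
    return np.einsum('c,ctf->tf', w, x)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 21, 40))   # 5 channels, 21 frames, 40 bands
w = np.full(5, 1 / 5)                  # uniform weights ~ delay-and-sum
y = channel_collapse(x, w)
assert y.shape == (21, 40)
assert np.allclose(y, x.mean(axis=0))
```

In the actual architecture this combination is learned jointly with the rest of the network, and the NIN adds a non-linear mapping on top of it.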
The 2-D CLSTM architecture used for beamformed audio is a special case of the proposed 3-D CLSTM architecture, where the input is a 2-D spectrogram of size 21x40 and a standard 2-D convolution is performed in the initial layer. The rest of the network, starting from layer 2 of the model shown in Figure 2, is used unchanged for the 2-D CLSTM model on beamformed audio features.
Table 1 (excerpt): + Batchnorm, Adam | 8.4 | 11.3 | 16.0 | 17.7
5 Experiments and Results
The experiments are performed on the CHiME-3 and REVERB Challenge datasets. For the baseline model, multiple architectures are evaluated using beamformed FBANK features (40-band mel spectrogram with a frequency range of 200 Hz to 6500 Hz), as shown in Table 1. The 2-D CNN architecture gives a significant improvement over the DNN, and adding dropout improves performance further. Batch normalization and the Adam optimizer show a marginal improvement over the 2-D CNN model with dropout. Finally, we propose a CLSTM architecture with the LSTM recurring over frequency; this serves as the baseline for our experiments on the multi-channel data.
We also perform experiments with multi-band feature extraction [marsri] on the beamformed audio (BF-MB), using the 2-D CLSTM architecture.
5.1 CHiME-3 ASR
The CHiME-3 dataset [chime3] consists of recordings from a multi-microphone tablet device in four different environments: public transport (BUS), cafe (CAF), street junction (STR) and pedestrian area (PED). For each environment, both real and simulated data are present. The real data consists of multi-channel recordings of WSJ0 sentences spoken in the four environments, while the simulated data was constructed by mixing clean utterances with the environment noise. The training set contains both real and simulated noisy utterances, and the real development (Dev) and evaluation (Eval) sets consist of utterances from disjoint sets of speakers. Identically sized simulated Dev and Eval sets are also provided.
| 3-D CNN Config.        | Dev.               | Eval.              |
| 3-D kernels (2 layers) | 9.9 / 9.8 / 9.9    | 19.2 / 12.7 / 15.9 |
| 3-D kernels (1 layer)  | 10.1 / 10.5 / 10.3 | 19.2 / 14.0 / 16.6 |
| + NIN (1 hidden layer) | 10.2 / 10.3 / 10.3 | 19.8 / 13.6 / 16.7 |
The effect of different CNN configurations in the first two layers of the proposed 3-D CLSTM architecture is reported in Table 2. Although collapsing the channel dimension after the first layer (L1) reduces ASR performance compared to retaining it through the first two layers (L1+L2), the former becomes better once NIN and dropout are added. ASR performance also improves over the baseline (BF-FBANK) when BF-MB features are used, as shown in Table 3.
We compare the performance of the proposed 3-D features and acoustic model, denoted MC-MAR, with the baseline (Table 3). In the multi-channel experiments, 5-channel recordings are used and multi-channel features are extracted from them. All filter bank and MAR features are extracted with 40 bands, and the models are trained using the proposed 3-D CLSTM architecture.
The results for the multi-channel ASR experiments on the CHiME-3 dataset are shown in Tables 1, 2, 3 and 4. The proposed 3-D features and 3-D CLSTM model achieve a significant average relative improvement in WER on the CHiME-3 dataset. The 3-D CNN model on FBANK features (MC-FBANK) [ganapathy] also shows a marginal improvement over the beamformed FBANK (BF-FBANK) features.
| BUS | 12.0 (9.9) | 8.1 (9.1)   | 23.5 (20.3) | 9.3 (11.6)  |
| CAF | 8.7 (8.4)  | 11.9 (14.0) | 16.0 (16.0) | 13.9 (19.0) |
| PED | 7.1 (6.8)  | 8.0 (9.4)   | 18.0 (15.6) | 13.2 (18.1) |
| STR | 9.3 (8.6)  | 9.5 (11.6)  | 11.8 (12.1) | 14.3 (19.8) |
5.2 REVERB Challenge ASR
The REVERB Challenge dataset [rev3] for ASR consists of 8-channel recordings, with both real and simulated noisy speech. The simulated data comprises reverberant utterances generated from the WSJCAM0 corpus [rev1]: clean WSJCAM0 signals were artificially distorted by convolving them with measured room impulse responses (RIRs) and adding noise at an SNR of 20 dB, covering six different reverberation conditions. The real data, drawn from the MC-WSJ-AV corpus [rev2], consists of utterances spoken by human speakers in a noisy and reverberant room. The training set consists of 7861 utterances (92 speakers), generated from the clean WSJCAM0 training data by convolving the clean utterances with 24 measured RIRs and adding noise at an SNR of 20 dB. The development (Dev) and evaluation (Eval) sets contain 1663 (1484 simulated and 179 real) and 2548 (2176 simulated and 372 real) utterances, from 20 and 28 speakers respectively.
The results for the multi-channel ASR experiments on the REVERB Challenge dataset are shown in Table 5. The proposed 3-D features and 3-D CLSTM model provide a significant average relative improvement in WER over the BF-FBANK 2-D CLSTM baseline. The trends observed on the REVERB Challenge data are similar to those seen on the CHiME-3 dataset.
6 Summary

In this paper, we proposed a new framework for multi-channel feature extraction using MAR modeling in the frequency domain, along with a 3-D CNN model that performs neural beamforming. Speech recognition experiments were performed on the CHiME-3 and REVERB Challenge datasets. The main conclusion is that a multi-channel acoustic model can improve ASR performance on far-field speech. The analysis also highlights the incremental benefits of the various feature and model architecture combinations.