Improved Frequency Modulation Features for Multichannel Distant Speech Recognition

11/23/2018 ∙ by Isidoros Rodomagoulakis, et al. ∙ National Technical University of Athens

Frequency modulation features capture the fine structure of speech formants, which is beneficial and complementary to the traditional energy-based cepstral features. Improvements have been demonstrated mainly in GMM-HMM systems for small and large vocabulary tasks. Yet, they have seen limited application in DNN-HMM systems and Distant Speech Recognition (DSR) tasks. Herein, we elaborate on their integration within state-of-the-art front-end schemes that include post-processing of MFCCs, resulting in discriminant and speaker-adapted features of large temporal contexts. We explore 1) multichannel demodulation schemes for multi-microphone setups, 2) richer descriptors of frequency modulations, and 3) feature transformation and combination via hierarchical deep networks. We present results for tandem and hybrid recognition with GMM and DNN acoustic models, respectively. The improved modulation features are combined efficiently with MFCCs, yielding modest and consistent improvements in multichannel distant speech recognition tasks in reverberant and noisy environments, where recognition rates are far from human performance.


I Introduction

Modulation features stemming from the AM-FM speech model were originally conceived for ASR [1] as capturing the second-order non-linear structure of speech formants, providing complementary information to the traditional energy-based cepstral features (e.g., MFCCs and PLPs). Their fusion provides robustness in noisy and mismatched conditions, as indicated in recent works [2, 3]. However, only a few works [4, 5] examine their performance in DSR tasks with reverberation. Recently, bottleneck Multilayer Perceptrons (MLPs) have been proposed in [2] to combine frequency micro-modulation features with PLPs using the network's non-linear transformations instead of Linear Discriminant Analysis (LDA), which is suboptimal for non-Gaussian features. Following the tandem approach [6], improved and deeper nets were proposed in [7, 8], while hierarchical architectures [9] were beneficial for feature combination.

Deep Neural Networks (DNNs) have resulted in innovative ways to improve feature extraction and acoustic modeling in speech recognition [10]. Recently, end-to-end systems [11] have been developed that combine all recognition stages into Recurrent Neural Networks (RNNs) with long memory in order to transform unsegmented sequences of raw speech signals into sequences of phone labels, in many cases outperforming the hybrid DNN-HMM state-of-the-art systems. However, they require large amounts of data and processing capacity, while performance remains poor under the high levels of noise and reverberation present in many DSR scenarios [12, 13] commonly found in modern applications.

Although DNNs can learn many types of variation depending on the training data, they can be sensitive to data mismatches, while feature transformations learned in a data-driven way may not generalize well to out-of-domain data. Model adaptation with regularization mechanisms [14] and i-vector based adaptation [15] have been proposed for coping with unseen acoustic data. However, robust acoustic features are typically used to improve acoustic models when dealing with noisy and channel-degraded acoustic data. A comprehensive survey on robust feature extraction strategies and features for DNN-based recognition can be found in [16].

Multi-microphone setups with array processing [17] offer flexibility in multi-source and noisy acoustic scenes by capturing the spatial diversity of speech and non-speech sources, allowing more sophisticated front-ends with channel combination [18], beamforming [19] and speech enhancement [20], which have recently been revisited and solved with DNNs. However, the most significant improvements have been achieved with multi-style training on multichannel data [21, 22], while incorporating deep learning into traditional array processing methods is still under investigation.

Our goal in this work is to increase the robustness of frequency modulation features to noise and reverberation in order to combine them efficiently with standard MFCC-based front-ends for state-of-the-art speech recognition with GMM and DNN acoustic models. First, we propose a Multichannel, Multiband Demodulation (MMD) scheme that utilizes the noise diversity across microphone array signals, aiming at improved demodulation of speech resonances and more accurate estimates of instantaneous modulations [23]. Secondly, we explore richer representations of the estimated modulations, either by applying signal compression on the raw modulation signals, or by transforming mid-duration temporal contexts of their first-order statistics with hierarchical deep bottleneck networks, which combine non-linear transformation and fusion of heterogeneous features. Finally, we incorporate the proposed features, combined with MFCCs, in standard recognition recipes leveraging multi-style training and beamforming. Experiments are conducted on simulated and real data with strong background noise and reverberation.

Section II presents the proposed MMD approach with indicative results on the demodulation of speech phonemes; Section III describes the extraction of frequency modulation features along with the proposed hierarchical bottleneck DNN scheme; the experimental framework and the employed DSR corpora are described in Section IV, while Sections V and VI present the obtained results and the conclusions of the paper.

II Multichannel, Multiband Demodulation

The proposed demodulation scheme exploits the spatial diversity of noise exhibited across the recordings

x_m(t) = s(t) + n_m(t),  m = 1, …, M     (1)

of a microphone array capturing the clean source speech signal s(t) in the continuous time domain, where n_m(t) denotes the noise captured by the m-th of the M microphones. Note that reverberation effects and time alignment issues between the x_m(t) are not taken into account in the following analysis. The recordings can be decomposed into I frequency bands for the derivation of their bandpass components x_mi(t), i = 1, …, I, which correspond to speech resonances. The i-th resonance of the m-th recording can be modeled by an AM–FM signal as

x_mi(t) = a_mi(t) cos( ∫₀ᵗ ω_mi(τ) dτ + θ_mi )     (2)

where a_mi(t) and ω_mi(t) are its instantaneous amplitude and angular frequency signals. We can track the energy of the source that produced the signal via the Teager-Kaiser energy operator (TEO) [24]

Ψ[x(t)] = ẋ²(t) − x(t) ẍ(t)     (3)

where ẋ = dx/dt. The TEO is the basic ingredient of the Energy Separation Algorithm (ESA) [25] used to demodulate the bandpass speech signals into instantaneous amplitude and frequency components. The bandlimited speech components are obtained by decomposing each x_m(t) with a Mel-spaced Gabor filterbank g_i(t), i = 1, …, I:

x_mi(t) = (x_m ∗ g_i)(t)     (4)

where g_i(t) corresponds to the impulse response of the bandpass Gabor filter over band i. Given the correlated i-th bandpass signals x_ki(t) and x_li(t) from any two microphones k and l, their interaction can be described by the cross-Teager energy operator [26, 27]:

Ψc[x_ki(t), x_li(t)] = ẋ_ki(t) ẋ_li(t) − x_ki(t) ẍ_li(t)     (5)

which in general measures the relative rate of change between two oscillators. As discussed in [28] and [29], two useful properties of the operator can be derived:

1) On averaging, the noise contributes only as an additive error term to the Teager energy Ψ[s_i(t)] of the i-th resonance of the source signal:

E{Ψc[x_ki(t), x_li(t)]} ≈ Ψ[s_i(t)] + E{Ψc[n_ki(t), n_li(t)]}     (6)

The above stands assuming that the additive noise component n_mi(t) is a zero-mean, wide-sense stationary (WSS) Gaussian random process. Consequently, the cross energy formed by the microphone pair with the minimum average,

(k*, l*) = argmin_{(k,l)} E{Ψc[x_ki(t), x_li(t)]}     (7)

is expected to lie closer to Ψ[s_i(t)].

2) Instead of searching among all microphone pairs, which is computationally intensive (M(M−1)/2 cross-energy computations are needed for each band), it suffices to search between the microphones k1 and k2 having the 1st and 2nd smallest average Teager energies:

(k*, l*) = (k1, k2),  k1 = argmin_m E{Ψ[x_mi(t)]},  k2 = argmin_{m≠k1} E{Ψ[x_mi(t)]}     (8)
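
For illustration, the channel-pair selection of Eqs. (6)-(8) can be sketched with the standard discrete-time energy operators. This is a minimal NumPy sketch: the function names are our own, and the discrete operators only approximate their continuous counterparts above.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]**2 - x[n-1]*x[n+1]."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def cross_teager_energy(x, y):
    """Discrete cross-Teager energy between two equally long band signals."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] * y[1:-1] - x[:-2] * y[2:]
    return psi

def select_channel_pair(band_signals):
    """Pick the two channels with the smallest average Teager energies in a
    band (Eq. (8)), avoiding the search over all microphone pairs."""
    avg = [np.mean(teager_energy(x)) for x in band_signals]
    k1, k2 = np.argsort(avg)[:2]
    return int(k1), int(k2)
```

Given the bandpass signals of all microphones in band i, select_channel_pair() returns the pair whose cross energy approximates the denoised band energy of Eq. (7).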

Based on the above, we track Ψc[x_k*i(t), x_l*i(t)] in medium-duration frames in order to obtain an energy signal which is less affected by noise. Then, we modify the ESA so that, instead of using the single-channel energies Ψ[x_mi(t)], we estimate the instantaneous amplitudes and angular frequencies using the denoised cross-channel energies (the microphone indices are omitted hereafter as an indication of multi-microphone estimation):

ω_i(t) ≈ √( Ψc[ẋ_i(t)] / Ψc[x_i(t)] )     (9)

|a_i(t)| ≈ Ψc[x_i(t)] / √( Ψc[ẋ_i(t)] )     (10)

where Ψc[x_i(t)] denotes the selected cross-channel energy Ψc[x_k*i(t), x_l*i(t)] of band i.

For computational efficiency and smoother estimates, we compute the cross-Teager energies by incorporating the bandpass Gabor filtering within the cross-Teager operator, i.e., by transferring the time derivatives onto the Gabor impulse responses:

Ψc[x_i(t)] = (x_k* ∗ ġ_i)(t)(x_l* ∗ ġ_i)(t) − (x_k* ∗ g_i)(t)(x_l* ∗ g̈_i)(t)     (11)

Ψc[ẋ_i(t)] = (x_k* ∗ g̈_i)(t)(x_l* ∗ g̈_i)(t) − (x_k* ∗ ġ_i)(t)(x_l* ∗ g_i⁽³⁾)(t)     (12)

where ġ_i, g̈_i, and g_i⁽³⁾ denote the first, second, and third time derivatives of the Gabor impulse response g_i.
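
Given the selected pair, the modified ESA of Eqs. (9)-(10) can be sketched as follows. This is a NumPy sketch under simplifying assumptions: derivatives are approximated with np.gradient instead of the Gabor-filter derivatives of Eqs. (11)-(12), medium-duration energy tracking and median smoothing are omitted, and a small constant guards against division by zero.

```python
import numpy as np

def cross_teager(x, y, fs):
    """Continuous-form cross-Teager energy (Eq. (5)), with time derivatives
    approximated by central differences at sampling rate fs."""
    dx, dy = np.gradient(x) * fs, np.gradient(y) * fs
    ddy = np.gradient(dy) * fs
    return dx * dy - x * ddy

def cross_esa(xk, xl, fs):
    """Modified ESA (Eqs. (9)-(10)): estimate the instantaneous frequency (Hz)
    and amplitude envelope of one band from the cross-channel energies of the
    selected microphone pair (xk, xl)."""
    dxk, dxl = np.gradient(xk) * fs, np.gradient(xl) * fs
    psi_x = np.abs(cross_teager(xk, xl, fs)) + 1e-12    # denoised band energy
    psi_dx = np.abs(cross_teager(dxk, dxl, fs)) + 1e-12
    omega = np.sqrt(psi_dx / psi_x)        # rad/s, Eq. (9)
    amp = psi_x / np.sqrt(psi_dx)          # |a_i(t)|, Eq. (10)
    return omega / (2.0 * np.pi), amp
```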

II-A Analysis on the TIMIT corpus

Fig. 1: Relative reduction (%) of the frequency demodulation RMS error per phoneme category ((a) vowels, (b) nasals, (c) stops, (d) fricatives) achieved by the proposed MMD approach compared to single-channel Gabor-ESA demodulation.

The robustness of the proposed MMD method is tested on simulations of noisy and far-field speech obtained by distorting the TIMIT corpus. MMD is compared to single-channel Gabor-ESA demodulation [30] in terms of the frequency demodulation RMS error, which is computed across bands between the average instantaneous frequencies of the clean and noisy signals. Clean phonemes are convolved with room impulse responses simulated using the Image-Source Method (ISM) [1] to match the environment of 1) a living room, 2) a meeting room, and 3) a classroom. Randomly selected noises from the RWCP sound scene database [31] are added in order to simulate noisy domestic backgrounds with SNRs varying from 20 dB to -15 dB. The simulated microphone setup includes three microphones arranged in a 30-cm equidistant linear array located in the center of each room, where a moving source is assumed to form a small spiral trajectory three meters away from the array. Overall, 100 instances of each phoneme are simulated for each of the 21 SNR-room combinations, resulting in approximately 2k signals. As evidenced in Fig. 1, the relative improvements gained by using MMD increase as conditions get more difficult, especially for low SNR values, where it appears that the robustness of the Teager energy in low-frequency bands [32] benefits vowels the most, compared to nasals, stops, and fricatives, for which it offers more modest improvements.
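
For reference, the error metric used in this comparison can be sketched as follows; this is a minimal NumPy sketch in which the function names and array shapes are our own illustrative assumptions.

```python
import numpy as np

def freq_rms_error(f_clean, f_noisy):
    """RMS error between the clean and noisy mean instantaneous frequencies,
    arrays of shape (num_frames, num_bands)."""
    return np.sqrt(np.mean((f_clean - f_noisy) ** 2))

def relative_reduction(err_single_channel, err_mmd):
    """Relative reduction (%) of the demodulation error, as plotted in Fig. 1."""
    return 100.0 * (err_single_channel - err_mmd) / err_single_channel
```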

III Frequency Modulation Features

First-order statistics of the frequency micro-modulations have yielded improved results when combined with MFCCs in noisy LVCSR tasks [2]. Herein, the Mean Instantaneous Frequencies (MIF) are considered as the basic modulation features; they are the per-band averages of the instantaneous frequencies over short analysis frames computed at a fixed frame shift. MIFs are compared with richer descriptors for DNN acoustic modeling, such as the proposed Compressed Instantaneous Frequencies (CIF) and the bottleneck features derived from hierarchical deep networks, as described in the following paragraphs. The compared frequency modulation features are extracted after either single-channel demodulation or multichannel demodulation following the proposed MMD approach.
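
The MIF computation can be sketched in a few lines, assuming per-band instantaneous-frequency trajectories are already available from (multichannel) demodulation; frame_len and frame_shift are sample-domain parameters chosen by the user, not values quoted from the paper.

```python
import numpy as np

def mean_instantaneous_frequencies(inst_freqs, frame_len, frame_shift):
    """MIF features: per-band averages of the instantaneous frequencies over
    overlapping analysis frames. inst_freqs has shape (num_samples, num_bands);
    frame_len and frame_shift are given in samples."""
    num_samples = inst_freqs.shape[0]
    starts = range(0, num_samples - frame_len + 1, frame_shift)
    return np.array([inst_freqs[s:s + frame_len].mean(axis=0) for s in starts])
```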

III-A Compressed Instantaneous Frequencies (CIF)

As depicted in Fig. 2, the estimated instantaneous frequencies contain periodic patterns that can be described compactly by a few Discrete Cosine Transform (DCT) coefficients. The exact number of selected coefficients is a trade-off between the reconstruction error they achieve and their dimensionality relative to the capacity of the network into which they are fed for the extraction of bottleneck features over larger temporal contexts. An example of reconstructing the instantaneous frequencies in each band using 10 DCT coefficients is also depicted in Fig. 2. Generally speaking, modulations are expected to be stronger and noisier in higher bands, where the filters are wider.

Fig. 2: Instantaneous frequency modulations in six Mel-spaced bands of phoneme “ah” and their reconstructions (red thick lines) using 10 DCT coefficients.
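
The CIF computation can be sketched as follows (a SciPy-based sketch; the 10-coefficient setting follows the example above, while the function names and array shapes are our own).

```python
import numpy as np
from scipy.fft import dct, idct

def compressed_instantaneous_frequencies(frame, num_coeffs=10):
    """CIF features for one analysis frame: keep the first DCT coefficients of
    each band's instantaneous-frequency trajectory.
    frame: (frame_len, num_bands) -> returns a (num_bands * num_coeffs,) vector."""
    coeffs = dct(frame, type=2, norm="ortho", axis=0)[:num_coeffs]
    return coeffs.T.reshape(-1)

def reconstruct_frame(cif, frame_len, num_bands, num_coeffs=10):
    """Invert the compression for inspection, as in Fig. 2 (red thick lines)."""
    coeffs = np.zeros((frame_len, num_bands))
    coeffs[:num_coeffs] = cif.reshape(num_bands, num_coeffs).T
    return idct(coeffs, type=2, norm="ortho", axis=0)
```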

III-B Hierarchical Deep Bottleneck Features

The complementary MFCC and frequency modulation features are transformed and combined through a hierarchical network of bottleneck DNNs for the extraction of long-term deep features, which in turn are augmented with speaker-adapted features. As shown in Fig. 3, 9-frame temporal contexts of each feature set are first compressed through separate bottleneck networks. Subsequently, the activations of their bottleneck layers are concatenated and given in 9-frame vectors to the combination network, after reducing their dimensionality with Principal Component Analysis (PCA) retaining 95% of the total variability. The final feature vector is formed by augmenting the bottleneck features of the combination network with the initial MFCCs transformed using feature-space Maximum Likelihood Linear Regression (fMLLR).


Fig. 3: Extraction of 42 deep hierarchical bottleneck features after transforming and combining MFCCs with mean instantaneous modulation frequencies (MIFs), spanning contexts of approximately 800 ms.
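
A minimal tf.keras sketch of one bottleneck network of the hierarchy in Fig. 3 is given below. Hidden-layer counts and sizes are illustrative assumptions (only the 42-dimensional bottleneck follows the caption), and the PCA between the two levels as well as the fMLLR-MFCC augmentation are applied outside the network.

```python
import tensorflow as tf

def bottleneck_dnn(input_dim, num_senones, num_hidden=4, hidden_units=1024,
                   bottleneck_units=42):
    """One bottleneck DNN of the hierarchy: tanh hidden layers, a narrow linear
    bottleneck and a senone softmax output. The bottleneck activations serve as
    the extracted features; sizes are illustrative, not the paper's exact setup."""
    inp = tf.keras.Input(shape=(input_dim,))
    x = inp
    for _ in range(num_hidden):
        x = tf.keras.layers.Dense(hidden_units, activation="tanh")(x)
    bn = tf.keras.layers.Dense(bottleneck_units, name="bottleneck")(x)
    x = tf.keras.layers.Dense(hidden_units, activation="tanh")(bn)
    out = tf.keras.layers.Dense(num_senones, activation="softmax")(x)
    classifier = tf.keras.Model(inp, out)   # trained on senone alignments
    extractor = tf.keras.Model(inp, bn)     # used to extract bottleneck features
    return classifier, extractor
```

In this reading of Fig. 3, first-level networks of this form are trained separately on 9-frame splices of MFCCs and MIFs; their bottleneck outputs are spliced again, reduced with PCA (e.g., sklearn.decomposition.PCA(n_components=0.95)), and fed to a second, combination network of the same form, whose 42-dimensional bottleneck is finally appended to the fMLLR-transformed MFCCs.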

IV Experimental Framework

IV-A Multi-microphone DSR corpora

DIRHA-English corpus

The corpus [33] includes one-minute sequences simulating real-life scenarios of voice-based domestic control. Real far-field speech was recorded in a Kitchen-Livingroom space by 21 condenser microphones arranged in pairs and triplets on the walls and in pentagon arrays on the ceilings. 12 US and 12 UK English native speakers were recorded on WSJ, phonetically rich, and home automation sentences. Moreover, clean speech was recorded in a studio by the same speakers on the same material and convolved with the corresponding room impulse responses to produce simulated far-field speech. Overall, 1000 noisy and reverberant utterances of real (dirha-real) and simulated (dirha-sim) far-field multichannel speech were extracted from the sequences and used for experimentation. In our experiments, beamforming is applied on the six channels (LA1-LA6) of the pentagon ceiling array located in the livingroom.

AMI corpus

The proposed features are also tested on the three tasks of the AMI meeting corpus [34], which consists of 100 hours of meeting recordings captured, transcribed, and organized for DSR benchmarking according to three microphone setups: a) individual headset microphones (IHM), b) a single distant microphone (SDM), and c) multiple distant microphones (MDM). The three tasks offer the opportunity to test the robustness of the proposed features on various setups. For the MDM scenario, the eight channels of the 10 cm-radius circular table array are combined via beamforming. Overlapping speech segments are excluded from our experiments. Additionally, the employed trigram language model is trained only on the transcriptions of the train set, without the Fisher transcriptions that the standard Kaldi recipe additionally uses. We report results on the eval set.

CHIME-4 corpus

The CHiME-4 task [35] is a far-field speech recognition challenge for single- and multi-microphone tablet-device recordings in everyday scenarios under four noisy environments: street (STR), pedestrian area (PED), cafe (CAF), and bus (BUS). For training, 1600 utterances were recorded in the four environments from four speakers, and an additional 7138 noisy utterances were simulated from WSJ0 by adding noises from the four environments. The challenge setup consists of three tracks in which recognition is performed using one (1ch), two (2ch), or six (6ch) channels of the tablet array. Multichannel recognition (2ch, 6ch) is based on beamforming. We report results for the three tracks on the 2640 utterances of the evaluation set, which comprise 330 utterances for each combination of environment and real/simulated condition. Our recognition setup, as described in the following paragraphs, is based on the latest baseline Kaldi recipe, in which TDNN acoustic models are trained on beamformed signals, while no RNNLM rescoring is applied.

IV-B Feature extraction configuration

Fig. 4: Extraction and combination of MFCC-fMLLR [36] features with MIF (12-dimensional) and CIF (60-dimensional) frequency modulation features for GMM and DNN acoustic modeling.

Multiband speech demodulation is realized with Mel-spaced filterbanks of 12 and 6 Gabor filters for the extraction of 12 MIF and 60 CIF (10 DCT coefficients per filter) features per frame, respectively. For better formant localization, the filters are overlapped by 70% and 50%, respectively. The instantaneous frequencies are smoothed with a 7-sample median filter in order to eliminate possible singularities caused by instabilities of the Teager-Kaiser energy operator at small amplitude values. The features are mean and variance normalized to cope with long-term effects; standardization is applied per filter at the utterance level before extracting the frame-level features. Multichannel demodulation uses the same channels that are employed for beamforming according to the setup of each database. Finally, the modulation features are spliced in the same way as the MFCCs, and both sets are concatenated at the input of the employed networks. Note that the LDA and fMLLR transformations, where mentioned, are applied separately on top of the two feature streams, as depicted in Fig. 4.
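
The per-band post-processing described above can be sketched in a few lines; this is a NumPy/SciPy sketch in which the function name and array shape are our own conventions.

```python
import numpy as np
from scipy.signal import medfilt

def postprocess_instantaneous_freqs(inst_freqs):
    """Per-band post-processing before feature extraction: 7-sample median
    filtering to remove ESA singularities, then per-filter mean/variance
    normalization at the utterance level.
    inst_freqs: (num_samples, num_bands)."""
    smoothed = np.stack([medfilt(inst_freqs[:, b], kernel_size=7)
                         for b in range(inst_freqs.shape[1])], axis=1)
    mean = smoothed.mean(axis=0, keepdims=True)
    std = smoothed.std(axis=0, keepdims=True) + 1e-8
    return (smoothed - mean) / std
```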

IV-C Beamforming and data augmentation

Speech denoising is also used in the front-end stage, in which the available multichannel data are beamformed using the BeamformIt tool [37] according to the setup of each database, as described in Section IV-A. BeamformIt is a state-of-the-art delay-and-sum beamformer that is extensively used in multichannel DSR systems and supports blind reference-channel selection and a two-step time-delay-of-arrival (TDOA) Viterbi postprocessing. In the absence of sufficient training data for environments with distant microphones, a practical and widely used approach for acoustic modeling is to generate artificial training data by simulating the expected acoustic conditions of the target environment. The simulation process involves convolution of studio-quality speech with room impulse responses and noise addition at several SNR levels. We follow a slightly different approach for the DIRHA-English database: in order to increase robustness and reduce the training-testing mismatch, we generate beamformed signals for training, like the ones we intend to recognize. Thus, the ceiling-array recordings for beamforming are simulated using RIRs measured at various positions in the room.
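
For illustration, the core delay-and-sum idea behind such beamforming can be sketched as follows. This is not BeamformIt itself, which additionally performs channel weighting, blind reference selection and the two-step Viterbi TDOA smoothing; the sketch simply aligns channels by cross-correlation and averages them.

```python
import numpy as np

def delay_and_sum(channels, ref=0):
    """Minimal delay-and-sum beamformer over equal-length channel signals:
    align every channel to a reference channel with the integer delay that
    maximizes their cross-correlation, then average."""
    ref_sig = channels[ref]
    aligned = []
    for sig in channels:
        corr = np.correlate(sig, ref_sig, mode="full")
        delay = int(np.argmax(corr)) - (len(ref_sig) - 1)
        aligned.append(np.roll(sig, -delay))   # circular shift; adequate for a sketch
    return np.mean(aligned, axis=0)
```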

IV-D Recognition schemes

IV-D1 Baseline GMM-HMM System

A baseline GMM-HMM recognizer is built based on the standard Kaldi recipe. First, tied-state triphones are trained on 13 MFCCs with their first- and second-order derivatives, and then LDA, MLLT, and fMLLR transformations [36] are applied to train speaker-adapted models (tri6). Subspace Gaussian Mixture Models (SGMMs) are also developed, in which the universal background model (UBM) is trained on the tri6 GMMs. Regarding language modeling, trigrams are trained on the transcriptions of the training sets.

IV-D2 Tandem Recognition

A GMM-HMM system is trained on top of the deep bottleneck features extracted by the proposed hierarchical scheme of bottleneck DNNs, which is developed using TensorFlow. Each DNN consists of several hidden layers with tanh nonlinearities. The bottleneck layer has a reduced number of nodes, while the last hidden layer acts like the mixture components of the pdf modeled by the softmax layer, comparable to GMMs. Nine spliced frames are given to the input of the network, which is trained to classify each frame to one of the softmax-layer nodes corresponding to the senones of the baseline GMM-HMM system that provides the frame-state alignments. The network weights are trained layer-wise over 20 epochs following iterative stochastic gradient descent with minibatches of 256 vectors. To prevent overfitting and to adjust the learning rate, 10% of the training corpus (chosen randomly) is used as a cross-validation set.
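
The training configuration can be sketched with tf.keras, reusing the hypothetical bottleneck_dnn() helper from Section III-B; layer-wise pre-training and learning-rate scheduling are omitted, and spliced_feats / alignments are placeholder arrays standing in for the spliced features and senone labels.

```python
import numpy as np
import tensorflow as tf

# Placeholder training data (hypothetical files): 9-frame splices and the
# senone labels obtained from the GMM-HMM frame-state alignments.
spliced_feats = np.load("spliced_feats.npy")      # (num_frames, input_dim)
alignments = np.load("alignments.npy")            # (num_frames,)

# Shuffle first so the held-out 10% cross-validation split is random.
perm = np.random.permutation(len(spliced_feats))
spliced_feats, alignments = spliced_feats[perm], alignments[perm]

classifier, extractor = bottleneck_dnn(input_dim=spliced_feats.shape[1],
                                       num_senones=int(alignments.max()) + 1)
classifier.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                   loss="sparse_categorical_crossentropy")
# Mini-batch SGD for 20 epochs with minibatches of 256 vectors.
classifier.fit(spliced_feats, alignments,
               batch_size=256, epochs=20, validation_split=0.1)
```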

IV-D3 Hybrid Recognition

Neural networks are trained to provide pseudo log-likelihood scores for HMM decoding. Herein, DNNs [38] and Time Delay Neural Networks (TDNNs) [39] are considered on spliced frames of MFCCs appended with the modulation features. Essentially, their first layers act as feature transformation and fusion units on the combined feature sets, similarly to the already described bottleneck networks. DNNs with six fully-connected hidden layers of sigmoid nonlinearities are trained on mini-batches of 9-frame splices of fMLLR-transformed MFCCs appended with the modulation features. Training is realized in three stages: 1) DBN pre-training, 2) frame cross-entropy training, and 3) sequence training optimizing the state-level Minimum Bayes Risk (sMBR) criterion. The developed TDNNs, capable of capturing long-term interactions between speech and corrupting sources in reverberant environments, consist of five time-delay layers modeling multi-scale temporal contexts around the current frame. Their input features are 11-frame splices of 40-dimensional hi-resolution MFCCs appended with 100-dimensional i-vectors extracted using a 512-Gaussian UBM. The training data are augmented by applying 3-way speed perturbation and additional rate perturbations with uniformly sampled factors.
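
For illustration, a TDNN of this kind can be approximated with dilated 1-D convolutions. The sketch below is not the Kaldi recipe; layer widths and dilations are assumptions, and only the 40+100-dimensional per-frame input follows the text.

```python
import tensorflow as tf

def tdnn_model(num_senones, feat_dim=140):
    """Illustrative TDNN acoustic model: time-delay layers realized as dilated
    1-D convolutions over the sequence of 40 hi-resolution MFCCs appended with
    a 100-dimensional i-vector per frame."""
    inp = tf.keras.Input(shape=(None, feat_dim))       # (time, features)
    x = inp
    for dilation in (1, 1, 2, 3, 3):                   # widening temporal context
        x = tf.keras.layers.Conv1D(512, kernel_size=3, dilation_rate=dilation,
                                   padding="same", activation="relu")(x)
    out = tf.keras.layers.Dense(num_senones, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```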

V Results

DIRHA MFCC +MIF +MIF_mmd MFCC_dsb +MIF_dsb
dirha-sim 62.9 47.7 45.1 36.8 37.2
dirha-real 67.9 52.9 51.6 40.5 38.8
average 65.4 50.3 48.4 38.7 38.0
TABLE I: GMM-HMM recognition WERs (%) with triphones on combinations (“+”) of MFCC and modulation features extracted after MMD (“_mmd”) or beamforming (“_dsb”).
DIRHA features mono tri -LDA-MLLT -fMLLR
dirha-sim MFCC_dsb 57.8 36.8 31.8 24.3
+ MIF_dsb 54.7 37.2 32.8 26.6
dirha-real MFCC_dsb 61.5 40.5 30.9 29.5
+ MIF_dsb 52.3 38.8 33.4 31.2
average MFCC_dsb 59.7 38.7 31.4 26.9
+ MIF_dsb 53.5 38.0 33.1 28.9
TABLE II: GMM-HMM recognition WERs (%) after MLLR-based Speaker Adaptive Training (SAT).
DIRHA MFCC_dsb +MIF_dsb +MIF_mmd +CIF_mmd
dirha-sim 23.4 22.8 22.3 21.6
dirha-real 29.1 28.8 28.5 27.8
average 26.25 25.8 25.4 24.7
TABLE III: Tandem recognition WERs (%) with Subspace GMMs on hierarchical DNN bottleneck features appended with MLLR-transformed MFCCs.
DIRHA MFCC_dsb-fmllr MIF_dsb MIF_mmd +CIF_mmd
dirha-sim 19.0 18.8 18.3 18.0
dirha-real 25.0 24.6 24.3 23.9
average 22.0 21.7 21.3 20.9
TABLE IV: Hybrid recognition WERs (%) using DNN acoustic models trained on multiple-frame combined features
AMI MFCC_dsb+ivector +MIF_dsb +MIF_mmd +CIF_mmd
IHM 25.7 25.8 25.8 25.6
SDM 50.1 48.2 48.2 46.8
MDM 43.9 41.1 40.9 40.3
TABLE V: WERs (%) on AMI corpus using xent-regularized TDNN with cleaned data and separate alignments per task.
CHIME-4 MFCC_dsb+ivector +MIF_dsb +MIF_mmd +CIF_mmd
Track sim real sim real sim real sim real
1ch 16.6 16.4 15.9 16.3 15.9 16.3 15.5 15.8
2ch 13.2 13.5 12.9 13.3 13.1 13.4 12.3 12.9
6ch 10.3 9.7 10.1 9.4 9.8 9.2 9.3 9.1
TABLE VI: WERs (%) on CHIME-4 sim/real test sets following the baseline Kaldi recipe for TDNNs on delay-and-sum beamformed signals without using RNNLM rescoring.

We evaluate the combined feature sets in three pipelines: 1) GMM-HMM recognition with triphones and speaker adaptive training, 2) tandem recognition with subspace GMMs on hierarchical deep bottleneck features, and 3) hybrid recognition with DNN and TDNN acoustic models. Baseline recognition results on the DIRHA-English corpus are presented in Table I, where it is evident that MIFs benefit MFCCs mostly a) when the features are extracted from a single channel, b) after using multichannel demodulation, and c) after beamforming, which yields the lowest WER. On the other hand, as shown in Table II, the linear transformations (LDA, MLLT and fMLLR) deteriorate the performance of the combined features because, unlike MFCCs, they are not uncorrelated and Gaussian-distributed. However, better combinations are accomplished with DNN-based non-linear transformations. As shown in Table III, hierarchical deep bottleneck features with subspace GMMs yield significantly better results than the SAT system. Additionally, the contribution of the modulation features increases when multichannel demodulation is applied instead of beamforming. In the hybrid recognition results of Table IV, the proposed features achieve modest improvements over MFCC-fMLLR for DNNs. Accordingly, they also benefit the hi-resolution MFCCs with i-vectors used for TDNNs, yielding relative improvements of up to 15% over the baseline Kaldi recipes, as Tables V and VI show. DSR performance is improved without degradation on clean speech, as indicated by the results on the AMI IHM task.

VI Conclusion

A new approach is presented for robust demodulation of the frequency micro-modulations of speech, based on multichannel speech energy tracking over the signals of a microphone array. Better estimates of the instantaneous frequencies enable the extraction of improved modulation features, which are combined efficiently with standard feature sets in state-of-the-art recognition setups. Modest and consistent improvements are achieved on three challenging DSR corpora.

References

  • [1] D. Dimitriadis, P. Maragos, and A. Potamianos, “Robust AM-FM features for speech recognition,” IEEE Signal Processing Letters, 2005.
  • [2] D. Dimitriadis and E. Bocchieri, “Use of micro-modulation features in large vocabulary continuous speech recognition tasks,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 8, pp. 1348–1357, 2015.
  • [3] V. Mitra, W. Wang, H. Franco, Y. Lei, C. Bartels, and M. Graciarena, “Evaluating robust features on deep neural networks for speech recognition in noisy and channel mismatched conditions.” in Proc. Int. Conf. on Speech Communication and Technology (Interspeech), 2014, pp. 895–899.
  • [4] I. Rodomagoulakis, G. Potamianos, and P. Maragos, “Advances in large vocabulary continuous speech recognition in Greek: Modeling and nonlinear features,” in Proc. European Signal Processing Conf. (EUSIPCO), 2013, pp. 1–5.
  • [5] V. Mitra, J. Van Hout, W. Wang, M. Graciarena, M. McLaren, H. Franco, and D. Vergyri, “Improving robustness against reverberation for automatic speech recognition,” in Proc. IEEE Workshop Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 525–532.
  • [6] H. Hermansky, D. P. Ellis, and S. Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), vol. 3, 2000, pp. 1635–1638.
  • [7] T. N. Sainath, B. Kingsbury, and B. Ramabhadran, “Auto-encoder bottleneck features using deep belief networks,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2012, pp. 4153–4156.
  • [8] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained deep neural networks.” in Proc. Int. Conf. on Speech Communication and Technology (Interspeech), vol. 237, 2011, p. 240.
  • [9] Z. Tüske, R. Schlüter, and H. Ney, “Deep hierarchical bottleneck MRASTA features for LVCSR,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2013, pp. 6970–6974.
  • [10] D. Yu and L. Deng, Automatic speech recognition: A deep learning approach.   Springer, 2014.
  • [11] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [12] K. Kinoshita, M. Delcroix, S. Gannot, E. A. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj et al., “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP Journal on Advances in Signal Processing, vol. 2016, no. 1, p. 7, 2016.
  • [13] M. Harper, “The automatic speech recognition in reverberant environments (ASpIRE) challenge,” in Proc. IEEE Workshop Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 547–554.
  • [14] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2013, pp. 7893–7897.
  • [15] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors.” in Proc. IEEE Workshop Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 55–59.
  • [16] V. Mitra, H. Franco, R. M. Stern, J. Van Hout, L. Ferrer, M. Graciarena, W. Wang, D. Vergyri, A. Alwan, and J. H. Hansen, “Robust features in deep-learning-based speech recognition,” in New Era for Robust Speech Recognition.   Springer, 2017, pp. 187–217.
  • [17] M. Brandstein and D. Ward, Microphone arrays: signal processing techniques and applications.   Springer Science & Business Media, 2013.
  • [18] Y. Liu, P. Zhang, and T. Hain, “Using neural network front-ends on far field multiple microphones based speech recognition,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2014, pp. 5542–5546.
  • [19] X. Xiao, S. Watanabe, H. Erdogan, L. Lu, J. Hershey, M. L. Seltzer, G. Chen, Y. Zhang, M. Mandel, and D. Yu, “Deep beamforming networks for multi-channel speech recognition,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2016, pp. 5745–5749.
  • [20] M. Delcroix, T. Yoshioka, N. Ito, A. Ogawa, K. Kinoshita, M. Fujimoto, T. Higuchi, S. Araki, and T. Nakatani, “Multichannel speech enhancement approaches to DNN-based far-field speech recognition,” in New Era for Robust Speech Recognition.   Springer, 2017, pp. 21–49.
  • [21] P. Swietojanski, A. Ghoshal, and S. Renals, “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” in Proc. IEEE Workshop Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 285–290.
  • [22] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2017, pp. 5220–5224.
  • [23] P. Tsiakoulis, A. Potamianos, and D. Dimitriadis, “Instantaneous frequency and bandwidth estimation using filterbank arrays,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2013, pp. 8032–8036.
  • [24] J. F. Kaiser, “On a simple algorithm to calculate the energy of a signal,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 1990.
  • [25] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “Energy separation in signal modulations with application to speech analysis,” IEEE Transactions on Signal Processing, vol. 41, no. 10, pp. 3024–3051, 1993.
  • [26] J. F. Kaiser, “Some useful properties of Teager’s energy operators,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), vol. 3, 1993, pp. 149–152.
  • [27] P. Maragos and A. Potamianos, “Higher order differential energy operators,” IEEE Signal Process. Lett., vol. 2, no. 8, pp. 152–154, 1995.
  • [28] S. Lefkimmiatis, P. Maragos, and A. Katsamanis, “Multisensor multiband cross-energy tracking for feature extraction and recognition,” in Proc. IEEE Int. Conf. Acous., Speech, and Signal Processing (ICASSP), 2008, pp. 4741–4744.
  • [29] I. Rodomagoulakis and P. Maragos, “On the improvement of modulation features using multi-microphone energy tracking for robust distant speech recognition,” in Proc. European Signal Processing Conf. (EUSIPCO), 2017, pp. 558–562.
  • [30] D. Dimitriadis and P. Maragos, “Continuous energy demodulation methods and application to speech analysis,” Speech Communication, vol. 48, no. 7, pp. 819–837, 2006.
  • [31] S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada, “Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition,” in Proc. LREC, 2000.
  • [32] D. Dimitriadis, A. Potamianos, and P. Maragos, “A comparison of the squared energy and Teager-Kaiser operators for short-term energy estimation in additive noise,” IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2569–2581, 2009.
  • [33] M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, and M. Omologo, “The DIRHA-ENGLISH corpus and related tasks for distant-speech recognition in domestic environments,” in Proc. IEEE Workshop Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 275–282.
  • [34] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal et al., “The AMI meeting corpus: A pre-announcement,” in Machine Learning for Multimodal Interaction.   Springer, 2006, vol. LNCS-3869, pp. 28–39.
  • [35] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech and Language, vol. 46, pp. 535–557, 2017.
  • [36] S. P. Rath, D. Povey, K. Veselỳ, and J. Cernockỳ, “Improved feature processing for deep neural networks.” in Proc. Int. Conf. on Speech Communication and Technology (Interspeech), 2013, pp. 109–113.
  • [37] X. Anguera, C. Wooters, and J. Hernando, “Acoustic beamforming for speaker diarization of meetings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 7, pp. 2011–2022, 2007.
  • [38] K. Veselỳ, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks.” in Proc. Int. Conf. on Speech Communication and Technology (Interspeech), 2013, pp. 2345–2349.
  • [39] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts.” in Proc. Int. Conf. on Speech Communication and Technology (Interspeech), 2015, pp. 3214–3218.