Attempt at tracking states of the arts and recent results (bibliography) on speech recognition.
We train a bank of complex filters that operates on the raw waveform and feeds into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks (MFSC, for mel-frequency spectral coefficients), and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable MFSC. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
, and contain invaluable priors for speech recognition tasks. However, even if a consensus has been reached over the years on the proper setting of the hyperparameters of these filterbanks, there is no reason to believe that they are optimal representations of the input signal for all recognition tasks. In the same way deep architectures changed the landscape of computer vision by learning directly from raw pixels [3, 4], we believe that future end-to-end speech recognition systems will learn directly from the waveform.
propose an architecture composed of a convolutional layer followed by max-pooling and a nonlinearity, so that gammatone filterbanks correspond to a particular configuration of the network. explore an alternative architecture, with the intention to represent MFSC rather than gammatones. They propose a -layer convolutional architecture followed by two networks-in-networks , pretrained to reproduce MFSC.
We also focus on MFSC because they are the front-end of state-of-the-art phone  and speech  recognition systems. Our work builds on , who introduce a time-domain approximation of MFSC using the first-order coefficients of a scattering transform. This leads us to study an architecture using a convolutional layer with complex-valued weights, followed by a modulus operator and a low-pass filter. In contrast to , we propose a lightweight architecture that serves as a plug-in, learnable replacement to MFSC in deep neural networks. Moreover, we avoid pretraining by initializing the complex convolution weights with Gabor wavelets whose center frequency and bandwidth match those of MFSC.
We perform phone recognition experiments on TIMIT and show that, given competitive end-to-end models trained with MFSC as inputs, replacing the MFSC with the learnable architecture in the same models yields better performance. Moreover, our best model is obtained by learning everything except for the non-linearities, including a pre-emphasis layer.
We present the standard MFSC and their practical implementation. We then describe a learnable replacement of MFSC that uses only convolution operations in time domain, and how to set the weights to reproduce MFSC.
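Given an input signal $x$, MFSC are computed by first taking the short-time Fourier transform (STFT) of $x$, followed by taking averages in the frequency domain according to triangular filters whose center frequency and bandwidth increase linearly in log-scale. More formally, let $\phi$ be a Hanning window of width $s$ and $(\psi_n)_{n=1..N}$ be $N$ filters whose squared frequency responses are triangles centered on $(\eta_n)_{n=1..N}$ with full width at half maximum (FWHM) $(w_n)_{n=1..N}$. Denoting by $x_t : u \mapsto x(u)\,\phi(t-u)$ the windowed signal at time step $t$, and by $\hat{f}$ the Fourier transform of a function $f$, the filterbank is the set of functions $(t \mapsto Mx(t,n))_{n=1..N}$:

$$Mx(t,n) = \frac{1}{2\pi} \int |\hat{x}_t(\omega)|^2 \, |\hat{\psi}_n(\omega)|^2 \, d\omega \qquad (1)$$

As in , we approximate MFSC in the time domain using:

$$Mx(t,n) \approx \left( |x * \varphi_n|^2 * |\phi|^2 \right)(t) \qquad (2)$$

where $\varphi_n$ is a wavelet that approximates the $n$-th triangular filter in frequency, i.e. $|\hat{\varphi}_n|^2 \approx |\hat{\psi}_n|^2$, while $\phi$ is the Hanning window also used for the MFSC. The approximation is valid when the time support of $\varphi_n$ is smaller than that of $\phi$.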
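As a rough illustration of this time-domain approximation, the NumPy sketch below convolves a signal with a bank of complex wavelets, takes the squared modulus, and smooths the result with a squared Hanning window before decimating. The function name `td_filterbank` and its arguments are hypothetical, the wavelets are assumed to be supplied by the caller, and the signal is assumed to be longer than the filters; this is not the authors' implementation.

```python
import numpy as np


def td_filterbank(x, wavelets, window, stride):
    """Sketch of the approximation (|x * wavelet_n|^2 * window^2)(t).

    x        : real-valued signal, shape (T,), with T larger than the filters
    wavelets : complex band-pass filters, shape (N, W), assumed given
    window   : real low-pass window, e.g. np.hanning(W)
    stride   : decimation step in samples
    Returns an array of shape (N, ceil(T / stride)).
    """
    lowpass = window ** 2  # the squared window acts as the averaging filter
    channels = []
    for wavelet in wavelets:
        # Band-pass filtering followed by the squared modulus.
        envelope = np.abs(np.convolve(x, wavelet, mode="same")) ** 2
        # Low-pass averaging, then decimation.
        smoothed = np.convolve(envelope, lowpass, mode="same")
        channels.append(smoothed[::stride])
    return np.stack(channels)
```

Fed a pure tone, the channel whose wavelet is centered on the tone's frequency should carry almost all of the energy, which is a quick sanity check for the sketch.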
This approximation of MFSC is also known as a first-order scattering transform. This is the foundation of the deep scattering spectrum , which cascades scattering transforms to retrieve information that is lost in the MFSC. Deep scattering spectra have been used as inputs to neural networks trained for phone recognition  or classification , which showed better performances than comparable models trained on MFSC. In this work, we do not use the deep scattering spectrum. First-order scattering coefficients provide us with both a design for the first layers of the network architecture to operate on the waveform, and an initialization that approximates the MFSC computation.
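Given the MFSC center frequencies $(\eta_n)$ and FWHM $(w_n)$, we use (2) to approximate MFSC with Gabor wavelets:

$$\varphi_n(t) \propto e^{i \eta_n t} \, \frac{1}{\sqrt{2\pi}\,\sigma_n} \, e^{-\frac{t^2}{2\sigma_n^2}} \qquad (3)$$

where $\eta_n$ is the desired center frequency, and the width parameter $\sigma_n$ of the Gabor wavelet is set to match the desired FWHM $w_n$. Since for a frequency $\omega$ we have $\hat{\varphi}_n(\omega) \propto e^{-\frac{1}{2}\sigma_n^2(\omega-\eta_n)^2}$, the FWHM is $2\sqrt{2\log 2}\,\sigma_n^{-1}$ and we take $\sigma_n = 2\sqrt{2\log 2}/w_n$. Each $\varphi_n$ is then normalized to have the same energy as the corresponding triangular filter $\psi_n$. Figure 1 (a) shows in frequency-domain the triangular averaging operators of usual MFSC and the corresponding Gabor wavelets. Figures 1 (b) and (c) compare the 40-dimensional spectrograms of the MFSC and the Gabor wavelet approximation on a random sentence of the TIMIT corpus after mean-variance normalization, showing that the spectrograms are similar.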
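The Gabor initialization can be sketched in NumPy as follows, under several stated assumptions. The mel formula `2595 * log10(1 + f / 700)` is one common convention, not necessarily the one used here. Each triangle's FWHM is taken as half its base. Frequencies are expressed in Hz, in which case a Gaussian envelope of standard deviation σ seconds has a magnitude response of FWHM √(2 ln 2)/(πσ). For simplicity the wavelets are scaled to unit energy rather than to the energy of the matching triangular filter. All function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)  # assumed mel convention


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers_and_fwhm(n_filters, f_min, f_max):
    """Center frequencies and FWHM (Hz) of triangles equally spaced in mel."""
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2))
    centers = edges[1:-1]
    fwhm = (edges[2:] - edges[:-2]) / 2.0  # a triangle's FWHM is half its base
    return centers, fwhm


def gabor_bank(centers, fwhm, sample_rate, width):
    """Complex Gabor wavelets, shape (N, width), matching centers and FWHM."""
    t = (np.arange(width) - width // 2) / sample_rate  # time axis in seconds
    sigma = np.sqrt(2.0 * np.log(2.0)) / (np.pi * fwhm)  # envelope width in seconds
    envelope = np.exp(-t[None, :] ** 2 / (2.0 * sigma[:, None] ** 2))
    carrier = np.exp(2j * np.pi * centers[:, None] * t[None, :])
    bank = envelope * carrier
    # Assumption: unit-energy normalization instead of matching the triangles.
    return bank / np.linalg.norm(bank, axis=1, keepdims=True)
```

The real and imaginary parts of each row would then initialize the two real-valued filters that together represent one complex filter.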
MFSC specification. The standard setting in speech recognition is to start from the waveform sampled at kHz and represented as -bit signed integers. The STFT is computed with frequency bins using Hanning windows of width ms, and decimation is applied by taking the STFT every ms. There are filters, with center frequencies that span the range by being equally spaced on a mel-scale. The final features are the . In practice, the STFT is applied to the raw signal after a pre-emphasis with parameter , and coefficients have mean-variance normalization per utterance.
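For reference, a conventional STFT-based MFSC pipeline along the lines of this specification might look like the sketch below. All numeric settings are left as arguments rather than fixed here. The mel formula, the small floor added before the logarithm, and the function names are assumptions made for illustration, not details taken from the specification above.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)  # assumed mel convention


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max):
    """Triangular filters with centers equally spaced on a mel scale."""
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2))
    lower, center, upper = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)[None, :]
    rising = (freqs - lower) / (center - lower)
    falling = (upper - freqs) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))  # (n_filters, n_fft // 2 + 1)


def mfsc(x, sample_rate, n_fft, win_len, hop, n_filters, f_min, f_max, preemph):
    """Log mel-filterbank energies, shape (n_frames, n_filters)."""
    x = np.asarray(x, dtype=np.float64)
    # Pre-emphasis: y[t] = x[t] - preemph * x[t - 1].
    x = np.append(x[0], x[1:] - preemph * x[:-1])
    # Slice the signal into overlapping frames and apply a Hanning window.
    n_frames = 1 + (len(x) - win_len) // hop
    index = hop * np.arange(n_frames)[:, None] + np.arange(win_len)[None, :]
    frames = x[index] * np.hanning(win_len)
    # Power spectrum, then triangular averaging in the frequency domain.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max).T
    feats = np.log(energies + 1e-10)  # assumed compression with a small floor
    # Per-utterance mean-variance normalization of each coefficient.
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-10)
```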
| Layer type | Input size | Output size | Width | Stride |
|---|---|---|---|---|
| Add 1, Log | - | - | - | - |
| Learning mode | Dev PER | Test PER |
|---|---|---|

Table 2: PER of the CNN-5L-ReLU-do0.7 model trained on MFSC and different learning setups of TD-filterbanks.
| Model | Input | Dev PER | Test PER |
|---|---|---|---|
| Hybrid HMM/Hierarchical CNN + Maxout + Dropout | MFSC + energy + Δ + ΔΔ | 13.3 | 16.5 |
| CNN + CRF on raw speech | wav | - | 29.2 |
| Attention model + Conv. Features + Smooth Focus | MFSC + energy + Δ + ΔΔ | 15.8 | 17.6 |
| LSTM + Segmental CRF | MFSC + Δ + ΔΔ | - | 18.9 |
| LSTM + Segmental CRF | MFCC + LDA + MLLT + MLLR | - | 17.3 |
| CNN-5L-ReLU-do0.5 + TD-filterbanks | wav | 18.2 | 20.4 |
| CNN-5L-ReLU-do0.7 + TD-filterbanks | wav | 17.3 | 20.3 |
| CNN-8L-PReLU-do0.7 + TD-filterbanks | wav | 15.6 | 18.1 |
| CNN-8L-PReLU-do0.7 + TD-filterbanks-Learn-all-pre-emp | wav | 15.6 | 18.0 |
Learnable architecture specification. The time-domain convolutional architecture is summarized in Table 1. With a waveform sampled at kHz, a Hanning window is a convolution operator with a span of samples (ms). Since the energy of the Gabor wavelets approximating standard MFSC has a time spread smaller than the Hanning window, the complex wavelet+modulus operations are implemented as a convolutional layer taking the raw waveform as input, with a width and filters ( filters for the real and imaginary parts respectively). This layer is on the top row of Table 1. The modulus operator is implemented with “feature L2 pooling”, a layer taking an input of size and outputs of size such that . The windowing layer (third row of Table 1) is a grouped convolution, meaning that each output filter only sees the input filter with the same index. The decimation of ms is implemented in the stride of of this layer. Notice that to approximate the mel-filterbanks, the square of the Hanning window is used and biases in both convolutional layers are set to zero. We keep them at zero during training. Since we do not impose positivity constraints on the weights when learning, we take the absolute value of the output of the grouped convolution and add 1 to it before applying log compression. Contrary to the MFSC, mean-variance normalization is applied to the waveform rather than after the convolutions. In the default implementation of the TD-filterbanks, we do not apply pre-emphasis. However, in our last experiment, we add a convolutional layer below the TD-filterbanks, with width and stride , initialized with the pre-emphasis parameters, as another learnable component.
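The forward pass described above can be sketched in plain NumPy as follows. This is only an illustration under assumptions: the real parts are assumed to be stacked before the imaginary parts in the first layer, the convolutions use no padding, the squared modulus is taken before averaging (as in the time-domain approximation), and the function and argument names are hypothetical. A trainable version would express the same operations in a deep learning framework.

```python
import numpy as np


def td_filterbank_forward(wav, conv_weights, lowpass_weights, stride):
    """Forward pass of a TD-filterbank front-end (illustrative sketch).

    wav             : raw waveform, shape (T,)
    conv_weights    : (2N, W) real filters; rows [0, N) are assumed to hold
                      the real parts and rows [N, 2N) the imaginary parts
    lowpass_weights : (N, W) averaging filters, one per channel
    stride          : decimation applied by the grouped convolution
    """
    # Mean-variance normalization is applied on the waveform.
    wav = (wav - wav.mean()) / (wav.std() + 1e-8)
    n = conv_weights.shape[0] // 2
    # Layer 1: 2N convolutions, biases fixed to zero. The kernel is reversed
    # so that np.convolve computes the cross-correlation used by conv layers.
    feats = np.stack([np.convolve(wav, w[::-1], mode="valid") for w in conv_weights])
    # Layer 2: feature L2 pooling pairs real and imaginary parts (modulus).
    modulus = np.sqrt(feats[:n] ** 2 + feats[n:] ** 2)
    energy = modulus ** 2
    # Layer 3: grouped convolution, each channel sees only its own input.
    pooled = np.stack([
        np.convolve(e, lp[::-1], mode="valid")[::stride]
        for e, lp in zip(energy, lowpass_weights)
    ])
    # Learned weights may become negative, hence the absolute value
    # before adding 1 and taking the log.
    return np.log(1.0 + np.abs(pooled))
```

Because the modulus is symmetric in its two inputs, swapping the real and imaginary halves of `conv_weights` leaves the output unchanged.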
We perform phone recognition experiments on TIMIT using the standard train/dev/test split. We train and evaluate our models with 39 phonemes. We experiment with three architectures. The first one consists of 5 layers of convolution of width 5 and 1000 feature maps, with ReLU activation functions, and a dropout of 0.5 on every layer but the input and output ones. The second model has the same architecture but a dropout of 0.7 is used. The third model has 8 layers of convolution, PReLU nonlinearities and a dropout of 0.7. All our models are trained end-to-end with the Autoseg criterion, using stochastic gradient descent. We compare all models using either the baseline MFSC as input or our learnable TD-filterbank front-end. We perform the same grid-search for both MFSC baselines and models trained on TD-filterbanks, over the learning rate of the model and the learning rate of the Autoseg criterion, training every model for 2000 epochs. We use the standard dev set for early stopping and hyperparameter selection.
Throughout our experiments, we tried four different settings for the TD-filterbank layers:
Fixed: Initialize the layers to match MFSC and keep their parameters fixed when training the model.
Learn-all: Initialize the layers to match MFSC and let the filterbank and the averaging be learned jointly with the model.
Learn-filterbank: Start from the same initialization and only learn the filterbank with the model, keeping the averaging fixed to a squared Hanning window.
Randinit: Initialize the layers randomly and learn them with the network.
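The four settings above differ only in how the two front-end layers are initialized and which of them receive gradient updates. A minimal way to encode this is sketched below; the class, field, and layer names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontEndMode:
    init: str               # "mfsc" (Gabor wavelets + squared Hanning) or "random"
    learn_filterbank: bool  # update the complex convolution weights
    learn_averaging: bool   # update the low-pass (windowing) weights


MODES = {
    "Fixed": FrontEndMode("mfsc", learn_filterbank=False, learn_averaging=False),
    "Learn-all": FrontEndMode("mfsc", learn_filterbank=True, learn_averaging=True),
    "Learn-filterbank": FrontEndMode("mfsc", learn_filterbank=True, learn_averaging=False),
    "Randinit": FrontEndMode("random", learn_filterbank=True, learn_averaging=True),
}


def trainable_layers(mode_name):
    """Names of the front-end layers that should be updated during training."""
    mode = MODES[mode_name]
    layers = []
    if mode.learn_filterbank:
        layers.append("filterbank")
    if mode.learn_averaging:
        layers.append("averaging")
    return layers
```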
Table 2 shows the comparative performance of an identical architecture trained on the four types of TD-filterbanks. Training on fixed layers moderately worsens the performance; we hypothesize that this is due to the absence of mean-variance normalization on top of TD-filterbanks, as is performed on MFSC. A striking observation is that a model trained on randomly initialized TD-filterbanks performs considerably worse than all other models, which shows the importance of the initialization. Finally, we observe better results when learning only the filterbank than when learning both the filterbank and the averaging, although across architectures it was not clear which of the two performs better. Moreover, when learning both complex filters and averaging, we observe that the learned averaging filters are almost identical to their initialization. Thus, in the following experiments, we choose to use the Learn-filterbank mode for the TD-filterbanks.
We report PER on the standard dev and test sets of TIMIT. For each architecture, the model trained on TD-filterbanks systematically outperforms the equivalent model trained on MFSC, even though we constrained our TD-filterbanks to be comparable to the MFSC and do not learn the low-pass filter. This shows that by only learning a new bank of 40 filters, we can outperform the MFSC for phone recognition. This gain in performance is obtained at a minimal cost in terms of number of parameters: even for the smallest architecture, the increase in number of parameters in switching from MFSC to TD-filterbanks is . We also compare to baselines from the literature. One baseline trained on the waveform gets a PER of 29.2 on the test set, which is in a range absolute above our models trained on the waveform. The Wavenet architecture, also trained on the waveform, yields a PER of , which is higher than our best models despite using the phonetic alignment and an auxiliary prediction loss. Our best model on the waveform also outperforms a 2-dimensional CNN trained on MFSC and an LSTM trained on MFSC with derivatives. Finally, by adding a learnable pre-emphasis layer below the TD-filterbanks, we reach 18.0 PER on the test set.
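For readers unfamiliar with the metric, phone error rate is conventionally the total edit distance between predicted and reference phone sequences, divided by the total number of reference phones. The sketch below follows that convention; the exact scoring procedure used in these experiments is not detailed here, so this is an assumption, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone labels."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phone
                curr[j - 1] + 1,         # insertion of a spurious phone
                prev[j - 1] + (r != h),  # substitution, free on a match
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(references, hypotheses):
    """PER in percent over a list of utterances."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total
```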
We analyze filters learned by the first layer of the CNN-8L-PReLU-do0.7 + TD-filterbanks model. Examples of learned filters are shown in Figure 2. The magnitude of the frequency response for each of the 40 filters is plotted in Figure 3. Overall, the filters tend to be well localized in time and frequency, and a number of filters became asymmetric during the learning process, with a sharp attack and slow decay of the impulse response. This is a type of asymmetry also found in human and animal auditory filters estimated from behavioral and physiological data. In Figure 3, we further see that the initial mel-scale of frequency is mostly preserved, but that a lot of variability in the filter bandwidths is introduced.
A prominent question is whether the analyticity of the initial filterbank is preserved throughout the learning process even though nothing in our optimization method is biased towards keeping filters analytic. A positive answer would suggest that complex filters in their full generality are not necessary to obtain the increase in performance we observed. This would be especially interesting because, unlike arbitrary complex filters, analytic filters have a simple interpretation in terms of real-domain signal processing: taking the squared modulus of the convolution of a real signal with an analytic filter performs a sub-band Hilbert envelope extraction .
A signal is analytic if and only if it has no energy in the negative frequencies. Accordingly, we see in Figure 3 that there is zero energy in this region for the initialization filterbank. After learning, a moderate amount of energy appears in the negative frequency region for certain filters. To quantify this, we computed for each filter the ratio between the energy in negative versus positive frequency components (our model cannot identify whether a given filter plays the role of the real or imaginary part in the associated complex filter, so we chose the assignment yielding the smallest ratio). This ratio is 0 for a perfectly analytic filter and 1 for a purely real filter. We find an average for all learned filters of . Filters with significant energy in negative frequencies are mostly filters with an intermediate preferred frequency (between 1000 Hz and 3000 Hz) and their negative frequency spectrum appears to be essentially a down-scaled version of their positive frequency spectrum.
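One way to compute such a ratio is sketched below with NumPy. The FFT size and the exclusion of the DC and Nyquist bins are illustrative choices, not details taken from the analysis above, and the function name is hypothetical. Swapping the real and imaginary parts mirrors the spectrum, so taking the smaller of the two band energies over the larger one is equivalent to choosing the assignment with the smallest ratio.

```python
import numpy as np


def analyticity_ratio(part_a, part_b, n_fft=4096):
    """Energy ratio between negative and positive frequencies of a complex filter.

    part_a, part_b : the two real-valued filters forming one complex filter;
                     which one is the real part is treated as unknown.
    Returns 0 for a perfectly analytic filter and 1 for a purely real one.
    """
    energy = np.abs(np.fft.fft(part_a + 1j * part_b, n_fft)) ** 2
    positive = energy[1:n_fft // 2].sum()      # bins strictly between DC and Nyquist
    negative = energy[n_fft // 2 + 1:].sum()   # mirrored negative-frequency bins
    low, high = sorted((negative, positive))
    if high == 0.0:
        raise ValueError("filter has no energy outside the DC and Nyquist bins")
    # Smallest ratio over the two possible real/imaginary assignments.
    return low / high
```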
We proposed a lightweight architecture which, at initialization, approximates the computation of MFSC and can then be fine-tuned with an end-to-end phone recognition system. With a number of parameters comparable to standard MFSC, a TD-filterbank front-end is consistently better in our experiments. Learning all linear operations in the MFSC derivation, from pre-emphasis up to averaging, provides the best model. In future work, we will perform large-scale experiments with TD-filterbanks to test whether a new state of the art can be achieved by training from the waveform.
The authors thank Mark Tygert for useful discussions, and Vitaliy Liptchinsky and Ronan Collobert for help with the implementation. This research was partially funded by the European Research Council (ERC-2011-AdG-295810 BOOTPHON) and the Agence Nationale pour la Recherche (ANR-10-LABX-0087 IEC, ANR-10-IDEX-0001-02 PSL*).
“Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.