Examples of feature normalisation techniques are cepstral mean and variance normalisation (CMVN), histogram equalisation (HEQ) and stereo-based piecewise linear compensation for environments (SPLICE). Some of these techniques operate at the utterance level and estimate statistical parameters from the test utterances, while others operate on individual feature vectors. Parameter estimation in the former set of techniques requires sufficiently long utterances for reliable estimation, and also some additional processing. The latter are best suited for shorter utterances and real-time applications.
The standard features used for speech feature extraction are the Mel-frequency cepstral coefficients (MFCCs). In the final stage of their extraction process, a discrete cosine transform (DCT; hereafter, DCT refers to the Type-II DCT by default) is used for dimensionality reduction and feature decorrelation. An interesting question at this juncture is whether the DCT is the best method to achieve feature dimensionality reduction and feature decorrelation along with noise robustness.
Heteroscedastic linear discriminant analysis (HLDA) [Kumar, 1997] is one technique which achieves these qualities under low noise conditions, even though it was originally designed for a different objective. The technique fails in high noise environments. This thesis proposes using non-negative matrix factorisation (NMF) [Lee and Seung, 1999] and reconstruction of the features prior to applying the DCT in the standard MFCC extraction process. Two methods of achieving noise robust features are discussed in detail. These techniques do not assume any information about noise during the training process.
It is also reasonable to identify the commonly encountered noise environments during training, and learn their respective compensations. During the real-time operation of the system, the background noise can be classified into one of the known kinds of noise so that the corresponding compensation can be applied. An example of one such technique is the stereo-based piecewise linear compensation for environments (SPLICE) [Deng et al., 2000]. This technique requires stereo data during training, which consists of speech recorded simultaneously using two microphones. One is a close-talk microphone capturing mostly clean speech and the other is a far-field microphone, which captures noise along with speech.
This technique is of particular interest because it operates on individual feature vectors, and the compensation is an easily implementable linear transformation of the feature. However, there are two disadvantages of SPLICE. The algorithm fails when the test noise condition is not seen during training. Also, owing to its requirement of stereo data for training, the usage of the technique is quite restricted. This thesis proposes a modified version of SPLICE that improves its performance in all noise conditions, predominantly during severe noise mismatch. An extension of this modification is also proposed for datasets that are not stereo recorded, with minimal performance degradation as compared to the conventional SPLICE. To further boost the performance, run-time adaptation of the parameters is proposed, which is computationally very efficient when compared to maximum likelihood linear regression (MLLR), a standard model adaptation method.
1.2. Overview of the Thesis
The rest of the thesis is organised as follows. Chapter 2 summarises the required background and reviews some existing techniques in the literature. Chapter 3 discusses the proposed methods of performing feature compensation using NMF during MFCC extraction, which assume no information about noise during training. Chapter 4 details the proposed modifications and techniques using SPLICE. Finally, Chapter 5 concludes the thesis, indicating possible future extensions.
2.1. HMM-GMM based Speech Recognition
The aim of a speech recognition system is to efficiently convert a speech signal into a text transcription of spoken words [Rabiner and Schafer, 2010]. This requires extracting relevant features from the abundantly available speech samples, a step called feature extraction. By this process, the speech signal is converted into a stream of feature vectors capturing its time-varying spectral behaviour.
The feature vector streams of each basic sound unit are statistically modelled as HMMs using a training process called the Baum-Welch algorithm. This requires a sufficient amount of transcribed training speech. A dictionary specifying the conversion of words into the basic sound units is necessary in this process. During testing, an identical feature extraction process is performed, followed by Viterbi decoding, to obtain the text output. The decoding is performed on a recognition network built using the HMMs, the dictionary and a language model. The language model assigns probabilities to sequences of words, and helps in boosting the performance of the ASR system.
This process is summarised in Figure 2.1.1. A more detailed explanation can be found in [Rabiner and Schafer, 2010]. The choice of the basic sound unit depends on the size of the vocabulary. Word models are convenient to use for tasks such as digit recognition. For large vocabulary continuous ASR, triphone models are built, which can be concatenated appropriately during Viterbi decoding to represent words.
2.2. MFCC Feature Extraction
Apart from the information about what has been spoken, a speech signal also contains distinct characteristics which vary with the recording conditions such as background noise, microphone distortion, reverberation and so on. Speaker dependent parameters such as vocal tract structure, accent, mood and style of speaking also affect the signal. Thus an ASR system requires a robust feature extraction process which captures only the speech information, discarding the rest.
MFCCs have been shown to be one of the effective features for ASR. The extraction of MFCCs is summarised in Figure 2.2.1, and involves the following steps:
Short-time processing: Convert the speech signal into overlapping frames. On each frame, apply pre-emphasis to give more weight to the higher formants, apply a Hamming window to minimise the signal truncation effects, and take the magnitude of the DFT.
Log Mel filterbank (LMFB): Apply a triangular filterbank with Mel-warping and obtain a single coefficient output for each filter. Apply the log operation on each coefficient to reduce the dynamic range. These operations are motivated by the acoustics of human hearing.
DCT: Apply the DCT on each frame of LMFB coefficients. This provides energy compaction useful for dimensionality reduction, as well as approximate decorrelation useful for diagonal covariance modelling.
A cepstral lifter is generally used to give approximately equal weight to all the coefficients in MFCCs (the amplitude diminishes with the coefficient index due to the energy compaction property of the DCT). Delta and acceleration coefficients are also appended to the liftered MFCCs, to capture the dynamic information in the speech signal. Finally, cepstral mean subtraction (CMS) is performed on these composite features, to remove any stationary mismatch effects caused by recording in different environments.
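The three extraction steps listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the front-end used in the experiments: the function names, the 8 kHz sampling rate, the 25 ms / 10 ms framing, the pre-emphasis coefficient 0.97, the 256-point FFT, the 23 filters and the flooring of the filterbank outputs at 1.0 are all assumptions chosen for the example, and the lifter, delta and CMS stages are omitted.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the Mel scale: (n_filters, n_fft // 2 + 1)."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def dct2_matrix(n_out, n_in):
    """Unnormalised Type-II DCT matrix of shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (n + 0.5) / n_in)

def mfcc(signal, sample_rate=8000, frame_ms=25, shift_ms=10,
         n_filters=23, n_ceps=13, preemph=0.97, n_fft=256):
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Short-time processing: overlapping frames, pre-emphasis, Hamming window, |DFT|.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx]
    frames = np.concatenate(
        [frames[:, :1], frames[:, 1:] - preemph * frames[:, :-1]], axis=1)
    frames = frames * np.hamming(frame_len)
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Log Mel filterbank: one coefficient per filter, floored at 1.0 before the log.
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    lmfb = np.log(np.maximum(mag @ fbank.T, 1.0))
    # DCT-II: keep the first n_ceps coefficients.
    ceps = lmfb @ dct2_matrix(n_ceps, n_filters).T
    return lmfb, ceps
```

With these assumed settings, a one-second signal at 8 kHz yields 98 frames of 23 LMFB coefficients and 13 cepstral coefficients each.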
2.3. Need for Additional Processing
Let us look at the robustness aspects of MFCC composite features in response to two kinds of undesired variations, viz., speaker-based and environment-based.
During MFCC extraction, most of the pitch information is discarded by the smoothing operation of LMFB. Some speaker-specific characteristics are removed by truncating the higher cepstral coefficients after DCT operation. However, other variations such as those occurring due to differences in vocal-tract structures can be compensated for better recognition results.
MFCC composite features are less robust to environment changes. The presence of background noise, especially during testing, causes serious non-linear distortions which cannot be compensated using CMS. This acoustic mismatch between the training and testing environments needs additional compensation.
2.3.1. Feature Compensation and Model Adaptation
The techniques which operate on features to nullify the undesired effects are called feature compensation techniques. Examples are CMVN and HEQ. These methods are usually simple and can be incorporated into real-time applications. So there is an interest in understanding and improving these techniques.
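As a concrete instance of such a feature compensation technique, utterance-level CMVN can be sketched as follows. The function name `cmvn` and the small constant `eps` are assumptions made for this illustration; each feature dimension of one utterance is shifted to zero mean and scaled to unit variance.

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Utterance-level CMVN. features: array of shape (n_frames, n_dims)."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)   # per-dimension mean over the utterance
    std = features.std(axis=0)     # per-dimension standard deviation
    return (features - mean) / (std + eps)
```

Since the statistics are estimated from the utterance itself, the estimate is only reliable when the utterance is sufficiently long.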
In contrast, there is a class of model adaptation techniques which refine the models to compensate the effects. These are usually computationally intensive and yield high recognition rates. Examples are maximum likelihood linear regression (MLLR) and speaker adaptive training (SAT).
Some techniques such as joint factor analysis (JFA) and joint uncertainty decoding (JUD) use a combination of both feature and model compensation to further improve the recognition.
However, this thesis focuses on feature compensation performed on a frame-by-frame basis, owing to its suitability for real-time applications.
2.4. A Brief Review of Some Techniques Used
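Given features belonging to various classes, HLDA aims at achieving discrimination among the classes through dimensionality reduction. It linearly transforms $n$-dimensional features such that only $p$ dimensions of the transformed features have the discriminating capability, and the remaining $n-p$ dimensions can be discarded. In ASR, the features assigned to each state of an HMM during training are typically considered as belonging to a class. The transformation is estimated from the training data using class labels obtained from their first-pass transcriptions. The $p$-dimensional new features are then used to train and test the models.
Estimation of HLDA transform
Let the feature $\mathbf{x}_t$ be transformed to obtain $\mathbf{y}_t = \mathbf{A}\mathbf{x}_t$. Let $g(t) = j$ denote that the class label of $\mathbf{x}_t$ is $j$. For each class $j$, the set $\{\mathbf{x}_t : g(t) = j\}$ is assumed to be Gaussian, and thus so are the corresponding $\mathbf{y}_t$. It is desired that the last $n-p$ dimensions of $\mathbf{y}_t$ do not contain any discriminative information, i.e., the mean $\boldsymbol{\mu}_j$ and covariance $\boldsymbol{\Sigma}_j$ of all the classes are identical in their last $n-p$ dimensions, as
$$\boldsymbol{\mu}_j = \begin{bmatrix} \boldsymbol{\mu}_j^{(p)} \\ \boldsymbol{\mu}_0 \end{bmatrix}, \qquad \boldsymbol{\Sigma}_j = \begin{bmatrix} \boldsymbol{\Sigma}_j^{(p)} & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma}_0 \end{bmatrix}$$
where $\boldsymbol{\Sigma}_j^{(p)}$ and $\boldsymbol{\Sigma}_0$ are of dimensions $p \times p$ and $(n-p) \times (n-p)$ respectively. If this is achieved, the last $n-p$ dimensions of $\mathbf{y}_t$ can be discarded without loss of discriminability.
The density of each $\mathbf{x}_t$ is modelled as
$$p(\mathbf{x}_t) = \frac{|\det \mathbf{A}|}{\sqrt{(2\pi)^n \, |\boldsymbol{\Sigma}_{g(t)}|}} \exp\left( -\frac{1}{2} \left(\mathbf{A}\mathbf{x}_t - \boldsymbol{\mu}_{g(t)}\right)^T \boldsymbol{\Sigma}_{g(t)}^{-1} \left(\mathbf{A}\mathbf{x}_t - \boldsymbol{\mu}_{g(t)}\right) \right)$$
A numerical solution for $\mathbf{A}$ can be derived by differentiating the log-likelihood function w.r.t. $\mathbf{A}$, and using the maximum likelihood estimates of $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ obtained from [Kumar, 1997].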
HEQ techniques are used to compensate for the acoustic mismatch between the training and testing conditions of an ASR system, thereby giving improved performance. HEQ is a feature compensation technique which defines a transformation that maps the distribution of test speech features onto a reference distribution. As shown in Figure 2.4.1, a function $f$ can be learned such that any noisy feature component $y$ can be transformed to its corresponding equalised version $\hat{x} = f(y)$.
The reference distribution can be that of either clean speech, training data or even a parametric function. Both the training and test features need to be equalised to avoid mismatch. Since HEQ matches the whole distribution, it matches the means, variances and all other higher moments.
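Let $y$ be a component of a noisy speech feature vector, modelled as a random variable with cumulative distribution $C_Y$. The relation between the cumulative distribution of $y$ and the reference cumulative distribution $C_X$ of its equalised version $\hat{x}$ is given by
$$C_Y(y) = C_X(\hat{x}), \qquad \text{i.e.,} \qquad \hat{x} = C_X^{-1}\big(C_Y(y)\big).$$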
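The CDF-matching relation of HEQ can be illustrated with a quantile-based sketch, in which both cumulative distributions are approximated by empirical quantiles and the mapping is obtained by piecewise-linear interpolation. The function names `reference_quantiles` and `heq`, and the default of 100 quantiles, are assumptions made for this example and do not describe a particular toolkit.

```python
import numpy as np

def reference_quantiles(train_features, n_quantiles=100):
    """Per-dimension quantiles of the reference data: (n_quantiles + 1, n_dims)."""
    probs = np.linspace(0.0, 1.0, n_quantiles + 1)
    return np.quantile(np.asarray(train_features, dtype=float), probs, axis=0)

def heq(test_features, ref_quantiles):
    """Map each dimension of test_features (n_frames, n_dims) onto the reference."""
    test_features = np.asarray(test_features, dtype=float)
    probs = np.linspace(0.0, 1.0, ref_quantiles.shape[0])
    test_q = np.quantile(test_features, probs, axis=0)
    out = np.empty_like(test_features)
    for d in range(test_features.shape[1]):
        # Empirical CDF of the test data, evaluated at each test value.
        cdf = np.interp(test_features[:, d], test_q[:, d], probs)
        # Inverse reference CDF, evaluated at those CDF values.
        out[:, d] = np.interp(cdf, probs, ref_quantiles[:, d])
    return out
```

Because the mapping is monotonic, the ordering of the values in each dimension is preserved while their distribution is moved onto the reference.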
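MLLR is a widely used adaptation technique based on the maximum-likelihood principle [Leggetter, 1995, Gales, 1998]. It performs adaptation through regression-based transformations on (usually) the mean vectors $\boldsymbol{\mu}_m$ of the system of HMMs, $m$ being the mixture index. The transformations are estimated such that the original system is tuned to a new speaker or environment. The adapted mean is
$$\hat{\boldsymbol{\mu}}_m = \mathbf{W}\boldsymbol{\xi}_m$$
where $\mathbf{W}$ is an MLLR transformation matrix of dimension $d \times (d+1)$, $d$ being the feature dimension, and $\boldsymbol{\xi}_m = [1 \;\; \boldsymbol{\mu}_m^T]^T$ is the extended mean vector. An estimate of $\mathbf{W}$ is obtained by maximising the likelihood function of the adaptation data $\mathbf{O}$ given the transformed system of HMMs $\lambda$,
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} \; p(\mathbf{O} \mid \lambda, \mathbf{W}).$$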
In this work, MLLR mean adaptation has been used as a global transform to adjust the system of HMMs to a new noise environment encountered during testing. The adaptation data are the same as test data, which consist of files recorded in a particular noise condition with sufficient speaker and speech variations.
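Once a global MLLR transform is available, applying it to the Gaussian means is a single matrix product. The sketch below shows only this application step, not the maximum-likelihood estimation of the transform. The function name and the convention that the transform is stored as a bias column followed by a square matrix, acting on an extended mean vector whose first element is 1, are assumptions made for the illustration.

```python
import numpy as np

def apply_mllr_mean_transform(means, W):
    """Adapt Gaussian means with a global MLLR transform.

    means: (n_mixtures, d) original mean vectors.
    W:     (d, d + 1) transform, assumed to be laid out as [b A].
    Returns the adapted means, shape (n_mixtures, d).
    """
    means = np.asarray(means, dtype=float)
    ones = np.ones((means.shape[0], 1))
    extended = np.hstack([ones, means])   # extended mean vectors [1, mu^T]
    return extended @ np.asarray(W, dtype=float).T
```

Under this layout, each adapted mean equals the square part of the transform applied to the original mean, plus the bias.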
SPLICE is a popular and efficient noise robust feature enhancement technique. It partitions the noisy feature space into $M$ classes, and learns a linear transformation based noise compensation for each partition class during training, using stereo data. Any test vector $\mathbf{y}$ is soft-assigned to one or more classes by computing $p(m \mid \mathbf{y})$, and is compensated by applying the weighted combination of linear transformations to get the cleaned version $\hat{\mathbf{x}}$:
$$\hat{\mathbf{x}} = \sum_{m=1}^{M} p(m \mid \mathbf{y}) \left( \mathbf{A}_m \mathbf{y} + \mathbf{b}_m \right)$$
$\mathbf{A}_m$ and $\mathbf{b}_m$ are estimated during training using stereo data. The training noisy vectors are modelled using a Gaussian mixture model (GMM) of $M$ mixtures, and $p(m \mid \mathbf{y})$ is calculated for a test vector as a set of posterior probabilities w.r.t. the GMM. Thus the partition class is decided by the mixture assignments $p(m \mid \mathbf{y})$. This is illustrated in Figure 2.4.2.
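The test-time part of SPLICE, i.e. soft assignment to the GMM mixtures followed by a posterior-weighted combination of per-class linear transformations, can be sketched as below. This is a minimal sketch under stated assumptions: the GMM is taken to have diagonal covariances, the per-class matrices and biases are assumed to be already trained, and all function and argument names are hypothetical.

```python
import numpy as np

def gmm_posteriors(y, weights, means, variances):
    """Posterior p(m | y) under a diagonal-covariance GMM.

    y: (d,), weights: (M,), means and variances: (M, d).
    """
    log_lik = -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                            + (y - means) ** 2 / variances, axis=1)
    log_post = np.log(weights) + log_lik
    log_post -= log_post.max()            # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def splice_compensate(y, weights, means, variances, A, b):
    """Posterior-weighted sum of per-class affine maps. A: (M, d, d), b: (M, d)."""
    post = gmm_posteriors(y, weights, means, variances)
    per_class = np.einsum('mij,j->mi', A, y) + b   # A_m y + b_m for every class m
    return post @ per_class
```

Each frame is compensated independently of its neighbours, which is what makes the technique suitable for frame-by-frame, real-time operation.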
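NMF is an approximate matrix decomposition
$$\mathbf{V} \approx \mathbf{W}\mathbf{H} \tag{2.4.2}$$
where $\mathbf{V} \in \mathbb{R}^{D \times N}$, consisting of non-negative elements, is decomposed into two non-negative matrices $\mathbf{W} \in \mathbb{R}^{D \times R}$ and $\mathbf{H} \in \mathbb{R}^{R \times N}$. In the context of speech data, the columns of $\mathbf{V}$ constitute non-negative feature vectors $\mathbf{v}_n$. After the decomposition, $\mathbf{W}$ is a non-negative dictionary of basis vectors along columns, representing a useful subspace in which the $\mathbf{v}_n$ are contained, when $R < D$. $\mathbf{H}$ is a matrix of vectors $\mathbf{h}_n$ such that each $\mathbf{h}_n$ is a set of weights or activation coefficients acting upon all the bases to give the corresponding $\mathbf{v}_n$, independently of all the other columns, as shown in equation 2.4.3.
$$\mathbf{v}_n \approx \mathbf{W}\mathbf{h}_n \tag{2.4.3}$$
The decomposition (2.4.2) can be learned by randomly initialising $\mathbf{W}$ and $\mathbf{H}$. The estimates of $\mathbf{W}$ and $\mathbf{H}$ are iteratively improved by minimising a KL-divergence based cost function
$$D(\mathbf{V} \,\|\, \mathbf{W}\mathbf{H}) = \sum \left( \mathbf{V} \otimes \log\frac{\mathbf{V}}{\mathbf{W}\mathbf{H}} - \mathbf{V} + \mathbf{W}\mathbf{H} \right) \tag{2.4.4}$$
where $\otimes$ denotes the Hadamard product, the division inside the log is element-wise and the summation is over all the elements of the matrix. The optimisation is carried out using the update rules
$$\mathbf{W} \leftarrow \mathbf{W} \otimes \frac{\dfrac{\mathbf{V}}{\mathbf{W}\mathbf{H}}\,\mathbf{H}^T}{\mathbf{1}\,\mathbf{H}^T} \tag{2.4.5}$$
$$\mathbf{H} \leftarrow \mathbf{H} \otimes \frac{\mathbf{W}^T\,\dfrac{\mathbf{V}}{\mathbf{W}\mathbf{H}}}{\mathbf{W}^T\,\mathbf{1}} \tag{2.4.6}$$
where "$\leftarrow$" refers to the assignment operator, $\mathbf{1}$ is a $D \times N$ matrix of ones and the divisions are element-wise. It can be seen that the update rules are multiplicative, i.e., the matrices are updated by performing just a product with another matrix, making the implementation simple and quick. The columns of $\mathbf{W}$ can be thought of as basic building blocks that can reconstruct speech features.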
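A minimal NumPy sketch of KL-divergence NMF with multiplicative updates is given below, using the standard Lee and Seung update rules. The function names, the random initialisation, the iteration count and the small constant `eps` guarding against division by zero are assumptions made for the illustration; column normalisation of the dictionary is omitted.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalised KL divergence between V and its approximation WH."""
    return np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH)

def nmf_kl(V, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorise non-negative V (D x N) as W (D x rank) times H (rank x N)."""
    rng = np.random.default_rng(seed)
    D, N = V.shape
    W = rng.random((D, rank)) + eps
    H = rng.random((rank, N)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        # Dictionary update with the activations held fixed.
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        # Activation update with the dictionary held fixed.
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
    return W, H
```

Since every factor in the updates is non-negative, the two matrices remain non-negative throughout, and the cost does not increase from one iteration to the next.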
2.5. Recent Work in Literature - Motivation
[Sainath et al., 2012] presented an overview of a wide range of techniques in ASR that use speech exemplars (dictionaries). It was argued that noise robustness can be achieved through the use of speech dictionaries. However, most of these techniques are computationally intensive.
Since NMF is one of the methods of learning dictionaries and is easily implementable, it is of particular interest. NMF is known to learn useful time-frequency patterns in a given dataset, and has been applied to learn spectral representations in audio and speech applications that include audio source separation, supervised speech separation, speech enhancement and recognition. A few of them are mentioned below.
A regularised variant of NMF was used by [Wilson et al., 2008] to learn separate dictionaries of speech and noise. The concatenated dictionary was used in NMF to learn weights of noisy test utterances, where the weights corresponding to noise bases are suppressed to achieve speech denoising and thus enhancement. [Schuller et al., 2010] proposed supervised NMF for improving noise robustness in a spelling recognition task. NMF was performed using a predetermined $\mathbf{W}$ consisting of spectra of spelled letters. The authors showed that appending the weights to the existing features improves recognition. [Gemmeke et al., 2011] represented LMFB features of noisy speech using exemplars of speech and noise bases. A hybrid (exemplar and HMM based) recognition on the Aurora-2 task was performed to achieve noise robustness at high noise levels.
Most of the above techniques have either used the weights as new features or combined them with existing features for improved recognition. In contrast, Chapter 3 of this thesis will show that multiplying back $\mathbf{W}$ and $\mathbf{H}$ to get new LMFB features and converting them to MFCCs improves noise robustness. This approach is useful in real-time applications because of its fast implementation. A technique which learns a robust $\mathbf{W}$ will also be discussed. These methods do not assume any information about noise during training.
When there are noisy training files available, techniques such as SPLICE learn compensation from the seen noisy data to obtain their corresponding clean versions. Over the last decade, improvements using uncertainty decoding [Droppo et al., 2002], maximum mutual information based training [Droppo and Acero, 2005], speaker normalisation [Shinohara et al., 2008] etc. were introduced in SPLICE framework. There are two disadvantages of SPLICE. The algorithm fails when the test noise condition is not seen during training. Also, owing to its requirement of stereo data for training, the usage of the technique is quite restricted. So there is an interest in addressing these issues.
[Chijiiwa et al., 2012] recently proposed an adaptation framework using Eigen-SPLICE to address the problem of unseen noise conditions. The method involves preparation of quasi stereo data using the noise frames extracted from non-speech portions of the test utterances. For this, the recognition system is required to have access to some clean training utterances for performing run-time adaptation.
[Gonzalez et al., 2011] proposed a stereo-based feature compensation method, which is similar to SPLICE in certain aspects. Clean and noisy feature spaces were partitioned into vector quantised (VQ) regions. The stereo vector pairs belonging to the $i$th VQ region in the clean space and the $j$th VQ region in the noisy space are classified to the $(i, j)$th sub-region. Transformations based on the Gaussian whitening expression were estimated from every noisy sub-region to the corresponding clean sub-region. However, there is no guarantee of having enough data to estimate a full transformation matrix from each sub-region to the other.
In Chapter 4, a simple modification to SPLICE will be proposed, based on an assumption made on the correlation of training stereo data. This will be shown to give improved performance in all the noise conditions, predominantly in unseen conditions which are highly mismatched with those of training. An extension of the method to datasets that are not stereo recorded will be proposed, with minimal performance degradation as compared to conventional SPLICE. Finally, an MLLR-based run-time noise adaptation framework will be proposed, which is computationally efficient and achieves better results than MLLR-based model adaptation.
During MFCC extraction, usually 23 LMFB coefficients are converted into 13 dimensional MFCCs through DCT. Results in Table 3.1 show that performing HLDA on 39 dimensional MFCCs gives better robustness to noise than MFCCs when the noise levels are low. But this technique fails in high noise conditions. However, the objective of HLDA is not to achieve robustness, but class-separability. Here a method is proposed which aims at finding representations of speech feature vectors using building-blocks.
The method operates on LMFB feature vectors by representing them using non-negative linear combinations of their non-negative building blocks. DCT-II is applied on the new feature vectors to obtain new MFCCs. This incorporates, into the MFCC extraction process, a concept that the speech features are made up of underlying building blocks. Apart from the proposed additional step, conventional MFCC extraction framework is maintained throughout the process. The building blocks of speech are learned using NMF on speech feature vectors. Experimental results show that the new MFCC features are more robust to noise, and achieve better results when combined with the existing noise robust feature normalisation techniques like HEQ and HLDA.
3.2. The Speech Subspace
The columns of $\mathbf{W}$ can be thought of as basic building blocks that construct all the speech feature vectors, or the bases for the subspace of the speech feature vectors. So far, no mathematical proof has been derived to validate whether NMF learns the representations of the underlying data. However, much of the literature supports this. [Smaragdis, 2007] states that the basis functions describe the spectral characteristics of the input signal components, and reveal its vertical structure. [Wilson et al., 2008] refer to these building blocks as a set of spectral shapes. [Schuller et al., 2010] refer to them as spectra of events occurring in the signal, and NMF is known to learn useful time-frequency patterns in a given dataset.
One could probably use other basis learning techniques like principal component analysis (PCA) to estimate the clean subspace. Though PCA finds the directions of maximum spread of the features, these directions need not contain only speech information. In methods such as NMF, new features can be reconstructed within the subspace using different cost functions to achieve desired qualities such as source localisation, sparseness etc. KL-divergence based NMF has been successfully applied in speech applications [Wilson et al., 2008, Schuller et al., 2010] in preference to the conventional Euclidean distance measure. In addition, the non-negativity constraint is an added advantage. Figure 3.1.1 shows the bases $\mathbf{w}_1$ and $\mathbf{w}_2$ learned by NMF on two-dimensional data contained in the subspace shown. Any vector outside this subspace cannot be reconstructed perfectly using $\mathbf{w}_1$ and $\mathbf{w}_2$ when the weights acting on them are constrained to be non-negative. So when these features are moved away from the subspace due to the effect of noise, the reconstructed features can be used in place of these features as better representations of the underlying signal. However, vectors still inside the subspace are not compensated.
3.3. Learning the Speech Subspace
NMF can be performed by optimising cost functions based on different measures such as the Euclidean, KL-divergence and Itakura-Saito distances. Here the KL-divergence based method is chosen, which gives the update equations (2.4.5) and (2.4.6). The decomposition is not unique since the cost function (2.4.4) to be optimised is not convex. Depending on the application, one may choose to update both $\mathbf{W}$ and $\mathbf{H}$ in each iteration, or fix one matrix and update the other. While simultaneously refining $\mathbf{W}$ and $\mathbf{H}$, the columns of $\mathbf{W}$ may be normalised after each update so that each column sums to a fixed constant. The scaling is automatically compensated in the weight matrix $\mathbf{H}$ during its update.
Fig. (a) shows the plot of LMFB outputs of an utterance containing connected spoken digits. Using the dictionary learned from the Aurora-2 database (as described in Section 3.4.1), the reconstructions of the same utterance from each individual basis vector of $\mathbf{W}$ are plotted in Fig. (b). It can be observed that each of the reconstructions has only a set of particular dominant frequencies that are captured by the corresponding basis vector. Such a dictionary can capture useful combinations of frequencies present in speech utterances, and can give noise-robust speech reconstructions. The reconstructed feature vectors are more correlated, due to their confinement to the speech subspace, than the original ones.
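In the PCA method, the speech dictionary $\mathbf{W}_{\text{PCA}}$ can be built by stacking the most significant eigenvectors of the clean training data as columns. Any speech feature $\mathbf{v}$ can be reconstructed as
$$\hat{\mathbf{v}} = \mathbf{W}_{\text{PCA}} \mathbf{W}_{\text{PCA}}^T \mathbf{v}.$$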
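The PCA alternative can be sketched as follows. The sketch assumes that the dictionary consists of the leading eigenvectors of the sample covariance of the training features and that reconstruction is an orthogonal projection onto their span without mean removal; both choices, and the function names, are assumptions made for the illustration.

```python
import numpy as np

def pca_dictionary(V_train, rank):
    """Stack the `rank` most significant eigenvectors as columns.

    V_train: (D, N) matrix whose columns are training feature vectors.
    """
    cov = np.cov(V_train)                    # rows are treated as variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:rank]
    return eigvecs[:, order]

def pca_reconstruct(W, V):
    """Project the columns of V onto the subspace spanned by the columns of W."""
    return W @ (W.T @ V)
```

When all the eigenvectors are retained, the dictionary is a square orthogonal matrix and the reconstruction returns the input unchanged.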
3.4. Proposed Feature Extraction Methods
The speech signal is passed through the conventional short-time processing steps followed by an LMFB. Before applying the DCT, which corresponds to obtaining the conventional MFCC features, an additional step is introduced as shown in Fig. 3.4.1, so that the new MFCCs obtained after the processing are more robust to noise. The additional step is computed by two different methods as described in Sections 3.4.1 and 3.4.2, and the performances are compared in Section 4.5.2.
Short-time processing of the speech signal includes applying the STFT with a Hamming window of length 25 ms at a frame rate of 100 frames/second, followed by passing the mean-subtracted frames through a pre-emphasis filter. A conventional Mel filterbank of constant-bandwidth, equal-gain, linear triangular filters on the Mel scale is applied to get a set of filter outputs for each frame. These outputs are floored to the value 1.0, and the log operator is applied to obtain non-negative LMFB feature vectors $\mathbf{v}$.
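Let the LMFB feature vectors $\mathbf{v}_{\text{tr}}$ and $\mathbf{v}_{\text{te}}$ corresponding to the training and test speech be stacked as columns of $\mathbf{V}_{\text{tr}}$ and $\mathbf{V}_{\text{te}}$ respectively.
$\mathbf{V}_{\text{tr}}$ is decomposed as $\mathbf{V}_{\text{tr}} \approx \mathbf{W}\mathbf{H}_{\text{tr}}$ using NMF by simultaneously updating $\mathbf{W}$ and $\mathbf{H}_{\text{tr}}$ by (2.4.5) and (2.4.6). Each column of $\mathbf{W}$ is normalised after every iteration. Finally, $\mathbf{W}$ learns the building blocks of these feature vectors. During testing, $\mathbf{V}_{\text{te}}$ is approximated as the product of $\mathbf{W}$ and $\mathbf{H}_{\text{te}}$ (i.e., $\mathbf{V}_{\text{te}} \approx \mathbf{W}\mathbf{H}_{\text{te}}$). This is done by fixing the $\mathbf{W}$ obtained during training and performing the update on $\mathbf{H}_{\text{te}}$ alone using (2.4.6). The new feature vectors in training and testing are thus
$$\hat{\mathbf{V}}_{\text{tr}} = \mathbf{W}\mathbf{H}_{\text{tr}} \tag{3.4.1}$$
$$\hat{\mathbf{V}}_{\text{te}} = \mathbf{W}\mathbf{H}_{\text{te}} \tag{3.4.2}$$
respectively. In the implementation, the whole training data has been taken as $\mathbf{V}_{\text{tr}}$ to compute $\mathbf{W}$ using NMF. However, during testing, each test utterance can separately be taken as $\mathbf{V}_{\text{te}}$ and the corresponding $\mathbf{H}_{\text{te}}$ can be computed, fixing $\mathbf{W}$.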
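The test-time step, in which the dictionary is kept fixed and only the activations are updated, can be sketched as below. The function names, the random initialisation, the iteration count and the constant `eps` are assumptions made for this illustration rather than a description of the implementation used in the experiments.

```python
import numpy as np

def infer_activations(V, W, n_iter=200, seed=0, eps=1e-12):
    """Estimate non-negative H with V ~ W H, keeping the dictionary W fixed.

    V: (D, N) non-negative LMFB vectors as columns; W: (D, R) dictionary.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    denom = W.sum(axis=0)[:, None] + eps    # equals W^T times a matrix of ones
    for _ in range(n_iter):
        # Multiplicative KL update of the activations only.
        H *= (W.T @ (V / (W @ H + eps))) / denom
    return H

def project_onto_speech_subspace(V, W, **kwargs):
    """Replace each column of V by its reconstruction W h."""
    return W @ infer_activations(V, W, **kwargs)
```

The reconstructed columns would then be passed on to the DCT stage in place of the original LMFB vectors.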
Here the assumption is that many of the columns of $\mathbf{V}_{\text{te}}$ are initially outside the speech subspace due to the effect of noise. If each of these vectors outside the subspace is mapped to the nearest vector within the subspace, the new features are more noise-robust. If noise moves a clean feature vector to another vector in the same subspace, it cannot be compensated. Here the subspace is captured by $\mathbf{W}$, and the term nearest is meant in the sense of the KL-divergence. Replacement of $\mathbf{v}$ by $\mathbf{W}\mathbf{h}$ can alternatively be justified as follows. Each $\mathbf{v}$ is represented by a non-negative linear combination of building blocks of speech data, because of which any noise component in $\mathbf{v}$ cannot be reconstructed using the speech bases, and hence cannot be retained in $\mathbf{W}\mathbf{h}$.
These features are used to parameterise the acoustic models (HMMs), as explained in Section 3.6.1.
Each set of basis activation coefficients $\mathbf{h}$ (for both training and test data) is unique for the corresponding speech frame. An addition of noise changes the statistics of $\mathbf{h}$. So it is intuitive that equalising the statistics of $\mathbf{H}_{\text{te}}$ during testing, to match that of the training $\mathbf{H}_{\text{tr}}$, improves the recognition. The equalisation has to be applied during training also, to avoid mismatch of the test features against the built acoustic model. Here the intention is not to perform test feature equalisation explicitly, but to get a dictionary which helps learn test weights that are in equalised form, and thus are more robust. Figure 3.4.2 shows a method of obtaining the better dictionary $\tilde{\mathbf{W}}$ from $\mathbf{V}_{\text{tr}}$, $\mathbf{W}$ and $\mathbf{H}_{\text{tr}}$, using HEQ during training.
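After decomposing $\mathbf{V}_{\text{tr}}$ as $\mathbf{W}\mathbf{H}_{\text{tr}}$ using NMF, a set of reference histograms of $\mathbf{H}_{\text{tr}}$ is calculated, one for each of its feature components. HEQ is performed on $\mathbf{H}_{\text{tr}}$, as described in Section 3.5, to get better or more robust activations $\tilde{\mathbf{H}}_{\text{tr}}$. But $\mathbf{W}\tilde{\mathbf{H}}_{\text{tr}}$ no longer matches $\mathbf{V}_{\text{tr}}$. So a better $\tilde{\mathbf{W}}$ is estimated using the NMF update (2.4.5), i.e., by fixing $\tilde{\mathbf{H}}_{\text{tr}}$ and updating $\mathbf{W}$. This essentially solves the optimisation
$$\tilde{\mathbf{W}} = \arg\min_{\mathbf{W} \geq 0} \; D\big(\mathbf{V}_{\text{tr}} \,\|\, \mathbf{W}\tilde{\mathbf{H}}_{\text{tr}}\big)$$
In other words, the speech dictionary is chosen such that the weights become statistically equalised during training. Experimental results show that $\tilde{\mathbf{W}}$ is a better dictionary than $\mathbf{W}$ in terms of noise-robustness. The training features are thus
$$\hat{\mathbf{V}}_{\text{tr}} = \tilde{\mathbf{W}}\tilde{\mathbf{H}}_{\text{tr}} \tag{3.4.3}$$
The testing process is the same as described in Section 3.4.1, except that the new $\tilde{\mathbf{W}}$ is directly used to estimate the test data weights $\mathbf{H}_{\text{te}}$, instead of $\mathbf{W}$. Additional equalisation is not performed during testing, since the dictionary itself helps in learning equalised weights. The test features are thus
$$\hat{\mathbf{V}}_{\text{te}} = \tilde{\mathbf{W}}\mathbf{H}_{\text{te}} \tag{3.4.4}$$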
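The training-side procedure for the robust dictionary can be sketched as follows: equalise the training activations, then re-estimate the dictionary with the equalised activations held fixed. The sketch assumes that HEQ is applied per utterance against reference quantiles pooled over all training activations and that the re-estimation starts from the previously learned dictionary; these choices, like the function names, are assumptions made for the illustration.

```python
import numpy as np

EPS = 1e-12

def equalise_rows(H_utt, ref_q, probs):
    """Quantile-based HEQ of each activation component (row) of one utterance."""
    out = np.empty_like(H_utt)
    for r in range(H_utt.shape[0]):
        utt_q = np.quantile(H_utt[r], probs)
        cdf = np.interp(H_utt[r], utt_q, probs)      # empirical CDF of the utterance
        out[r] = np.interp(cdf, probs, ref_q[r])     # inverse reference CDF
    return out

def robust_dictionary(V_utts, W, H_utts, n_quantiles=100, n_iter=100):
    """Re-estimate the dictionary so that it matches equalised activations.

    V_utts, H_utts: lists of per-utterance matrices (D x N_u) and (R x N_u).
    W: dictionary (D x R) from a previous NMF of the training data.
    """
    probs = np.linspace(0.0, 1.0, n_quantiles + 1)
    H_all = np.hstack(H_utts)
    ref_q = np.quantile(H_all, probs, axis=1).T      # (R, n_quantiles + 1)
    H_eq = np.hstack([equalise_rows(H, ref_q, probs) for H in H_utts])
    V = np.hstack(V_utts)
    W_new = np.array(W, dtype=float)
    row_sums = H_eq.sum(axis=1)[None, :] + EPS       # equals ones times H_eq^T
    for _ in range(n_iter):
        # Multiplicative KL update of the dictionary, activations held fixed.
        W_new *= ((V / (W_new @ H_eq + EPS)) @ H_eq.T) / row_sums
    return W_new, H_eq
```

At test time only the new dictionary is needed, so the cost of computing the test activations is the same as with the original dictionary.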
As per convention, DCT and cepstral lifter matrices are applied on $\hat{\mathbf{V}}_{\text{tr}}$ and $\hat{\mathbf{V}}_{\text{te}}$, given by Eqs. (3.4.3) and (3.4.4), to obtain the 13-dimensional MFCC feature vectors for training and testing the HMMs. Here it can be seen that the computational cost of NMF_plain and NMF_robustW is the same during testing.
3.5. Cascading with Existing Techniques
The proposed methods can be cascaded with HEQ, where the reference histograms can be built using the training features, and HEQ can be applied on the training and test features to get new feature vectors, using which acoustic models can be built as described in Section 3.6.1.
Since the feature vectors given by Eqs. (3.4.1) and (3.4.2) are constrained to the column space of $\mathbf{W}$, there is certain additional correlation introduced in them. Their corresponding cepstral features are made approximately decorrelated through the use of the DCT. However, an additional decorrelation using HLDA is expected to further reduce the feature correlations, besides utilising the advantage of subspace projection. This also makes them more suitable for diagonal covariance modelling. The HLDA transformation matrix is estimated in the maximum likelihood (ML) framework after building the acoustic models using 39-dimensional MFCCs as described in [Kumar and Andreou, 1998], and is applied to both the feature vectors and the models in the conventional manner.
3.6. Experiments and Results
3.6.1. Experimental Setup
Aurora-2 task [Hirsch and Pearce, 2000] has been used to perform a comparative study of the proposed techniques versus the existing ones. Aurora-2 consists of connected spoken digit utterances of TIDigits database, filtered and resampled to 8 kHz, and with noises added at different SNRs. The noises are sounds recorded in places such as train station, crowd of people, restaurant, interior of car etc. The availability of both clean and noisy versions of the training speech utterances makes them stereo in nature. The test set consists of 10 sets of utterances, each with one noise environment added, and each at seven distinct SNR levels.
The acoustic word model for each digit is a left-to-right continuous density HMM with 16 states and 3 diagonal covariance Gaussian mixtures per state. For all the experiments, the 13-dimensional MFCC vectors obtained from the signal processing blocks are appended with 13 delta and 13 acceleration coefficients to get a composite 39-dimensional vector per frame. Cepstral mean subtraction (CMS) has been performed on these vectors, and the resultant feature vectors are used for building the acoustic models. HMM Toolkit (HTK) 3.4.1 [Young et al., 2009] has been used for building and testing the acoustic models.
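The construction of the composite 39-dimensional vectors can be sketched as below. The sketch assumes the commonly used regression formula for delta coefficients with a window of two frames on each side and edge replication at the utterance boundaries; the window length and the function names are assumptions made for the illustration and are not taken from the experimental configuration.

```python
import numpy as np

def deltas(features, window=2):
    """Regression-based delta coefficients. features: (n_frames, n_dims)."""
    features = np.asarray(features, dtype=float)
    n_frames = features.shape[0]
    padded = np.pad(features, ((window, window), (0, 0)), mode='edge')
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = np.zeros_like(features)
    for k in range(1, window + 1):
        ahead = padded[window + k: window + k + n_frames]
        behind = padded[window - k: window - k + n_frames]
        out += k * (ahead - behind)
    return out / denom

def composite_features(ceps):
    """Append delta and acceleration coefficients, then apply CMS."""
    d = deltas(ceps)
    a = deltas(d)
    feats = np.hstack([ceps, d, a])
    return feats - feats.mean(axis=0)   # cepstral mean subtraction per utterance
```

For 13-dimensional cepstra this gives 39-dimensional vectors whose per-utterance mean is zero in every dimension.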
NMF has been implemented using MATLAB software. The size of the speech dictionary (in other words, the dimensionality of the speech subspace) has been optimised and chosen as throughout the experiments based on recognition results. It is to be noted that during NMF, LMFB feature vectors are of dimension . While performing NMF, has been initialised from random columns of , and with random numbers in . 500 iterations of NMF are performed in all the experiments.
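The initialisation and iteration scheme described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: it uses the multiplicative updates for the generalised KL divergence from [Lee and Seung, 2000], which may differ from the update rules (2.4.5) and (2.4.6) used in the experiments, and it assumes the random initial weights are drawn uniformly from [0, 1). The names `V` (data matrix, one feature vector per column), `W` (dictionary), `H` (weights), `nmf_kl` and `kl_divergence` are illustrative.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=500, seed=0, eps=1e-12):
    """Factorise non-negative V (features x frames) as W @ H."""
    rng = np.random.default_rng(seed)
    n_frames = V.shape[1]
    # Dictionary initialised from randomly chosen columns of the data.
    cols = rng.choice(n_frames, size=rank, replace=False)
    W = V[:, cols].astype(float) + eps
    # Weights initialised with uniform random numbers (assumed range [0, 1)).
    H = rng.uniform(0.0, 1.0, size=(rank, n_frames)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(V, WH, eps=1e-12):
    """Generalised KL divergence between V and its reconstruction."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))
```

Because the updates are multiplicative, non-negative initial values stay non-negative throughout, which is why no explicit projection step is needed.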
For performing HEQ, a quantile based method has been employed, dividing the range of cdf values into 100 quantiles. In the experiments including HLDA, all 39 directions are retained, since the aim is to observe only the performance improvement after nullifying the correlations.
Table 3.1 shows the recognition accuracies of the various techniques on Aurora-2 database. Average values shown are taken over SNR levels dB. Table 3.1(a) shows the accuracies of proposed methods. Tables 3.1(b) and 3.1(c) show how the methods improve when cascaded with HLDA and HEQ respectively. It can be observed that HLDA can be used to achieve noise robustness in low noise levels only, and fails at high noise levels (SNR to dB). In the PCA method as shown in Figure 3.6.1, a dictionary of size is an orthogonal matrix which retains all the directions, thus corresponding exactly to the baseline system. It can be seen that this method hardly gives an improvement over the baseline at different dictionary sizes. NMF_plain gives absolute improvement in recognition accuracy, NMF_plain cascaded with HLDA gives , NMF_robustW gives , NMF_robustW+HLDA gives . NMF_plain+HEQ gives and NMF_robustW gives improvement.
Specifically at SNR 5 dB, the individual proposed methods NMF_plain and NMF_robustW give absolute improvements by and respectively. When combined with HEQ, the highest improvement achieved is at SNR and dB. Figure 3.6.1 shows the accuracy of using PCA as a feature processing step at different sizes of the dictionary.
The combination of NMF_plain+HEQ+HLDA achieved a recognition accuracy of . This improvement over the other proposed techniques is not commensurate with the increased computational complexity.
Methods described in Sections 3.4.1 and 3.4.2 are observed to give improvements in all noise conditions at all SNR levels, and give significantly better performances in moderate to high noise conditions. NMF_robustW+HEQ does not give improvement over NMF_plain+HEQ; both of them perform almost equally well. So when almost equalised weights are obtained through NMF_robustW, there is no advantage in performing an additional equalisation in the cepstral domain.
The proposed methods have notable advantages over other techniques in the literature. There is no need to build a separate dictionary for capturing noise characteristics, for which training audio files containing pure noise would have been required. The methods operate on each speech frame independently, and so they can even handle very short utterances, unlike HEQ where the estimate of will be poor for short utterances. The methods are seen to give an advantage when combined with other feature normalisation techniques. Finally, the proposed methods are simple, easy to implement, and are achievable without the use of some commonly used additional tools like a speech/silence detector, a pre-built CDHMM model etc.
The proposed methods have a few disadvantages. The decomposition is iterative and the step size of the update rules (2.4.5) and (2.4.6) is small. So it takes many iterations to converge, and the number of iterations increases when the size of the database is large. Dictionary estimation from a very large database is computationally expensive and is limited by the availability of memory in the computing device. Iterations have to be performed even during testing, to determine the weights .
Thus the concept of building-blocks representation of speech is incorporated into the features, still preserving the advantages of using MFCCs. Conventional HMM based recogniser has been used to test the efficacy of the proposed MFCCs against the standard MFCCs.
4.1. Review of SPLICE
As discussed in the introduction, SPLICE algorithm makes the following two assumptions:
The noisy features follow a Gaussian mixture density of modes
The conditional density is the Gaussian
where are the clean features.
Thus, and parameterise the mixture specific linear transformations on the noisy vector . Here and are independent variables, and is dependent on them. Estimate of the cleaned feature can be obtained in MMSE framework as shown in Eq. (2.4.1).
The derivation of SPLICE transformations is briefly discussed next. Let and . Using independent pairs of stereo training features and maximising the joint log-likelihood
Alternatively, sub-optimal update rules of separately estimating and can be derived by initially assuming to be identity matrix while estimating. The newly estimated is then used to estimate .
To reduce the number of parameters, a simplified model with only bias is proposed in literature [Deng et al., 2000].
A diagonal version of Eq. (4.1.6) can be written as
where runs along all components of the features and all mixtures. Since this method does not capture all the correlations, it suffers from performance degradation. This shows that noise has significant effect on feature correlations.
4.2. Proposed Modification to SPLICE
SPLICE assumes that a perfect correlation exists between clean and noisy stereo features (Eq. (4.1.5)), which makes the implementation simple [Afify et al., 2009]. But, the actual feature correlations are used to train SPLICE parameters, as seen in Eq. (4.1.9). Instead, if the training process also assumes perfect correlation and eliminates the term during parameter estimation, it complies with the assumptions and gives improved performance. This simple modification can be done as follows:
Eq. (4.1.11) can be rewritten as
where is the correlation coefficient. A perfect correlation implies . Since Eq. (4.1.5) makes this assumption, it can be enforced in the above equation to obtain
Similarly, for multidimensional case, the matrix should be enforced to be identity as per the assumption. Thus, the following is obtained:
Hence M-SPLICE and its updates are defined as
All the assumptions of conventional SPLICE are valid for M-SPLICE. Comparing both the methods, it can be seen from Eqs. (4.1.6) and (4.2.4) that while is obtained using MMSE estimation framework, is based on whitening expression. Also, involves cross-covariance term , whereas does not. The bias terms are computed in the same manner, using their respective transformation matrices, as seen in Eqs. (4.1.10) and (4.2.5). More analysis on M-SPLICE is given in Section 4.3.1.
The estimation procedure of M-SPLICE transformations is shown in Figure (a). The steps are summarised as follows:
Build noisy GMM using noisy features of stereo data. This gives and . (The non-standard term noisy mixture is used to denote a Gaussian mixture built using noisy data. Similar meanings apply to clean mixture, noisy GMM and clean GMM.)
For every noisy frame , compute the alignment w.r.t. the noisy GMM, i.e., .
Using the alignments of stereo counterparts, compute the means and covariance matrices of each clean mixture from clean data .
The testing process of M-SPLICE is exactly the same as that of conventional SPLICE, and is summarised as follows:
For each test vector , compute the alignment w.r.t. the noisy GMM, i.e., .
Compute the cleaned version as:
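The training and testing steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the whitening based transform is taken to be the product of the square root of the clean mixture covariance and the inverse square root of the noisy mixture covariance (the multidimensional analogue of setting the correlation coefficient to one), the bias is the clean mean minus the transformed noisy mean, and the cleaned vector is the posterior-weighted sum of the mixture-specific transformations. The noisy GMM itself is not fitted here; its posteriors `post` (frames x mixtures) are passed in. All function and variable names are illustrative.

```python
import numpy as np

def _sym_power(S, p):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * vals ** p) @ vecs.T

def weighted_stats(X, post):
    """Per-mixture means and full covariances from soft alignments."""
    counts = post.sum(axis=0)                       # (M,)
    means = (post.T @ X) / counts[:, None]          # (M, D)
    covs = []
    for m in range(post.shape[1]):
        diff = X - means[m]
        covs.append((post[:, m, None] * diff).T @ diff / counts[m])
    return means, np.stack(covs)

def train_msplice(clean, noisy_means, noisy_covs, post):
    """Estimate (A_m, b_m) using the noisy-GMM alignments of the stereo pairs."""
    clean_means, clean_covs = weighted_stats(clean, post)
    # Assumed whitening-based form: A_m = Sigma_x^{1/2} Sigma_y^{-1/2}.
    A = np.stack([_sym_power(cx, 0.5) @ _sym_power(cy, -0.5)
                  for cx, cy in zip(clean_covs, noisy_covs)])
    b = clean_means - np.einsum("mij,mj->mi", A, noisy_means)
    return A, b

def apply_msplice(noisy, post, A, b):
    """Posterior-weighted sum of the mixture-specific linear transforms."""
    per_mix = np.einsum("mij,nj->nmi", A, noisy) + b[None]   # (N, M, D)
    return np.einsum("nm,nmi->ni", post, per_mix)
```

Note that no cross-covariance between clean and noisy features is computed anywhere in this sketch; the stereo property is used only to transfer the noisy alignments to the clean frames.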
4.2.3. M-SPLICE with Diagonal Transformations
Techniques such as CMS, HEQ etc. operate on individual feature dimensions, assuming the features have diagonal covariance structures. This assumption is valid for MFCCs, since the use of DCT approximately decorrelates the features. Without significant loss of performance, M-SPLICE can also be extended in a similar fashion by constraining the covariance matrices and to be diagonal. Thus becomes diagonal, and Eq. (4.2.3) can be rewritten as
where . This implementation replaces the matrix multiplication in M-SPLICE by scalar product and addition operations.
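A minimal sketch of the diagonal variant, assuming that each per-dimension scale is the ratio of the clean to the noisy standard deviation of the mixture and that the bias is formed as in the full-matrix case. The inputs are per-mixture means and variances of shape (mixtures x dimensions); the function names are illustrative.

```python
import numpy as np

def train_diag_msplice(clean_means, clean_vars, noisy_means, noisy_vars):
    """Per-mixture, per-dimension scale and bias."""
    scale = np.sqrt(clean_vars / noisy_vars)         # (M, D)
    bias = clean_means - scale * noisy_means         # (M, D)
    return scale, bias

def apply_diag_msplice(noisy, post, scale, bias):
    """Only element-wise products and additions are needed per mixture."""
    per_mix = noisy[:, None, :] * scale[None] + bias[None]   # (N, M, D)
    return np.einsum("nm,nmd->nd", post, per_mix)
```

Compared with the full-matrix case, the per-mixture cost drops from a matrix-vector product to an element-wise scaling.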
4.3. Non-Stereo Extension
This section motivates and proposes the extension of M-SPLICE to datasets which are not stereo recorded. However, some noisy training utterances, which are not necessarily the stereo counterparts of the clean data, are still required.
Consider a stereo dataset of training frames . Suppose two mixture GMMs and are independently built using and respectively, and each data point is hard-clustered to the mixture giving the highest probability. The matrix , built as described below, is of interest:
where is the indicator function. In other words, while parsing the stereo training data, when a stereo pair with clean part belonging to clean mixture and noisy part to noisy mixture is encountered, the element of the matrix is incremented by unity. Thus each element of the matrix denotes the number of stereo pairs belonging to the clean noisy mixture-pair. When data are soft-assigned to all the mixtures, the matrix can instead be built as:
Figure (a) visualises such a matrix built using Aurora-2 stereo training data using mixture models. A dark spot in the plot represents a higher data count, meaning that a bulk of the stereo data points belong to that mixture-pair.
In conventional SPLICE and M-SPLICE, only the noisy GMM is built, and not . are computed for every noisy frame, and the same alignments are assumed for the clean frames while computing and . Hence , and can be considered as the parameters of a clean hypothetical GMM . Now, given these GMMs and , the matrix can be constructed, which is visualised in Fig. (b). Since the alignments are the same, and each clean mixture corresponds to the noisy mixture of the same index, a diagonal pattern can be seen.
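Both ways of building the mixture-correspondence matrix can be sketched in NumPy. The sketch assumes that rows index clean mixtures and columns index noisy mixtures, and that the soft version accumulates the product of the clean and noisy posteriors of each stereo pair; `clean_post` and `noisy_post` are hypothetical (frames x mixtures) posterior arrays obtained from the two GMMs.

```python
import numpy as np

def hard_count_matrix(clean_post, noisy_post):
    """Count stereo pairs per (clean mixture, noisy mixture) after hard clustering."""
    i = clean_post.argmax(axis=1)     # most likely clean mixture per frame
    j = noisy_post.argmax(axis=1)     # most likely noisy mixture per frame
    counts = np.zeros((clean_post.shape[1], noisy_post.shape[1]), dtype=int)
    np.add.at(counts, (i, j), 1)      # increment by unity for every stereo pair
    return counts

def soft_count_matrix(clean_post, noisy_post):
    """Soft assignment: accumulate posterior products over all stereo pairs."""
    return clean_post.T @ noisy_post
```

In both cases the entries sum to the number of stereo pairs, so the two matrices can be compared on the same scale.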
Thus, under the assumption of Eq. (4.1.5), conventional SPLICE and M-SPLICE are able to estimate transforms from noisy mixture to exactly clean mixture by maintaining the mixture-correspondence.
When stereo data are not available, such exact mixture correspondence does not exist. Fig. (a) makes this fact evident, since the stereo property was not used while building the two independent GMMs. However, a sparse structure can be seen, which suggests that for most noisy mixtures , there exists a unique clean mixture having highest mixture-correspondence. This property can be exploited to estimate piecewise linear transformations from every mixture of to a single mixture of , ignoring all other mixtures . This is the basis for the proposed extension to non-stereo data.
In the absence of stereo data, the approach is to build two separate GMMs, viz. clean and noisy, during training, such that there exists mixture-to-mixture correspondence between them, as close to Fig. (b) as possible. Then whitening based transforms can be estimated from each noisy mixture to its corresponding clean mixture. This sort of extension is not obvious in the conventional SPLICE framework, since it is not straightforward to compute the cross-covariance terms without using stereo data. Also, M-SPLICE is expected to work better than SPLICE due to its advantages described earlier.
The training approach of two mixture-corresponded GMMs is as follows:
After building the noisy GMM , it is mean adapted by estimating a global MLLR transformation using clean training data. The transformed GMM has the same covariances and weights, and only means are altered to match the clean data. By this process, the mixture correspondences are not lost.
However, the transformed GMM need not model the clean data accurately. So a few (typically three) steps of expectation maximisation (EM) are performed using clean training data, initialising with the transformed GMM. This adjusts all the parameters and gives a more accurate representation of the clean GMM .
Now, the matrix obtained through this method using Aurora-2 training data is visualised in Figure (c). It can be noted that no stereo information has been used while obtaining , following the above mentioned steps, from . It can be observed that a diagonal pattern is retained, as in the case of M-SPLICE, though there are some outliers. Since stereo information is not used, only comparable performances can be achieved. Figure (b) shows the block diagram of estimating transformations of non-stereo method. The steps are summarised as follows:
Build noisy GMM using noisy features . This gives and .
Adapt the means of noisy GMM to clean data using global MLLR transformation.
Perform at least three EM iterations to refine the adapted GMM using clean data. This gives , thus and .
The testing process is exactly the same as that of M-SPLICE, as explained in Section 4.2.2.
4.4. Additional Run-time Adaptation
To improve the performance of the proposed methods during run-time, GMM adaptation to the test condition can be done in both conventional SPLICE and M-SPLICE frameworks in a simple manner. Conventional MLLR adaptation on HMMs involves two-pass recognition, where the transformation matrices are estimated using the alignments obtained through first pass Viterbi-decoded output, and a final recognition is performed using the transformed models.
MLLR adaptation can be used to adapt GMMs in the context of SPLICE and M-SPLICE as follows:
Adapt the noisy GMM through a global MLLR mean transformation
Now, adjust the bias term in conventional SPLICE or M-SPLICE as
This method involves only a simple calculation of alignments of the test data w.r.t. the noisy GMM, and does not need Viterbi decoding. Clean mixture means computed during training need to be stored. A separate global MLLR mean transform can be estimated using test utterances belonging to each noise condition. The steps of the testing process for run-time compensation are summarised as follows:
For all test vectors belonging to a particular environment, compute the alignments w.r.t. the noisy GMM, i.e., .
Estimate a global MLLR mean transformation using , maximising the likelihood w.r.t. .
Compute the adapted noisy GMM using the estimated MLLR transform. Only the means of the noisy GMM would have been adapted as .
Using Eq. (4.4.1), recompute the bias term of SPLICE or M-SPLICE.
Compute the cleaned test vectors as
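The run-time adaptation steps above can be sketched in NumPy. The sketch makes several assumptions that should be checked against Eq. (4.4.1) and the actual GMM configuration: the MLLR statistics use diagonal variances (which gives a closed-form solution per output dimension), the alignments `post` of the test frames w.r.t. the noisy GMM are supplied by the caller, and the adjusted bias is the stored clean mean minus the transformed adapted noisy mean. All names are illustrative.

```python
import numpy as np

def estimate_global_mllr(test, post, means, variances):
    """Global MLLR mean transform W (D x (D+1)) for a diagonal-covariance GMM.

    test: (N, D) frames, post: (N, M) alignments,
    means, variances: (M, D) noisy-GMM parameters.
    """
    n_mix, dim = means.shape
    ext = np.hstack([np.ones((n_mix, 1)), means])    # extended means [1, mu]
    occ = post.sum(axis=0)                           # occupancy per mixture
    first = post.T @ test                            # sum_t gamma_m(t) o(t)
    W = np.zeros((dim, dim + 1))
    for i in range(dim):
        inv_var = 1.0 / variances[:, i]
        G = (ext * (occ * inv_var)[:, None]).T @ ext
        k = (first[:, i] * inv_var) @ ext
        W[i] = np.linalg.solve(G, k)                 # needs at least D+1 distinct means
    return W

def adapt_means(means, W):
    """Apply the transform to every mixture mean; covariances and weights stay fixed."""
    ext = np.hstack([np.ones((means.shape[0], 1)), means])
    return ext @ W.T

def recompute_bias(clean_means, A, adapted_noisy_means):
    """Assumed form of the adjusted bias: b_m = mu_x,m - A_m * adapted mu_y,m."""
    return clean_means - np.einsum("mij,mj->mi", A, adapted_noisy_means)
```

Only one pass over the test data of an environment is needed to accumulate `post`, which is why no Viterbi decoding is involved.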
4.5. Experiments and Results
4.5.1. Experimental Setup
All SPLICE based linear transformations have been applied on 13 dimensional MFCCs, including . The Aurora-2 setup is the same as described in Section 3.6.1. During HMM training, the features are appended with 13 delta and 13 acceleration coefficients to get a composite 39 dimensional vector per frame. Cepstral mean subtraction (CMS) has been performed in all the experiments. 128 mixture GMMs are built for all SPLICE based experiments. Run-time noise adaptation in SPLICE framework is performed on 13 dimensional MFCCs. Data belonging to each SNR level of a test noise condition has been separately used to compute the global transformations. In all SPLICE based experiments, pseudo-cleaning of clean features has been performed.
To test the efficacy of non-stereo method on a database which does not contain stereo data, Aurora-4 task of 8 kHz sampling frequency has been used. Aurora-4 is a continuous speech recognition task with clean and noisy training utterances (non-stereo) and test utterances of 14 environments. Aurora-4 acoustic models are built using crossword triphone HMMs of 3 states and 6 mixtures per state. Standard WSJ0 bigram language model has been used during decoding of Aurora-4. Noisy GMM of 512 mixtures is built for evaluating non-stereo method, using 7138 utterances taken from both clean and multi-training data. This GMM is adapted to standard clean training set to get the clean GMM.
Table (a) summarises the results of various algorithms discussed, on Aurora-2 dataset. All the results are shown in % accuracy. All SNR levels mentioned are in decibels. The first seven rows report the overall results on all 10 test noise conditions. The rest of the rows report the average values in the SNR range dB. Table (b) shows the results of run-time adaptation (indicated as RA) using various methods. For reference, the result of standard MLLR adaptation on HMMs [Gales, 1998] has been shown in Table (b), which computes a global 39 dimensional mean transformation, and uses two-pass Viterbi decoding. Table 4.2 shows the experimental results on Aurora-4 database. Table 4.2(a) shows the results of non-stereo method on Aurora-4 database using clean-trained HMMs. Table 4.2(b) shows the similar results for multi-trained HMMs, using the standard multi-training dataset.
It can be seen that M-SPLICE improves over SPLICE at all noise conditions and SNR levels and gives an absolute improvement of in test-set C and overall. Run-time compensation in SPLICE framework gives improvements over standard MLLR in test-sets A and B, whereas M-SPLICE gives improvements in all conditions. Here absolute improvement can be observed over SPLICE with run-time noise adaptation, and over standard MLLR. Finally, non-stereo method, though not using stereo data, shows and absolute improvements over Aurora-2 and Aurora-4 clean baseline models respectively, and a slight degradation w.r.t. SPLICE in all test cases. Run-time noise adaptation results of non-stereo method are comparable to that of standard MLLR, and are computationally less expensive. It can be observed that non-stereo method gives performance similar to that of multi-condition training.
In terms of computational cost, M-SPLICE and the non-stereo method are identical to conventional SPLICE during testing. Also, there is almost negligible increase in cost during training. The MLLR mean adaptation in both non-stereo method and run-time adaptation is computationally very efficient, and does not need Viterbi decoding. The diagonal versions of the proposed methods give comparable performances.
In terms of performance, M-SPLICE is able to achieve good results in all cases without any use of adaptation data, especially in unseen cases. In non-stereo method, one-to-one mixture correspondence is assumed between noisy and clean GMMs. The method gives slight degradation in performance. This could be attributed to neglecting the outlier data.
Comparing with other existing feature normalisation techniques, the techniques in SPLICE framework operate on individual feature vectors, and no estimation of parameters is required from test data. So these methods do not suffer from test data insufficiency problems, and are advantageous for shorter utterances. Also, the testing process is usually faster, and the methods are easily implementable in real-time applications. So by extending the methods to non-stereo data, we believe that they become more useful in many applications.
- [Afify et al., 2009] Afify, M., Cui, X., and Gao, Y. (2009). Stereo-based stochastic mapping for robust speech recognition. IEEE Trans. on Audio, Speech and Lang. Proc., 17(7):1325–1334.
- [Chijiiwa et al., 2012] Chijiiwa, K., Suzuki, M., Minematsu, N., and Hirose, K. (2012). Unseen noise robust speech recognition using adaptive piecewise linear transformation. In Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pages 4289–4292.
- [Deng et al., 2000] Deng, L., Acero, A., Plumpe, M., and Huang, X. (2000). Large-vocabulary speech recognition under adverse acoustic environments. In International Conference on Spoken Language Processing, pages 806–809.
- [Droppo and Acero, 2005] Droppo, J. and Acero, A. (2005). Maximum mutual information SPLICE transform for seen and unseen conditions. In INTERSPEECH, pages 989–992.
- [Droppo et al., 2002] Droppo, J., Acero, A., and Deng, L. (2002). Uncertainty decoding with SPLICE for noise robust speech recognition. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, pages I–57. IEEE.
- [Gales, 1998] Gales, M. J. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75–98.
- [Gemmeke et al., 2011] Gemmeke, J. F., Virtanen, T., and Hurmalainen, A. (2011). Exemplar-based sparse representations for noise robust automatic speech recognition. IEEE Transactions on Audio, Speech & Language Processing, 19(7):2067–2080.
- [Gonzalez et al., 2011] Gonzalez, J., Peinado, A., Gomez, A., and Carmona, J. (2011). Efficient MMSE estimation and uncertainty processing for multienvironment robust speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 19(5):1206–1220.
- [Hilger and Ney, 2006] Hilger, F. and Ney, H. (2006). Quantile based histogram equalization for noise robust large vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 14(3):845–854.
- [Hirsch and Pearce, 2000] Hirsch, H.-G. and Pearce, D. (2000). The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In Automatic Speech Recognition: Challenges for the new Millenium ISCA Tutorial and Research Workshop (ITRW).
- [Kumar, 1997] Kumar, N. (1997). Investigation of Silicon Auditory Models and Generalization of Linear Discriminant Analysis for Improved Speech Recognition. PhD thesis, Johns Hopkins University.
- [Kumar and Andreou, 1998] Kumar, N. and Andreou, A. G. (1998). Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Communication, 26(4):283–297.
- [Lee and Seung, 1999] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.
- [Lee and Seung, 2000] Lee, D. D. and Seung, H. S. (2000). Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press.
- [Legetter, 1995] Legetter, C. J. (1995). Improved acoustic modeling for HMMs using linear transformation. PhD thesis, University of Cambridge.
- [Rabiner and Schafer, 2010] Rabiner, L. and Schafer, R. (2010). Theory and Applications of Digital Speech Processing. Prentice Hall Press, Upper Saddle River, NJ, USA, 1st edition.
- [Sainath et al., 2012] Sainath, T., Ramabhadran, B., Nahamoo, D., Kanevsky, D., Van Compernolle, D., Demuynck, K., Gemmeke, J., Bellegarda, J., and Sundaram, S. (2012). Exemplar-based processing for speech recognition: An overview. IEEE Signal Processing Magazine, 29(6):98–113.
- [Schuller et al., 2010] Schuller, B., Weninger, F., Wollmer, M., Sun, Y., and Rigoll, G. (2010). Non-negative matrix factorization as noise-robust feature extractor for speech recognition. In Acoustics Speech and Signal Processing (ICASSP), IEEE International Conference on, pages 4562–4565.
- [Shinohara et al., 2008] Shinohara, Y., Masuko, T., and Akamine, M. (2008). Feature enhancement by speaker-normalized SPLICE for robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pages 4881–4884. IEEE.
- [Smaragdis, 2007] Smaragdis, P. (2007). Convolutive speech bases and their application to supervised speech separation. Audio, Speech, and Language Processing, IEEE Transactions on, 15(1):1–12.
- [Wilson et al., 2008] Wilson, K., Raj, B., Smaragdis, P., and Divakaran, A. (2008). Speech denoising using nonnegative matrix factorization with priors. In Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pages 4029–4032.
- [Young et al., 2009] Young, S. et al. (2009). The HTK book (revised for HTK version 3.4.1). Cambridge University.