1 Introduction
The goal of robust speech recognition is to build systems that work under different noisy environmental conditions. Due to the acoustic mismatch between training and test conditions, performance degrades in noisy environments. Model adaptation and feature compensation are two classes of techniques that address this problem. The former methods adapt the trained models to match the environment, while the latter compensate the noisy features (or the clean features, or both) so that they have similar characteristics.
Stereo-based piecewise linear compensation for environments (SPLICE) is a popular and efficient noise-robust feature enhancement technique. It partitions the noisy feature space into $M$ classes, and learns a linear-transformation-based noise compensation for each partition class during training, using stereo data. Any test vector $\mathbf{y}$ is soft-assigned to one or more classes by computing $P(k \mid \mathbf{y})$, and is compensated by applying the weighted combination of linear transformations to get the cleaned version $\hat{\mathbf{x}}$:

$$\hat{\mathbf{x}} = \sum_{k=1}^{M} P(k \mid \mathbf{y}) \left( \mathbf{A}_k \mathbf{y} + \mathbf{b}_k \right) \quad (1)$$
$\mathbf{A}_k$ and $\mathbf{b}_k$ are estimated during training using stereo data. The training noisy vectors $\mathbf{y}_n$ are modelled using a Gaussian mixture model (GMM) of $M$ mixtures, and $P(k \mid \mathbf{y})$ is calculated for a test vector as a set of posterior probabilities w.r.t. this GMM. Thus the partition class is decided by the mixture assignments $P(k \mid \mathbf{y})$.

Over the last decade, techniques such as maximum mutual information based training [1], speaker normalisation [2] and uncertainty decoding [3] were introduced in the SPLICE framework. SPLICE has two disadvantages. First, the algorithm fails when the test noise condition is not seen during training. Second, owing to its requirement of stereo data for training, the usage of the technique is quite restricted. Hence there is interest in addressing these issues.
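To make the compensation step concrete, the soft assignment and weighted linear transformation of Eq. (1) can be sketched as follows. This is a minimal NumPy/SciPy illustration under our own naming; the GMM parameters and transforms shown are placeholders, not values from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def splice_enhance(y, weights, means, covs, A, b):
    """Compensate one noisy vector y via Eq. (1).

    weights, means, covs : parameters of the noisy GMM (M mixtures)
    A, b                 : per-mixture linear transforms (M matrices / biases)
    """
    # Posterior alignment P(k | y) w.r.t. the noisy GMM
    lik = np.array([w * multivariate_normal.pdf(y, m, c)
                    for w, m, c in zip(weights, means, covs)])
    post = lik / lik.sum()
    # Weighted combination of mixture-specific linear transformations
    return sum(p * (Ak @ y + bk) for p, Ak, bk in zip(post, A, b))
```

With identity transforms and zero biases the output reduces to the input, since the posteriors sum to one; the per-mixture transforms are what actually move the noisy vector towards its clean counterpart.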
In a recent work [4], an adaptation framework using EigenSPLICE was proposed to address the problem of unseen noise conditions. The method involves the preparation of quasi-stereo data using noise frames extracted from the non-speech portions of the test utterances. For this, the recognition system is required to have access to some clean training utterances for performing runtime adaptation.
In [5], a stereo-based feature compensation method was proposed. The clean and noisy feature spaces were partitioned into vector quantised (VQ) regions, and the stereo vector pairs belonging to the $i^{th}$ VQ region in the clean space and the $j^{th}$ VQ region in the noisy space were classified to the $(i,j)^{th}$ sub-region. Transformations based on a Gaussian whitening expression were estimated from every noisy sub-region to every clean sub-region. But it is not always guaranteed that there is enough data to estimate a full transformation matrix from each sub-region to another.

In this paper, we propose a simple modification based on an assumption made by SPLICE on the correlation of the training stereo data, which improves the performance in unseen noise conditions. This method does not need any adaptation data, in contrast to the recent work proposed in the literature [4]. We call this method modified SPLICE (MSPLICE). We also extend MSPLICE to work on datasets that are not stereo recorded, with minimal performance degradation as compared to conventional SPLICE. Finally, we use an MLLR based runtime noise adaptation framework, which is computationally efficient and achieves better results than MLLR HMM adaptation. This method operates on 13 dimensional MFCCs and does not require two-pass Viterbi decoding, in contrast to conventional MLLR performed on 39 dimensions.
The rest of the paper is organised as follows. A review of SPLICE is given in Section 2; the proposed modification to SPLICE is presented in Section 3; the extension to non-stereo datasets is explained in Section 4; runtime noise adaptation is described in Section 5; experiments and results are presented in Section 6; a detailed discussion and comparison of the existing and proposed techniques is given in Section 7; and the paper is concluded in Section 8, indicating possible future extensions.
2 Review of SPLICE
As discussed in the introduction, the SPLICE algorithm makes the following two assumptions:

1. The noisy features $\mathbf{y}$ follow a Gaussian mixture density of $M$ modes:

$$p(\mathbf{y}) = \sum_{k=1}^{M} P(k)\, \mathcal{N}\!\left( \mathbf{y};\, \boldsymbol{\mu}_{y,k}, \boldsymbol{\Sigma}_{y,k} \right) \quad (2)$$

2. The conditional density $p(\mathbf{x} \mid \mathbf{y}, k)$ is the Gaussian

$$p(\mathbf{x} \mid \mathbf{y}, k) = \mathcal{N}\!\left( \mathbf{x};\, \mathbf{A}_k \mathbf{y} + \mathbf{b}_k, \boldsymbol{\Gamma}_k \right) \quad (3)$$

where $\mathbf{x}$ are the clean features.

Thus, $\mathbf{A}_k$ and $\mathbf{b}_k$ parameterise the mixture-specific linear transformations on the noisy vector $\mathbf{y}$. Here $\mathbf{y}$ and $k$ are independent variables, and $\mathbf{x}$ is dependent on them. An estimate $\hat{\mathbf{x}}$ of the cleaned feature can be obtained in the MMSE framework, as shown in Eq. (1).
The derivation of the SPLICE transformations is briefly discussed next. Let $\mathbf{W}_k = [\mathbf{A}_k \;\; \mathbf{b}_k]$ and $\bar{\mathbf{y}} = [\mathbf{y}^T \; 1]^T$. Using $N$ independent pairs of stereo training features $\{(\mathbf{x}_n, \mathbf{y}_n)\}$ and maximising the joint log-likelihood

$$\sum_{n=1}^{N} \log p(\mathbf{x}_n, \mathbf{y}_n) \quad (4)$$

yields

$$\mathbf{W}_k = \left( \sum_{n=1}^{N} P(k \mid \mathbf{y}_n)\, \mathbf{x}_n \bar{\mathbf{y}}_n^T \right) \left( \sum_{n=1}^{N} P(k \mid \mathbf{y}_n)\, \bar{\mathbf{y}}_n \bar{\mathbf{y}}_n^T \right)^{-1} \quad (5)$$

Alternatively, suboptimal update rules of separately estimating $\mathbf{A}_k$ and $\mathbf{b}_k$ can be derived by initially assuming $\mathbf{A}_k$ to be the identity matrix while estimating $\mathbf{b}_k$, and then using this $\mathbf{b}_k$ to estimate $\mathbf{A}_k$. A perfect correlation between $\mathbf{x}$ and $\mathbf{y}$ is assumed, and the following approximation is used in deriving Eq. (5) [6]:

$$P(k \mid \mathbf{x}_n, \mathbf{y}_n) \approx P(k \mid \mathbf{y}_n) \quad (6)$$
Given the mixture index $k$, Eq. (5) can be shown to give the MMSE estimator of $\mathbf{x}$ [7], given by

$$\hat{\mathbf{x}}_k = \boldsymbol{\mu}_{x,k} + \boldsymbol{\Sigma}_{xy,k} \boldsymbol{\Sigma}_{y,k}^{-1} \left( \mathbf{y} - \boldsymbol{\mu}_{y,k} \right) \quad (7)$$

where

$$\boldsymbol{\mu}_{x,k} = \frac{\sum_n P(k \mid \mathbf{y}_n)\, \mathbf{x}_n}{\sum_n P(k \mid \mathbf{y}_n)} \quad (8)$$

$$\boldsymbol{\Sigma}_{xy,k} = \frac{\sum_n P(k \mid \mathbf{y}_n)\, (\mathbf{x}_n - \boldsymbol{\mu}_{x,k})(\mathbf{y}_n - \boldsymbol{\mu}_{y,k})^T}{\sum_n P(k \mid \mathbf{y}_n)} \quad (9)$$

i.e., the alignments $P(k \mid \mathbf{y}_n)$ are being used in place of $P(k \mid \mathbf{x}_n)$ and $P(k \mid \mathbf{x}_n, \mathbf{y}_n)$ in Eqs. (8) and (9) respectively. Thus from (7),

$$\mathbf{A}_k = \boldsymbol{\Sigma}_{xy,k} \boldsymbol{\Sigma}_{y,k}^{-1} \quad (10)$$

$$\mathbf{b}_k = \boldsymbol{\mu}_{x,k} - \mathbf{A}_k \boldsymbol{\mu}_{y,k} \quad (11)$$
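The estimation of the per-mixture statistics and transforms via Eqs. (8)–(11) can be sketched as follows. This is an illustrative NumPy implementation under our own naming, assuming the alignments w.r.t. the noisy GMM have already been computed:

```python
import numpy as np

def splice_train(X, Y, post):
    """Estimate per-mixture SPLICE transforms from stereo data.

    X, Y : (N, D) arrays of clean and noisy stereo features
    post : (N, M) alignments P(k | y_n) w.r.t. the noisy GMM
    Returns A (M, D, D) and b (M, D) as in Eqs. (10) and (11).
    """
    N, D = X.shape
    M = post.shape[1]
    A = np.zeros((M, D, D))
    b = np.zeros((M, D))
    for k in range(M):
        g = post[:, k]                         # P(k | y_n)
        Z = g.sum()
        mu_x = g @ X / Z                       # Eq. (8)
        mu_y = g @ Y / Z
        Xc, Yc = X - mu_x, Y - mu_y
        S_xy = (g[:, None] * Xc).T @ Yc / Z    # Eq. (9)
        S_yy = (g[:, None] * Yc).T @ Yc / Z
        A[k] = S_xy @ np.linalg.inv(S_yy)      # Eq. (10)
        b[k] = mu_x - A[k] @ mu_y              # Eq. (11)
    return A, b
```

If the stereo pairs are related by an exact linear map, the estimated transform recovers that map, since the cross-covariance then factors through the noisy covariance.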
To reduce the number of parameters, a simplified model with only the bias $\mathbf{b}_k$ has been proposed in the literature [7].

A diagonal version of Eq. (7) can be written as

$$\hat{a}^{(d)}_k = \frac{\sigma^{(d)}_{xy,k}}{\left( \sigma^{(d)}_{y,k} \right)^{2}} \quad (12)$$

where $d$ runs along all the components of the features and all the mixtures. Since this method does not capture all the correlations, it suffers from performance degradation. This shows that noise has a significant effect on the feature correlations.
3 Proposed Modification to SPLICE
SPLICE assumes that a perfect correlation exists between the clean and noisy stereo features (Eq. (6)), which makes the implementation simple [6]. But the actual feature correlations are used to train the SPLICE parameters, as seen in Eq. (10). Instead, if the training process also assumes perfect correlation and eliminates the cross-covariance term during parameter estimation, it complies with the assumptions and gives improved performance. This simple modification can be made as follows.
Eq. (12) can be rewritten as

$$\hat{a}^{(d)}_k = \rho^{(d)}_k \, \frac{\sigma^{(d)}_{x,k}}{\sigma^{(d)}_{y,k}}$$

where $\rho^{(d)}_k = \sigma^{(d)}_{xy,k} \big/ \big( \sigma^{(d)}_{x,k} \sigma^{(d)}_{y,k} \big)$ is the correlation coefficient. A perfect correlation implies $\rho^{(d)}_k = 1$. Since Eq. (6) makes this assumption, we enforce it in the above equation and obtain

$$\hat{a}^{(d)}_k = \frac{\sigma^{(d)}_{x,k}}{\sigma^{(d)}_{y,k}}$$

Similarly, for the multidimensional case, the matrix $\boldsymbol{\Sigma}_{x,k}^{-\frac{1}{2}} \boldsymbol{\Sigma}_{xy,k} \boldsymbol{\Sigma}_{y,k}^{-\frac{1}{2}}$ should be enforced to be identity as per the assumption. Thus, we obtain

$$\boldsymbol{\Sigma}_{xy,k} = \boldsymbol{\Sigma}_{x,k}^{\frac{1}{2}} \boldsymbol{\Sigma}_{y,k}^{\frac{1}{2}} \quad (13)$$
Hence MSPLICE and its updates are defined as

$$\hat{\mathbf{x}} = \sum_{k=1}^{M} P(k \mid \mathbf{y}) \left( \bar{\mathbf{A}}_k \mathbf{y} + \bar{\mathbf{b}}_k \right) \quad (14)$$

$$\bar{\mathbf{A}}_k = \boldsymbol{\Sigma}_{x,k}^{\frac{1}{2}} \boldsymbol{\Sigma}_{y,k}^{-\frac{1}{2}} \quad (15)$$

$$\bar{\mathbf{b}}_k = \boldsymbol{\mu}_{x,k} - \bar{\mathbf{A}}_k \boldsymbol{\mu}_{y,k} \quad (16)$$
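As a sketch, the whitening-based transform of Eqs. (15) and (16) can be computed with symmetric matrix square roots obtained by eigendecomposition. This is an illustrative implementation under our own naming, assuming symmetric positive-definite covariances:

```python
import numpy as np

def sym_sqrt(S, inv=False):
    """Symmetric (inverse) square root of an SPD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    p = -0.5 if inv else 0.5
    return (vecs * vals**p) @ vecs.T

def msplice_transform(mu_x, cov_x, mu_y, cov_y):
    """Per-mixture MSPLICE transform, Eqs. (15) and (16)."""
    A_bar = sym_sqrt(cov_x) @ sym_sqrt(cov_y, inv=True)  # Eq. (15): whitening
    b_bar = mu_x - A_bar @ mu_y                          # Eq. (16)
    return A_bar, b_bar
```

No cross-covariance between clean and noisy features enters the computation, which is what allows the same expression to be reused in the non-stereo extension later.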
All the assumptions of conventional SPLICE remain valid for MSPLICE. Comparing the two methods, it can be seen from Eqs. (7) and (15) that while $\mathbf{A}_k$ is obtained in the MMSE estimation framework, $\bar{\mathbf{A}}_k$ is based on a whitening expression. Also, $\mathbf{A}_k$ involves the cross-covariance term $\boldsymbol{\Sigma}_{xy,k}$, whereas $\bar{\mathbf{A}}_k$ does not. The bias terms are computed in the same manner, using their respective transformation matrices, as seen in Eqs. (11) and (16). More analysis of MSPLICE is given in Section 4.1.
3.1 Training
The estimation procedure for the MSPLICE transformations is shown in Figure (a). The steps are summarised as follows:

Build the noisy GMM using the noisy features $\mathbf{y}_n$ of the stereo data. This gives $\boldsymbol{\mu}_{y,k}$ and $\boldsymbol{\Sigma}_{y,k}$. (We use the term noisy mixture to denote a Gaussian mixture built using noisy data; similar meanings apply to clean mixture, noisy GMM and clean GMM.)

For every noisy training frame $\mathbf{y}_n$, compute the alignments w.r.t. the noisy GMM, i.e., $P(k \mid \mathbf{y}_n)$.

Using the alignments of their stereo counterparts, compute the means $\boldsymbol{\mu}_{x,k}$ and covariance matrices $\boldsymbol{\Sigma}_{x,k}$ of each clean mixture from the clean data $\mathbf{x}_n$. The transforms then follow from Eqs. (15) and (16).
3.2 Testing
The testing process of MSPLICE is exactly the same as that of conventional SPLICE, and is summarised as follows:

For each test vector $\mathbf{y}$, compute the alignments w.r.t. the noisy GMM, i.e., $P(k \mid \mathbf{y})$.

Compute the cleaned version as in Eq. (14): $\hat{\mathbf{x}} = \sum_{k} P(k \mid \mathbf{y}) \left( \bar{\mathbf{A}}_k \mathbf{y} + \bar{\mathbf{b}}_k \right)$.
4 NonStereo Extension
In this section, we motivate how MSPLICE can be extended to datasets which are not stereo recorded. However, some noisy training utterances, which are not necessarily the stereo counterparts of the clean data, are required.
4.1 Motivation
Consider a stereo dataset of $N$ training frame pairs $\{(\mathbf{x}_n, \mathbf{y}_n)\}$. Suppose two $M$-mixture GMMs $\mathcal{G}_x$ and $\mathcal{G}_y$ are independently built using $\{\mathbf{x}_n\}$ and $\{\mathbf{y}_n\}$ respectively, and each data point is hard-clustered to the mixture giving the highest probability. We are interested in analysing an $M \times M$ matrix $\mathbf{S}$, built as

$$\mathbf{S}(i,j) = \sum_{n=1}^{N} \mathbb{1}\!\left[ \mathbf{x}_n \in \text{mixture } i \text{ of } \mathcal{G}_x \right] \, \mathbb{1}\!\left[ \mathbf{y}_n \in \text{mixture } j \text{ of } \mathcal{G}_y \right]$$

where $\mathbb{1}[\cdot]$ is the indicator function. In other words, while parsing the stereo training data, when a stereo pair with clean part belonging to clean mixture $i$ and noisy part belonging to noisy mixture $j$ is encountered, the $(i,j)^{th}$ element of the matrix is incremented by unity. Thus each element of the matrix denotes the number of stereo pairs belonging to that clean-noisy mixture-pair. When the data are soft-assigned to all the mixtures, the matrix can instead be built as:

$$\mathbf{S}(i,j) = \sum_{n=1}^{N} P(i \mid \mathbf{x}_n)\, P(j \mid \mathbf{y}_n)$$
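In the soft-assignment case, the co-occurrence matrix is just a product of the two posterior matrices, as the following sketch shows (illustrative code; the array layout and names are our own assumptions):

```python
import numpy as np

def mixture_cooccurrence(post_x, post_y):
    """Soft co-occurrence matrix S between clean and noisy GMM mixtures.

    post_x : (N, M) alignments P(i | x_n) w.r.t. the clean GMM
    post_y : (N, M) alignments P(j | y_n) w.r.t. the noisy GMM
    S[i, j] accumulates sum_n P(i | x_n) * P(j | y_n).
    """
    return post_x.T @ post_y
```

With hard (one-hot) posteriors this reduces to the counting definition above, and the entries of the matrix sum to the number of frame pairs.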
Figure (a) visualises such a matrix built using the Aurora-2 stereo training data and $M$-mixture models. A dark spot in the plot represents a higher data count, i.e., a bulk of the stereo data points belong to that mixture-pair.
In conventional SPLICE and MSPLICE, only the noisy GMM $\mathcal{G}_y$ is built, and not $\mathcal{G}_x$. The alignments $P(k \mid \mathbf{y}_n)$ are computed for every noisy frame, and the same alignments are assumed for the clean frames while computing $\boldsymbol{\mu}_{x,k}$ and $\boldsymbol{\Sigma}_{x,k}$. Hence $\boldsymbol{\mu}_{x,k}$ and $\boldsymbol{\Sigma}_{x,k}$ can be considered as the parameters of a hypothetical clean GMM $\mathcal{G}_x$. Now, given these GMMs $\mathcal{G}_y$ and $\mathcal{G}_x$, the matrix $\mathbf{S}$ can be constructed, which is visualised in Figure (b). Since the alignments are the same, the $k^{th}$ clean mixture corresponds to the $k^{th}$ noisy mixture, and a diagonal pattern can be seen.
Thus, under the assumption of Eq. (6), conventional SPLICE and MSPLICE are able to estimate transforms from the $k^{th}$ noisy mixture to exactly the $k^{th}$ clean mixture by maintaining the mixture-correspondence.
When stereo data are not available, such exact mixture correspondence does not exist. Figure (a) makes this fact evident, since the stereo property was not used while building the two independent GMMs. However, a sparse structure can be seen, which suggests that for most noisy mixtures $j$, there exists a unique clean mixture $i$ having the highest mixture-correspondence. This property can be exploited to estimate piecewise linear transformations from every mixture $j$ of $\mathcal{G}_y$ to a single mixture $i$ of $\mathcal{G}_x$, ignoring all the other mixtures. This is the basis for the proposed extension to non-stereo data.
4.2 Implementation
In the absence of stereo data, our approach is to build two separate GMMs, viz. clean and noisy, during training, such that there exists a mixture-to-mixture correspondence between them, as close to Figure (b) as possible. Then whitening-based transforms can be estimated from each noisy mixture to its corresponding clean mixture. Such an extension is not obvious in the conventional SPLICE framework, since it is not straightforward to compute the cross-covariance terms without using stereo data. Also, MSPLICE is expected to work better than SPLICE owing to its advantages described earlier.
The training approach of two mixturecorresponded GMMs is as follows:

After building the noisy GMM $\mathcal{G}_y$, it is mean-adapted by estimating a global MLLR transformation using the clean training data. The transformed GMM has the same covariances and weights; only the means are altered to match the clean data. By this process, the mixture correspondences are not lost.

However, the transformed GMM need not model the clean data accurately. So a few iterations of expectation maximisation (EM) are performed on the clean training data, initialised with the transformed GMM. This adjusts all the parameters and gives a more accurate representation of the clean GMM $\mathcal{G}_x$.
The matrix $\mathbf{S}$ obtained through this method using the Aurora-2 training data is visualised in Figure (c). It can be noted that no stereo information has been used while obtaining $\mathcal{G}_x$ from $\mathcal{G}_y$ by the above steps. It can be observed that a diagonal pattern is retained, as in the case of MSPLICE, though there are some outliers. Since stereo information is not used, only comparable performances can be achieved. Figure (b) shows the block diagram of the transform estimation for the non-stereo method. The steps are summarised as follows:
Build the noisy GMM $\mathcal{G}_y$ using the noisy features $\mathbf{y}_n$. This gives $\boldsymbol{\mu}_{y,k}$ and $\boldsymbol{\Sigma}_{y,k}$.

Adapt the means of the noisy GMM to the clean data using a global MLLR transformation.

Perform at least three EM iterations to refine the adapted GMM using the clean data. This gives $\mathcal{G}_x$, and thus $\boldsymbol{\mu}_{x,k}$ and $\boldsymbol{\Sigma}_{x,k}$.
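The global MLLR mean transformation used above can be sketched as follows, under the additional assumption of diagonal covariances, for which the estimation has a closed-form row-by-row solution. This is an illustrative implementation; the variable names, the extended-mean convention and the diagonal-covariance restriction are our own assumptions, not the paper's:

```python
import numpy as np

def mllr_mean_transform(X, post, means, vars_):
    """Global MLLR mean transform W for a diagonal-covariance GMM.

    X     : (N, D) adaptation data
    post  : (N, M) alignments gamma_nk of the data w.r.t. the GMM
    means : (M, D) mixture means, vars_ : (M, D) diagonal variances
    Returns W of shape (D, D+1) such that the adapted mean of
    mixture k is W @ [1, mean_k].
    """
    N, D = X.shape
    M = means.shape[0]
    xi = np.hstack([np.ones((M, 1)), means])   # extended means, (M, D+1)
    gam = post.sum(axis=0)                     # per-mixture occupancy, (M,)
    gx = post.T @ X                            # sum_n gamma_nk x_n, (M, D)
    W = np.zeros((D, D + 1))
    for d in range(D):
        # Normal equations of the weighted least-squares problem, row d
        G = (xi * (gam / vars_[:, d])[:, None]).T @ xi
        z = (gx[:, d] / vars_[:, d]) @ xi
        W[d] = np.linalg.solve(G, z)
    return W
```

Because only the means are moved, the mixture weights, covariances, and hence the mixture correspondences are preserved, which is exactly the property the non-stereo extension relies on.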
The testing process is exactly the same as that of MSPLICE, as explained in Section 3.2.
5 Additional Runtime Adaptation
To improve the performance of the proposed methods during runtime, GMM adaptation to the test condition can be done in both the conventional SPLICE and MSPLICE frameworks in a simple manner. Conventional MLLR adaptation on HMMs involves two-pass recognition, where the transformation matrices are estimated using the alignments obtained through the first-pass Viterbi-decoded output, and a final recognition is performed using the transformed models.
MLLR adaptation can be used to adapt GMMs in the context of SPLICE and MSPLICE as follows:

Adapt the noisy GMM through a global MLLR mean transformation.

Now, adjust the bias term in conventional SPLICE or MSPLICE as

$$\bar{\mathbf{b}}_k = \boldsymbol{\mu}_{x,k} - \bar{\mathbf{A}}_k \hat{\boldsymbol{\mu}}_{y,k} \quad (17)$$

where $\hat{\boldsymbol{\mu}}_{y,k}$ are the MLLR-adapted means of the noisy GMM.
This method involves only the simple calculation of the alignments of the test data w.r.t. the noisy GMM, and does not need Viterbi decoding. The clean mixture means $\boldsymbol{\mu}_{x,k}$ computed during training need to be stored. A separate global MLLR mean transform can be estimated using the test utterances belonging to each noise condition. The steps of the testing process with runtime compensation are summarised as follows:


Table 2: Recognition accuracies (%) on the Aurora-4 database. The Average column is computed over all 14 test sets (both microphones).

|                   | Mic  | Clean | Car   | Babble | Street | Restaurant | Airport | Station | Average |
| Baseline          | Mic1 | 87.63 | 75.58 | 52.77  | 52.83  | 46.53      | 56.38   | 45.30   | 54.73   |
|                   | Mic2 | 77.40 | 64.39 | 45.15  | 42.03  | 36.26      | 47.69   | 36.32   |         |
| Non-Stereo Method | Mic1 | 86.85 | 77.71 | 62.62  | 58.96  | 55.93      | 61.95   | 55.37   | 61.66   |
|                   | Mic2 | 79.10 | 68.58 | 55.24  | 51.67  | 45.88      | 55.45   | 47.88   |         |

For all the test vectors belonging to a particular environment, compute the alignments w.r.t. the noisy GMM, i.e., $P(k \mid \mathbf{y})$.

Estimate a global MLLR mean transformation using these alignments, maximising the likelihood of the test data w.r.t. the noisy GMM.

Compute the adapted noisy GMM using the estimated MLLR transform. Only the means of the noisy GMM are adapted, giving $\hat{\boldsymbol{\mu}}_{y,k}$.

Using Eq. (17), recompute the bias terms of SPLICE or MSPLICE.

Compute the cleaned test vectors as $\hat{\mathbf{x}} = \sum_{k} P(k \mid \mathbf{y}) \left( \bar{\mathbf{A}}_k \mathbf{y} + \bar{\mathbf{b}}_k \right)$, using the recomputed bias terms.
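The bias update of Eq. (17) amounts to combining the stored clean means with the adapted noisy means, which can be sketched as follows (illustrative code; the extended-mean parameterisation of the MLLR transform and all names are our own assumptions):

```python
import numpy as np

def runtime_adapt_bias(mu_x, A, W, means_y):
    """Recompute SPLICE/MSPLICE biases after MLLR mean adaptation, Eq. (17).

    mu_x    : (M, D) clean mixture means stored from training
    A       : (M, D, D) per-mixture transforms (unchanged by adaptation)
    W       : (D, D+1) global MLLR mean transform for the noisy GMM
    means_y : (M, D) original noisy mixture means
    """
    M = means_y.shape[0]
    xi = np.hstack([np.ones((M, 1)), means_y])   # extended noisy means
    mu_y_hat = xi @ W.T                          # adapted noisy means
    return np.array([mx - Ak @ my
                     for mx, Ak, my in zip(mu_x, A, mu_y_hat)])
```

Only the biases change at runtime; the transformation matrices and the stored clean means are reused as-is, which is why the adaptation is so cheap compared with HMM-level MLLR.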
6 Experimental Setup
The Aurora-2 task at 8 kHz sampling frequency [8] has been used to perform a comparative study of the proposed techniques against the existing ones. Aurora-2 consists of connected spoken digits with stereo training data. The test set consists of utterances from ten different environments, each at seven distinct SNR levels. The acoustic word models for each digit have been built using left-to-right continuous density HMMs with 16 states and 3 diagonal-covariance Gaussian mixtures per state. The HMM Toolkit (HTK) 3.4.1 has been used for building and testing the acoustic models.
All SPLICE based linear transformations have been applied on 13 dimensional MFCCs, including $c_0$. During HMM training, the features are appended with 13 delta and 13 acceleration coefficients to get a composite 39 dimensional vector per frame. Cepstral mean subtraction (CMS) has been performed in all the experiments. 128-mixture GMMs are built for all SPLICE based experiments. Runtime noise adaptation in the SPLICE framework is performed on the 13 dimensional MFCCs. Data belonging to each SNR level of a test noise condition have been separately used to compute the global transformations. In all SPLICE based experiments, pseudo-cleaning of the clean features has been performed.
To test the efficacy of the non-stereo method on a database which does not contain stereo data, the Aurora-4 task at 8 kHz sampling frequency has been used. Aurora-4 is a continuous speech recognition task with clean and noisy training utterances (non-stereo) and test utterances of 14 environments. The Aurora-4 acoustic models are built using cross-word triphone HMMs of 3 states and 6 mixtures per state. The standard WSJ0 bigram language model has been used during decoding of Aurora-4. A noisy GMM of 512 mixtures is built for evaluating the non-stereo method, using 7138 utterances taken from both the clean and multi-condition training data. This GMM is adapted to the standard clean training set to get the clean GMM.
6.1 Results
Tables (a) and (b) summarise the results of the various algorithms discussed, on the Aurora-2 dataset. All the results are shown in % accuracy, and all the SNR levels mentioned are in decibels. The first seven rows report the overall results on all 10 test noise conditions. The rest of the rows report the average values in the SNR range 20 dB to 0 dB. Table 2 shows the experimental results on the Aurora-4 database.
For reference, the result of standard MLLR adaptation on HMMs [9] is shown in Table (b); this method computes a global 39 dimensional mean transformation and uses two-pass Viterbi decoding.
It can be seen that MSPLICE improves over SPLICE in all noise conditions and at all SNR levels, giving absolute improvements in test-set C and overall. Runtime compensation in the SPLICE framework gives improvements over standard MLLR in test-sets A and B, whereas MSPLICE gives improvements in all conditions; absolute improvements can be observed over both SPLICE with runtime noise adaptation and standard MLLR. Finally, the non-stereo method, though not using stereo data, shows absolute improvements over the Aurora-2 and Aurora-4 baseline models respectively, with only a slight degradation w.r.t. SPLICE in all test cases. The runtime noise adaptation results of the non-stereo method are comparable to those of standard MLLR, and are computationally less expensive.
7 Discussion
In terms of computational cost during testing, MSPLICE and the non-stereo method are identical to conventional SPLICE. Also, there is an almost negligible increase in cost during training. The MLLR mean adaptation in both the non-stereo method and the runtime adaptation is computationally very efficient, and does not need Viterbi decoding.
In terms of performance, MSPLICE achieves good results in all cases without any use of adaptation data, especially in unseen conditions. The non-stereo method assumes a one-to-one mixture correspondence between the noisy and clean GMMs, and gives a slight degradation in performance, which could be attributed to neglecting the outlier data.
Compared with other existing feature normalisation techniques, the techniques in the SPLICE framework operate on individual feature vectors, and no estimation of parameters is required from the test data. So these methods do not suffer from test-data insufficiency problems, and are advantageous for shorter utterances. Also, the testing process is usually faster, and the methods are easily implementable in real-time applications. By extending them to non-stereo data, we believe they become useful in many more applications.
8 Conclusion and Future Work
A modified version of the SPLICE algorithm has been proposed for noise robust speech recognition. It complies better with the assumptions of SPLICE, and improves recognition in highly mismatched and unseen noise conditions. An extension of the method to non-stereo data has been presented. Finally, a convenient runtime adaptation framework has been explained, which is computationally much cheaper than standard MLLR on HMMs. In future, we would like to improve the efficiency of the non-stereo extension of SPLICE, and to extend MSPLICE to the uncertainty decoding framework.
References
[1] J. Droppo and A. Acero, "Maximum mutual information SPLICE transform for seen and unseen conditions," in INTERSPEECH, pp. 989–992, 2005.
[2] Y. Shinohara, T. Masuko, and M. Akamine, "Feature enhancement by speaker-normalized SPLICE for robust speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pp. 4881–4884, IEEE, 2008.
[3] J. Droppo, A. Acero, and L. Deng, "Uncertainty decoding with SPLICE for noise robust speech recognition," in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1, pp. I-57, IEEE, 2002.
[4] K. Chijiiwa, M. Suzuki, N. Minematsu, and K. Hirose, "Unseen noise robust speech recognition using adaptive piecewise linear transformation," in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pp. 4289–4292, IEEE, 2012.
[5] J. Gonzalez, A. Peinado, A. Gomez, and J. Carmona, "Efficient MMSE estimation and uncertainty processing for multienvironment robust speech recognition," Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 5, pp. 1206–1220, 2011.
[6] M. Afify, X. Cui, and Y. Gao, "Stereo-based stochastic mapping for robust speech recognition," IEEE Trans. on Audio, Speech and Lang. Proc., vol. 17, pp. 1325–1334, Sept. 2009.
[7] L. Deng, A. Acero, M. Plumpe, and X. Huang, "Large-vocabulary speech recognition under adverse acoustic environments," in International Conference on Spoken Language Processing, pp. 806–809, 2000.
[8] H.-G. Hirsch and D. Pearce, "The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW), 2000.
[9] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75–98, 1998.