1 Introduction
DNN-based acoustic models have been state-of-the-art for automatic speech recognition over the past few years [1]. While the DNN input consists of multiple frames of acoustic features, the target output is obtained from a frame-level GMM-HMM forced alignment corresponding to the context-dependent tied triphone states, or senones [2]. This procedure results in inefficiency in DNN acoustic modeling [3, 4]. Unlike the conventional practice, the present work argues that the optimal DNN targets are probability distributions rather than Kronecker deltas (hard targets). Earlier studies on optimal training of a neural network for HMM decoding provide rigorous theoretical analysis that supports this idea [5]. Here, we propose a DNN-based data-driven framework to obtain accurate probability distributions (soft targets) for improved DNN acoustic modeling. The proposed approach relies on modeling of low-dimensional senone subspaces in DNN posterior probabilities.
Speech production is known to result from the activation of a few highly constrained articulatory mechanisms, leading to the generation of linguistic units (e.g. phones, senones) on low-dimensional nonlinear manifolds [6, 7]. In the context of DNN acoustic modeling, low-dimensional structures are exhibited in the space of DNN senone posteriors [8]. Low-rank and sparse representations have been found promising for characterizing senone-specific subspaces [9, 10]. The senone-specific structures are superimposed with high-dimensional unstructured noise. Hence, projecting DNN posteriors onto their underlying low-dimensional subspaces enhances the accuracy of the DNN posteriors. In this work, we propose a new application of enhanced DNN posteriors to generate accurate soft targets for DNN acoustic modeling.
Earlier works on exploiting low-dimensionality in DNN acoustic modeling focus on exploiting low-rank and sparse representations to modify DNN architectures for small-footprint implementation. In [11, 12], low-rank decomposition of the neural network’s weight matrices enables reduction in DNN complexity and memory footprint. Similar goals have been achieved by exploiting sparse connections [13] and sparse activations [14] in the hidden layers of a DNN. In another line of research, soft-target-based DNN training has been found effective for enabling model compression [15, 16] and knowledge transfer from an accurate complex model to a smaller network [17]. This approach relies on soft targets providing more information for DNN training than the binary hard alignments.
We propose to bring together the higher information content of soft targets and the accurate model of the senone space provided by low-rank and sparse representations to train superior DNN acoustic models. Soft targets enable characterization of the senone-specific subspaces by quantifying the correlations between senone classes as well as sequential dependencies (details in Section 2.1). This information is manifested in the form of structures visible among a large population of training data posterior probabilities. The potential of these posteriors to be used as soft targets for DNN training is reduced by the presence of unstructured noise. Therefore, to obtain reliable soft targets, we perform low-rank and sparse reconstruction of training data posteriors to preserve the global low-dimensional structures while discarding the random high-dimensional noise. The new DNNs trained with low-rank or sparse soft targets are capable of estimating the test posteriors on a low-dimensional space, which results in better ASR performance. We consider PCA (Section 2.2) and dictionary-based sparse coding (Section 2.3) for generating low-rank and sparse representations respectively. The strength of PCA lies in capturing the linear regularities in the data [18], whereas an overcomplete dictionary used for sparse coding learns to model the nonlinear space as a union of low-dimensional subspaces. Dictionary-based sparse reconstruction also reduces the rank of the senone posterior space [9].
Experimental evaluations are conducted on the AMI corpus [19], a collection of recordings of multi-party meetings for large vocabulary speech recognition. We show in Section 3 that low-rank and sparse soft targets lead to training of better DNN acoustic models. Reductions in word error rate (WER) are observed over the baseline hybrid DNN-HMM system without the need for explicit sparse coding or low-rank reconstruction of test data posteriors. Moreover, they enable effective use of out-of-domain untranscribed data by augmenting AMI training data in a knowledge transfer fashion. DNNs trained with low-rank and sparse soft targets yield up to 4.6% relative improvement in WER, whereas a DNN trained with non-enhanced soft targets fails to exploit any further knowledge provided by the untranscribed data. To the best of our knowledge, a significant benefit of DNN-generated soft targets for training a more accurate DNN acoustic model has not been shown in prior work.
2 Low-Rank and Sparse Soft Targets
This section describes the novel approach towards reliable soft target estimation. We study reasons for regularities among senone posteriors and investigate two systematic approaches to obtain more accurate probabilities as soft targets for DNN acoustic modeling.
2.1 Towards Better Targets for DNN Training
Earlier works on distillation of the DNN knowledge show the potential of soft targets for model compression and the suboptimal nature of the hard alignments [15, 20]. Although hard targets assign a particular senone label to a relatively long sequence of (10 or more) acoustic frames, senone durations are usually shorter. A long context of input frames may lead to the presence of acoustic features corresponding to multiple senones in the input (Fig. 1(a)), so the assumption of binary outputs becomes inaccurate.
In contrast, soft outputs quantify such sequential information using non-zero probabilities for multiple senone classes. Contextual senone dependencies arising in soft targets can be attributed to the ambiguities due to phonetic transitions [20]. Furthermore, the procedure of senone extraction leads to acoustic correlations among multiple classes corresponding to the same phone-HMM states [2], as they all share the same root in the decision tree (Fig. 1(b)).
These dependencies can be characterized by analyzing a large number of senone probabilities from the training data. The frequent dependencies are exhibited as regularities among the correlated dimensions in senone posteriors. As a result, a matrix formed by concatenation of class-specific senone posteriors has a low-rank structure. In other words, class-specific senones lie in low-dimensional subspaces with a dimension higher than unity [9], which violates the principal assumption of binary hard targets.
In practice, inaccuracies in DNN training lead to the presence of unstructured high-dimensional errors (Fig. 1(c)). Therefore, the initial senone posterior probabilities obtained from the forward pass of a DNN trained with hard alignments are not accurate in quantifying the senone dependency structures. Our previous work demonstrates that the erroneous estimations can be separated using low-rank or sparse representations [10, 9]. In the present study, we consider application of PCA and sparse coding to obtain more reliable soft targets for DNN acoustic model training.
2.2 Low-rank Reconstruction Using Eigenposteriors
Let $z_t = [p(s_1|x_t) \dots p(s_k|x_t) \dots p(s_K|x_t)]^\top$ denote a forward pass estimate of the posterior probabilities of $K$ senone classes $\{s_k\}_{k=1}^{K}$, given the acoustic feature $x_t$ at time $t$. The DNN is trained using the initial labels obtained from GMM-HMM forced alignment. We collect $N$ senone posteriors which are labeled as class $s_k$ in the GMM-HMM forced alignment and mean-center them in the logarithmic domain as follows:
$$\bar{z}_t = \ln(z_t) - \mu_{s_k} \qquad (1)$$
where $\mu_{s_k}$ is the mean of the collected posteriors in the log domain. Due to the skewed distribution of the posterior vectors, the logarithm of the posteriors fits the Gaussian assumption of PCA better. We concatenate the $N$ senone posterior vectors after the operation shown in (1) to form a matrix $M_{s_k} \in \mathbb{R}^{K \times N}$. For the sake of brevity, the subscript $s_k$ is dropped in the subsequent expressions. However, all the calculations are performed for each of the senone classes individually.
Principal components of the senone space are obtained via eigenvector decomposition [21] of the covariance matrix of $M$. The covariance matrix is obtained as $C = \frac{1}{N-1} M M^\top$. We factorize the covariance matrix as $C = P \Lambda P^\top$, where $P \in \mathbb{R}^{K \times K}$ identifies the eigenvectors and $\Lambda$ is a diagonal matrix containing the sorted eigenvalues. Eigenvectors in $P$ which correspond to the large eigenvalues in $\Lambda$ constitute the frequent regularities in the subspace, whereas the others carry the high-dimensional unstructured noise. Hence, the low-rank projection matrix is defined as
$$D_{LR} = P_l \in \mathbb{R}^{K \times l} \qquad (2)$$
where $P_l$ is the truncation of $P$ that keeps only the first $l$ eigenvectors and discards the erroneous variability captured by the other $K - l$ components. We select $l$ such that relatively $\sigma$% variability is preserved in the low-rank reconstruction of the original senone matrix $M$.
The eigenvectors stored in the low-rank projection $D_{LR}$ are referred to as “eigenposteriors” of the senone space (in the same spirit as eigenfaces are defined for low-dimensional modeling of human faces [22]).
The low-rank reconstruction of a mean-centered log posterior $\bar{z}_t$, denoted by $\bar{z}_t^{LR}$, is estimated as
$$\bar{z}_t^{LR} = D_{LR} D_{LR}^\top \bar{z}_t \qquad (3)$$
Finally, we add the mean $\mu_{s_k}$ to $\bar{z}_t^{LR}$ and take its exponent to obtain a low-rank senone posterior $z_t^{LR}$ for the acoustic frame $x_t$. Low-rank posteriors obtained for the training data are used as soft targets for learning better DNNs (Fig. 2). We assume that the $\sigma$% variability, which quantifies the low-rank regularities in senone spaces, is a parameter independent of the senone class.
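The low-rank reconstruction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function name, the `eps` floor that guards against log(0), the final row renormalisation, and the synthetic posteriors are all hypothetical additions.

```python
import numpy as np


def lowrank_soft_targets(posteriors, variability=0.80, eps=1e-10):
    """Low-rank reconstruction of the posteriors of ONE senone class.

    posteriors:  (N, K) array, one posterior vector per row.
    variability: fraction of variance to preserve (0.80 mirrors the 80% setting).
    eps:         hypothetical floor that avoids log(0); not part of the paper.
    """
    log_post = np.log(posteriors + eps)
    mean = log_post.mean(axis=0)
    centered = log_post - mean                       # mean-centred log posteriors

    # Covariance over the K senone dimensions and its eigendecomposition.
    cov = centered.T @ centered / (centered.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)     # clip tiny negative round-off
    eigvecs = eigvecs[:, order]

    # Smallest number of components whose cumulative variance reaches the target.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    n_keep = min(int(np.searchsorted(cumulative, variability)) + 1, len(eigvals))
    basis = eigvecs[:, :n_keep]                      # the "eigenposteriors", (K, l)

    # Project onto the subspace, undo the centring, and leave the log domain.
    recon = centered @ basis @ basis.T + mean
    soft = np.exp(recon)
    soft /= soft.sum(axis=1, keepdims=True)          # assumption: renormalise rows
    return soft, n_keep


# Synthetic stand-in for real DNN posteriors: 200 frames, 12 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 12))
logits += 0.05 * rng.normal(size=logits.shape)
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

soft_targets, n_components = lowrank_soft_targets(posteriors, variability=0.80)
print(soft_targets.shape, n_components)
```

In practice the function would be called once per senone class, on the matrix of forward-pass posteriors that the forced alignment assigns to that class.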
2.3 Sparse Reconstruction Using Dictionary Learning
Unlike PCA, overcomplete dictionary learning and sparse coding enable modeling of nonlinear low-dimensional manifolds. Sparse modeling assumes that senone posteriors can be generated as a sparse linear combination of senone space representatives, collected in a dictionary $D_{SP}$. We use the online dictionary learning algorithm [23] to learn an overcomplete dictionary for senone $s_k$ using a collection of $N$ training data posteriors $z_t$ of senone $s_k$, such that
$$D_{SP} = \arg\min_{D, A} \sum_{t=1}^{N} \|z_t - D\alpha_t\|_2^2 + \lambda \|\alpha_t\|_1 \qquad (4)$$
where $A = [\alpha_1 \dots \alpha_N]$ and $\lambda$ is a regularization factor. Again we have dropped the subscript $s_k$, but all calculations are still senone-specific. Sparse reconstruction (Fig. 2) of senone posteriors is thus obtained by first estimating the sparse representation [24] as
$$\alpha_t = \arg\min_{\alpha} \|z_t - D_{SP}\alpha\|_2^2 + \lambda \|\alpha\|_1 \qquad (5)$$
followed by reconstruction as
$$z_t^{SP} = D_{SP}\alpha_t \qquad (6)$$
Sparse reconstructed senone posteriors have previously been found to be more accurate acoustic models for DNN-HMM speech recognition [9]. In particular, it was shown that the rank of senone-specific matrices is much lower after sparse reconstruction. In the present work, we investigate whether they can also provide more accurate soft targets for DNN training. The regularization parameter $\lambda$ in (4)-(5) controls the level of sparsity and the level of noise being removed after sparse reconstruction. Fig. 2 summarizes the low-rank and sparse reconstruction of senone posteriors.
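The sparse coding and reconstruction steps can be illustrated with a small NumPy sketch. Two points are assumptions made for illustration only: the l1-regularised least-squares problem is solved here with plain ISTA (iterative shrinkage-thresholding) instead of the SPAMS solver used in the experiments, and the dictionary is random with unit-norm columns, not learned from posteriors. The input vector is likewise synthetic and is not a probability vector.

```python
import numpy as np


def soft_threshold(v, tau):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def lasso_objective(z, D, alpha, lam):
    """Value of ||z - D alpha||_2^2 + lam * ||alpha||_1."""
    return float(np.sum((z - D @ alpha) ** 2) + lam * np.sum(np.abs(alpha)))


def sparse_code_ista(z, D, lam=0.1, n_iter=500):
    """Approximate the lasso solution for one vector z with a fixed dictionary D."""
    lipschitz = 2.0 * np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ alpha - z)           # gradient of the squared error
        alpha = soft_threshold(alpha - grad / lipschitz, lam / lipschitz)
    return alpha


# Hypothetical overcomplete dictionary: 20 dimensions, 60 unit-norm atoms.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 60))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# A vector built from three atoms plus a little unstructured noise.
true_alpha = np.zeros(60)
true_alpha[[3, 17, 42]] = [1.0, 0.7, 0.5]
z = D @ true_alpha + 0.01 * rng.normal(size=20)

alpha = sparse_code_ista(z, D, lam=0.1)
z_sparse = D @ alpha                                 # sparse reconstruction of z
print(np.count_nonzero(alpha), lasso_objective(z, D, alpha, 0.1))
```

Raising `lam` shrinks more coefficients to exactly zero, which is the mechanism by which the regularization factor trades sparsity against reconstruction error.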
3 Experimental Analysis
In this section we evaluate the effectiveness of low-rank and sparse soft targets to improve the performance of DNN-HMM speech recognition. We also investigate the importance of better DNN acoustic models to exploit information from untranscribed data.
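Table 1: WER (%) on the AMI test set for DNNs trained with different soft targets. SE0 denotes supervised enhancement of the outputs of system 0; FPn denotes soft targets obtained from a forward pass through system n.
System # | Training Data                        | PCA ($\sigma$ = 80) | Sparsity ($\lambda$ = 0.1) | Non-Enhanced Soft Targets
0        | AMI (Baseline WER 32.4%)             | -    | -    | -
1        | AMI(SE0)                             | 31.9 | 31.6 | 32.0
2        | ICSI(FP1) + AMI(SE0)                 | 31.2 | 31.6 | 32.5
3        | LIB100(FP1) + AMI(SE0)               | 31.2 | 31.6 | 32.4
4        | LIB100(FP2) + AMI(SE0)               | 31.0 | 31.8 | 32.4
5        | LIB100(FP2) + ICSI(FP2) + AMI(SE0)   | 30.9 | 31.7 | 32.4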
3.1 Database and Speech Features
Experiments are conducted on the AMI corpus [19], which contains recordings of spontaneous conversations in meeting scenarios. We use recordings from individual head microphones (IHM), comprising around 67 hours of training data, 9 hours of development (dev) data, and 7 hours of test data. 10% of the training data is used for cross-validation during DNN training, whereas the dev set is used for tuning the regularization parameters $\sigma$ and $\lambda$. For experiments using untranscribed additional training data, we use the ICSI meeting corpus [25] and the Librispeech corpus [26]. Data from the ICSI corpus consists of meeting recordings (around 70 hours). Librispeech data is read speech from audiobooks, and we use a 100-hour subset of it.
The Kaldi toolkit [27] is used for training DNN-HMM systems. All DNNs have 9 frames of temporal context at the acoustic input and 4 hidden layers with 1200 neurons each. Input features are 39-dimensional MFCC+$\Delta$+$\Delta\Delta$ ($39 \times 9 = 351$-dimensional input) and the output is a 4007-dimensional senone probability vector. The AMI pronunciation dictionary has 23K words, and a bigram language model is used for decoding. For dictionary learning and sparse coding, the SPAMS toolbox [28] is used.
3.2 Baseline DNN-HMM using Hard and Soft Targets
Our baseline is a hybrid DNN-HMM system trained using forced-aligned targets (IHM setup in [29]). WER using the baseline DNN is 32.4% on the AMI test set. Another baseline is a DNN trained using non-enhanced soft targets from the baseline. This system gives a WER of 32.0%. All soft-target-based DNNs are randomly initialized and trained using cross-entropy loss backpropagation.
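To make the difference between the two baselines concrete, the sketch below evaluates the cross-entropy loss and its gradient with respect to the output-layer logits for a hard (one-hot) target and for a soft target. It is a toy illustration with three classes and made-up numbers; the function names are hypothetical and nothing here reproduces the Kaldi training recipe.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_entropy(logits, targets):
    """Mean cross-entropy; each row of `targets` is a probability distribution."""
    return float(-(targets * log_softmax(logits)).sum(axis=1).mean())


def cross_entropy_grad(logits, targets):
    """Gradient of the mean loss w.r.t. the logits: (softmax - targets) / frames."""
    probs = np.exp(log_softmax(logits))
    return (probs - targets) / logits.shape[0]


# Two frames, three classes (made-up values).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
hard_targets = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0]])
soft_targets = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.8, 0.1]])

loss_hard = cross_entropy(logits, hard_targets)
loss_soft = cross_entropy(logits, soft_targets)
print(loss_hard, loss_soft)
```

The same loss and gradient expressions serve both cases; only the target rows change from Kronecker deltas to full distributions.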
3.3 Generation of Low-rank and Sparse Soft Targets
We group DNN forward pass senone probabilities for the training data into class-specific senone matrices. For this, senone labels from the ground-truth-based GMM-HMM hard alignments are used. Each matrix is restricted to have $N$ vectors of $K$ senone probabilities to facilitate computation of principal components and sparse dictionary learning. We found the average rank of senone matrices, defined as the number of singular values required to preserve 95% variability, to be 44. Dictionaries of 500 columns were learned for each senone, making them nearly 10 times overcomplete. The procedure depicted in Fig. 2 is implemented to generate low-rank and sparse soft targets.
We also encountered memory issues while storing large matrices of senone probabilities for all training and cross-validation data, as they require enormous amounts of storage space (similar to [16]). Hence, we preserve precision only up to the first two decimal places in soft targets, and then normalize each vector to sum to 1 before storing it on disk. We assume that essential information might not be in dimensions with very small probabilities. Although such thresholding can be a compromise to our approach, experiments with higher precision (up to 5 decimal places) showed no significant improvement in ASR. Both low-rank and sparse reconstruction were still computed on full soft targets without any rounding; we perform thresholding only when storing targets on disk.
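The storage step described above (round to two decimals, then renormalise) can be sketched as follows. The function name and example values are hypothetical, and the fallback for a row whose entries all round to zero is an assumption added here so that the sketch never divides by zero; the text does not say how such rows were handled.

```python
import numpy as np


def quantize_soft_targets(targets, decimals=2):
    """Round each posterior to `decimals` places and renormalise rows to sum to 1."""
    rounded = np.round(targets, decimals)
    row_sums = rounded.sum(axis=1, keepdims=True)

    # Assumption: if every entry of a row rounds to zero, keep the unrounded row.
    empty = row_sums[:, 0] == 0.0
    rounded[empty] = targets[empty]
    row_sums[empty] = targets[empty].sum(axis=1, keepdims=True)

    return rounded / row_sums


# Made-up soft targets over four classes.
targets = np.array([[0.9034, 0.0921, 0.0031, 0.0014],
                    [0.4800, 0.4700, 0.0300, 0.0200]])
stored = quantize_soft_targets(targets)
print(stored)
```

After rounding, most of the small entries become exact zeros, so the stored vectors are sparse and compress well.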
First we tune the variability-preserving low-rank reconstruction parameter $\sigma$ and the sparsity regularizer $\lambda$ for better ASR performance on the AMI dev set. When $\sigma$ = 80% of variability is preserved in the principal components space, the most accurate soft targets are achieved for DNN acoustic modeling, resulting in the smallest WER. Likewise, $\lambda = 0.1$ was found to be the optimal value for sparse reconstruction. It may be noted that in both low-rank and sparse reconstruction, there is an optimal amount of enhancement needed for improving ASR. While less enhancement leads to continued presence of noise in soft targets, too much of it results in loss of essential information.
3.4 DNN-HMM Speech Recognition
Speech recognition using DNNs trained with the new soft targets obtained from low-rank and sparse reconstruction is compared in Table 1. System 0 is the baseline hard-target-based DNN. System 1 is built by supervised enhancement of soft outputs obtained from system 0 on AMI training data, as shown in Fig. 2. As expected, training with the soft targets yields lower WER than the baseline hard targets. We can see that both PCA and sparse reconstruction result in more accurate acoustic modeling, where sparse reconstruction achieves 0.8% absolute reduction in WER.
Sparse reconstruction is found to work better than low-rank reconstruction for ASR. This may be due to the higher accuracy of the sparse model in characterizing the nonlinear senone subspaces [8]. Unlike previous works [9, 10], which required two stages of DNN forward pass and explicit low-dimensional projection, a single DNN is learned here that estimates the probabilities directly on a low-dimensional space.
3.5 Training with Untranscribed Data
Given an accurate DNN acoustic model and some untranscribed input speech data, we can obtain soft targets for the new data through forward pass. Assuming that the initial model can generalize well on unseen data, the additional soft targets thus generated can be used to augment our original training data. We propose to learn better DNN acoustic models using this augmented training set. This method is reminiscent of the knowledge transfer approach [15, 16] which is typically used for model compression. In this work, we use the same network architecture for all experiments.
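The augmentation scheme described above amounts to a forward pass over the untranscribed features followed by concatenation with the enhanced transcribed set. The sketch below shows that data flow only; the "teacher" is a hypothetical single affine layer with a softmax standing in for a trained DNN, and all dimensions and array names are made up for illustration.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def make_teacher(weights, bias):
    """Hypothetical stand-in for a trained DNN: one affine layer plus softmax."""
    def forward_pass(features):
        return softmax(features @ weights + bias)
    return forward_pass


def augment_training_set(teacher, feats, enhanced_targets, untranscribed_feats):
    """Label untranscribed frames with teacher posteriors and append them."""
    new_targets = teacher(untranscribed_feats)
    all_feats = np.concatenate([feats, untranscribed_feats], axis=0)
    all_targets = np.concatenate([enhanced_targets, new_targets], axis=0)
    return all_feats, all_targets


rng = np.random.default_rng(0)
n_dim, n_classes = 8, 5
teacher = make_teacher(rng.normal(size=(n_dim, n_classes)), np.zeros(n_classes))

# Transcribed set with (already enhanced) soft targets, and an untranscribed set.
feats = rng.normal(size=(100, n_dim))
enhanced_targets = softmax(rng.normal(size=(100, n_classes)))
untranscribed_feats = rng.normal(size=(40, n_dim))

all_feats, all_targets = augment_training_set(
    teacher, feats, enhanced_targets, untranscribed_feats)
print(all_feats.shape, all_targets.shape)
```

A new network of the same architecture would then be trained on `all_feats` and `all_targets` with the soft-target cross-entropy loss.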
DNNs trained with low-rank and sparse soft targets are used to generate soft targets for the ICSI corpus and Librispeech (LIB100), which are sources of untranscribed data. Table 1 shows interesting observations from various experiments using data augmentation. First, system 2 is built by augmenting enhanced AMI training data with ICSI soft targets generated from system 1. We consider the ICSI corpus, consisting of spontaneous speech from meeting recordings, as in-domain with the AMI corpus. While the PCA-based DNN successfully exploits information from this additional ICSI data, showing significant improvement from system 1 to system 2, the same is not observed using the sparsity-based DNN.
Next, system 3 is built by augmenting enhanced AMI data with Librispeech (LIB100) soft targets obtained from system 1. Read audiobook speech data from Librispeech is out-of-domain as compared to spontaneous speech in AMI. Still, system 3 achieves similar reductions in WER as observed in system 2, which was built using in-domain ICSI data.
Systems 4 and 5 were built to further explore whether we could extract even more information from the out-of-domain Librispeech data by using soft targets from system 2 instead of system 1. Note that system 2, trained using soft targets from both AMI and ICSI spontaneous speech data, is a more accurate model than system 1. Indeed, both systems 4 and 5 perform better than previous systems using PCA-based DNNs, where system 5 outperforms the hard-target-based baseline by 1.5% absolute reduction in WER.
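The absolute and relative reductions for system 5 with PCA follow directly from the WERs in Table 1 (32.4% for the hard-target baseline and 30.9% for system 5):

$$\Delta_{\text{abs}} = 32.4 - 30.9 = 1.5, \qquad \Delta_{\text{rel}} = \frac{32.4 - 30.9}{32.4} = \frac{1.5}{32.4} \approx 0.046 = 4.6\%$$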
Surprisingly, DNN soft targets obtained from sparse reconstruction are not able to exploit the unseen data in any of the systems. We speculate that dictionary learning for sparse coding captures the nonlinearities specific to the AMI database. These nonlinear characteristics may correspond to channel and recording conditions, which vary over different databases and cannot be transcended. On the other hand, the local linearity assumption of PCA leads to extraction of a highly restricted basis set that captures the most important dynamics in the senone probability space. Such regularities mainly address the acoustic dependencies among senones, which are generalizable to other acoustic conditions. Hence, the eigenposteriors are invariant to the exceptional effects due to channel and recording conditions.
Sparse reconstruction is able to mitigate the undesired effects as long as they have been seen in the training data. Given the superior performance of sparse reconstruction of AMI posteriors (in system 1), we believe that sparse modeling might be more powerful if some labeled data from unseen acoustic conditions is made available for dictionary learning.
It may be noted that training with additional untranscribed data is not effective if non-enhanced soft targets are used. In fact, systems 2-5 without low-rank or sparse reconstruction perform worse than system 1, although they have seen more training data.
4 Conclusions and Future Directions
We presented a novel approach to improve DNN acoustic model training using low-rank and sparse soft targets. PCA and sparse coding were employed to identify senone subspaces and enhance senone probabilities through low-dimensional reconstruction. Low-rank reconstruction using PCA relies on the existence of eigenposteriors capturing the local dynamics of senone subspaces. Although sparse reconstruction proves more effective for achieving reliable soft targets when transcribed data is provided, low-rank reconstruction is found to generalize to out-of-domain untranscribed data. A DNN trained on low-rank reconstruction achieves 4.6% relative reduction in WER, whereas a DNN trained using non-enhanced soft targets fails to exploit the information in the additional data. Eigenposteriors can be better estimated using robust PCA [30] and sparse PCA [31] for better modeling of senone subspaces. Furthermore, probabilistic PCA and maximum likelihood eigendecomposition can reduce the computational cost for large-scale applications.
This study supports the use of probabilistic outputs for DNN acoustic modeling. Specifically, enhanced soft targets can be more effective in training small-footprint DNNs based on model compression. In future, we plan to investigate their usage in cross-lingual knowledge transfer [32]. We will also study domain adaptation based on the notion of eigenposteriors.
5 Acknowledgments
Research leading to these results has received funding from SNSF project on “Parsimonious Hierarchical Automatic Speech Recognition (PHASER)” grant agreement number 200021153507.
References
 [1] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdelrahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
 [2] Steve J Young, Julian J Odell, and Philip C Woodland, “Tree-based state tying for high accuracy acoustic modelling,” in Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics, 1994.
 [3] Navdeep Jaitly, Vincent Vanhoucke, and Geoffrey Hinton, “Autoregressive product of multiframe predictions can improve the accuracy of hybrid models,” 2014.
 [4] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, “Gmm-free dnn acoustic model training,” in IEEE ICASSP, 2014.
 [5] Herve Bourlard, Yochai Konig, and Nelson Morgan, REMAP: Recursive Estimation and Maximization of a Posteriori Probabilities; Application to Transition-based Connectionist Speech Recognition, ICSI Technical Report TR-94-064, 1994.
 [6] Li Deng, “Switching dynamic system models for speech articulation and acoustics,” in Mathematical Foundations of Speech and Language Processing, pp. 115–133. Springer New York, 2004.
 [7] Simon King, Joe Frankel, Karen Livescu, Erik McDermott, Korin Richmond, and Mirjam Wester, “Speech production knowledge in automatic speech recognition,” The Journal of the Acoustical Society of America, 2007.
 [8] Pranay Dighe, Afsaneh Asaei, and Hervé Bourlard, “Sparse modeling of neural network posterior probabilities for exemplar-based speech recognition,” Speech Communication, 2015.
 [9] Pranay Dighe, Gil Luyet, Afsaneh Asaei, and Herve Bourlard, “Exploiting low-dimensional structures to enhance dnn based acoustic modeling in speech recognition,” in IEEE ICASSP, 2016.
 [10] Gil Luyet, Pranay Dighe, Afsaneh Asaei, and Hervé Bourlard, “Low-rank representation of nearest neighbor phone posterior probabilities to enhance dnn acoustic modeling,” in Interspeech, 2016.
 [11] Jian Xue, Jinyu Li, and Yifan Gong, “Restructuring of deep neural network acoustic models with singular value decomposition,” in INTERSPEECH, 2013.
 [12] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran, “Low-rank matrix factorization for deep neural network training with high-dimensional output targets,” in IEEE ICASSP, 2013.
 [13] Dong Yu, Frank Seide, Gang Li, and Li Deng, “Exploiting sparseness in deep neural networks for large vocabulary speech recognition,” in IEEE ICASSP, 2012.
 [14] Jian Kang, Cheng Lu, Meng Cai, Wei-Qiang Zhang, and Jia Liu, “Neuron sparseness versus connection sparseness in deep neural network for large vocabulary speech recognition,” in ICASSP, April 2015, pp. 4954–4958.
 [15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
 [16] William Chan, Nan Rosemary Ke, and Ian Lane, “Transferring knowledge from a rnn to a dnn,” in Interspeech, 2015.
 [17] Ryan Price, Kenichi Iso, and Koichi Shinoda, “Wise teachers train better dnn acoustic models,” EURASIP Journal on Audio, Speech, and Music Processing, no. 1, pp. 1–19, 2016.
 [18] Brian Hutchinson, Mari Ostendorf, and Maryam Fazel, “A sparse plus low-rank exponential language model for limited resource scenarios,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 494–504, 2015.
 [19] Iain McCowan, Jean Carletta, W Kraaij, S Ashby, S Bourban, M Flynn, M Guillemot, T Hain, J Kadlec, V Karaiskos, et al., “The ami meeting corpus,” in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, 2005, vol. 88.
 [20] Dan Gillick, Larry Gillick, and Steven Wegmann, “Don’t multiply lightly: Quantifying problems with the acoustic model assumptions in speech recognition,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.
 [21] Jonathon Shlens, “A tutorial on principal component analysis,” arXiv preprint arXiv:1404.1100, 2014.
 [22] L. Sirovich and M. Kirby, “Low-dimensional procedure for the characterization of human faces,” J. Opt. Soc. Am. A, pp. 519–524, 1987.
 [23] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research (JMLR), vol. 11, pp. 19–60, 2010.
 [24] Robert Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
 [25] Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al., “The icsi meeting corpus,” in IEEE ICASSP, 2003.
 [26] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in IEEE ICASSP, 2015.
 [27] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” 2011.
 [28] Julien Mairal, Francis Bach, and Jean Ponce, “Sparse modeling for image and vision processing,” arXiv preprint arXiv:1411.3230, 2014.
 [29] I. Himawan, P. Motlicek, D. Imseng, B. Potard, N. Kim, and J. Lee, “Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition,” in IEEE ICASSP, 2015, pp. 4540–4544.
 [30] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma, “Robust recovery of subspace structures by low-rank representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 99, pp. 1–1, 2013.
 [31] Hui Zou, Trevor Hastie, and Robert Tibshirani, “Sparse principal component analysis,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, pp. 265–286, 2006.
 [32] Pawel Swietojanski, Arnab Ghoshal, and Steve Renals, “Unsupervised cross-lingual knowledge transfer in dnn-based lvcsr,” in IEEE Spoken Language Technology Workshop (SLT), 2012.