I. Introduction and Summary
Deep neural network (DNN) acoustic models have significantly extended the state-of-the-art in speech recognition [1]
and are known to be able to learn significant invariances through many layers of nonlinear transformations
[2]. If the training and deployment conditions of the acoustic model are mismatched, then the runtime data distribution can differ from the training distribution, degrading accuracy; this may be addressed through explicit adaptation to the test conditions [3, 4, 5, 6, 7, 2, 8, 9].

In this paper we explore the use of parametrised and differentiable pooling operators for acoustic adaptation. We introduce the approach of differentiable pooling using speaker-dependent pooling operators, specifically $L_p$-norm pooling and weighted Gaussian pooling (Section III), showing how the pooling parameters may be optimised by minimising the negative log probability of the class given the input data (Section IV), and providing a justification for the use of pooling operators in adaptation (Section V). To evaluate this novel adaptation approach we performed experiments on three corpora – TED talks, Switchboard conversational telephone speech, and AMI meetings – presenting results on using differentiable pooling for speaker-independent acoustic modelling, followed by unsupervised speaker adaptation experiments in which adaptation of the pooling operators is compared (and combined) with learning hidden unit contributions (LHUC) [10, 11] and constrained/feature-space maximum likelihood linear regression (fMLLR) [12].

II. DNN Acoustic Modelling and Adaptation
DNN acoustic models typically estimate the posterior distribution $P(s \mid \mathbf{x}_t)$ over a set of context-dependent tied states $s$ of a hidden Markov model (HMM) [13], given an acoustic observation $\mathbf{x}_t$ [14, 15, 1]. The DNN is implemented as a nested function comprising $L$ processing layers (nonlinear transformations):

$P(s \mid \mathbf{x}_t) = f^{(L)}\big(f^{(L-1)}\big(\cdots f^{(1)}(\mathbf{x}_t)\cdots\big)\big)$   (1)

The model is thus parametrised by a set of weights $\theta = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L}$ in which the $l$-th layer consists of a weight matrix and bias vector, $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$, followed by a nonlinear transformation $\phi^{(l)}$, acting on an arbitrary input $\mathbf{u}$:

$f^{(l)}(\mathbf{u}) = \phi^{(l)}\big(\mathbf{W}^{(l)}\mathbf{u} + \mathbf{b}^{(l)}\big)$   (2)
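The nested-function formulation of (1)-(2) can be sketched numerically as follows (a minimal numpy sketch; the layer sizes, weight scales and variable names are our illustrative choices, not values from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def dnn_forward(x, layers):
    """Nested-function DNN of (1)-(2): each layer applies an affine
    transform followed by a nonlinearity; the final layer uses softmax
    to form a distribution over tied states."""
    h = x
    for i, (W, b) in enumerate(layers):
        a = W @ h + b
        h = softmax(a) if i == len(layers) - 1 else sigmoid(a)
    return h

rng = np.random.default_rng(0)
dims = [8, 16, 16, 5]                # input, two hidden layers, tied-state outputs
layers = [(0.1 * rng.standard_normal((dims[i + 1], dims[i])),
           np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
posterior = dnn_forward(rng.standard_normal(8), layers)
```

The output sums to one by construction of the softmax, so it can be read directly as the posterior over tied states.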
To form a probability distribution, the output layer employs a softmax transformation [16], whereas the hidden-layer activation functions are typically chosen to be either sigmoid or rectified linear units (ReLU) [17]. Yu et al. [2] experimentally demonstrated that the invariance of the internal representations with respect to variabilities in the input space increases with depth (the number of layers), and that the DNN can interpolate well around training samples but fails to extrapolate as the data mismatch increases. Therefore one often explicitly compensates for unseen variabilities in the acoustic space.
Feature-space normalisation increases the invariance to unseen data by transforming the data such that it better matches the training data. In this approach the DNN learns an additional transform of the input features conditioned on the speaker or the environment. The transform, which is typically affine, is parametrised by an additional set of adaptation parameters. The most effective form of feature-space normalisation is constrained (feature-space) maximum likelihood linear regression (MLLR), referred to as fMLLR [12], in which the linear transform parameters are estimated by maximising the likelihood of the adaptation data under a Gaussian mixture model (GMM) / HMM acoustic model. To use fMLLR with a DNN acoustic model it is necessary to estimate a single input transform per speaker (using a trained GMM), using the resultant transformed data to train a DNN in a speaker adaptive training (SAT) manner. At runtime another set of fMLLR parameters is estimated for each speaker and the data transformed accordingly. This technique has consistently and significantly reduced the word error rate (WER) across several different benchmarks for both hybrid
[14, 1] and tandem [18, 19] approaches. There are many successful examples of fMLLR adaptation of DNN acoustic models [20, 6, 21, 1, 22, 23, 24, 8, 25]. One can also estimate the linear transform as an input layer of the DNN, often referred to as a linear input network (LIN) [3, 26, 4, 6]. LIN-based approaches have mostly been used in test-only adaptation schemes, whereas fMLLR requires SAT but usually results in lower WERs.

Auxiliary feature approaches augment the acoustic feature vectors with additional speaker-dependent information computed for each speaker at both training and runtime stages – this is a form of SAT in which the model learns the distribution over tied states conditioned on some additional speaker-specific information. There has been considerable recent work exploring the use of i-vectors [27] for this purpose. I-vectors, which can be regarded as basis vectors spanning a subspace of speaker variability, were first used for adaptation in a GMM framework by Karafiát et al. [28], and were later successfully employed for DNN adaptation [29, 30, 31, 32, 33, 34]. Other examples of auxiliary features include the use of speaker-specific bottleneck features obtained from a speaker separation DNN [35], the use of out-of-domain tandem features [24], as well as speaker codes [36, 37, 38] in which a specific set of units for each speaker is optimised. Kundu et al. [39] present an approach using auxiliary input features derived from the bottleneck layer of a DNN, combined with i-vectors.
Model-based approaches adapt the DNN parameters using data from the target speaker. Liao [40] investigated this approach in both supervised and unsupervised settings using a few minutes of adaptation data. On a large DNN, when all weights were adapted, up to 5% relative improvement was observed for unsupervised adaptation, using a speaker-independent decoding to obtain DNN targets. Yu et al. [9] have explored the use of regularisation for adapting the weights of a DNN, using the Kullback-Leibler (KL) divergence between the output distributions produced by the speaker-independent and the speaker-adapted models. This approach was also recently used to adapt the parameters of sequence-trained models [41]. LIN may also be regarded as a form of model-based adaptation, and related approaches include adaptation using a linear output network (LON) or linear hidden network (LHN) [4, 7, 42].
Directly adapting all the weights of a large DNN is computationally and data intensive, and results in large speaker-dependent parameter sets. Smaller subsets of the DNN weights may be modified, including the biases and slopes of hidden units [43, 7, 44, 34]. Another recently developed approach relies on learning hidden unit contributions (LHUC) for test-only adaptation [10, 11] as well as in a SAT framework [45]. One can also adapt the top layer using Bayesian methods, resulting in a maximum a posteriori (MAP) approach [46], or address the sparsity of context-dependent tied states when few adaptation data points are available by modelling both monophones and context-dependent tied states using multi-task adaptation [47, 48] or a hierarchical output layer [49].
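Since LHUC recurs throughout the experiments, a minimal sketch may help: LHUC inserts a speaker-dependent amplitude for each hidden unit; the `2*sigmoid(r)` re-parametrisation constraining amplitudes to (0, 2) follows the LHUC papers [10, 11], while the function name and example values are ours:

```python
import numpy as np

def lhuc_scale(h, r):
    """Scale each hidden unit's output by a speaker-dependent LHUC
    amplitude; the re-parametrisation 2*sigmoid(r) keeps amplitudes
    in the open interval (0, 2), with r = 0 giving amplitude 1."""
    return h * (2.0 / (1.0 + np.exp(-r)))

h = np.array([0.5, -1.0, 2.0])                  # hidden-layer outputs
adapted = lhuc_scale(h, r=np.array([-2.0, 0.0, 3.0]))
```

Only the vector r is estimated per speaker, so the number of speaker-dependent parameters equals the number of hidden units, far fewer than the full weight set.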
III. Differentiable Pooling
Building on our initial work [50], we present an approach to adaptation by learning hidden-layer pooling operators with parameters that can be learned and adapted in a similar way to the other model parameters. The idea of feature pooling originates from Hubel and Wiesel's pioneering study on the visual cortex in cats [51], and was first used in computer vision to combine spatially local features [52]. Pooling in DNNs involves the combination of a set of hidden unit outputs into a summary statistic. Fixed poolings are typically used, such as average pooling (used in the original formulation of convolutional neural networks – CNNs) [53, 54] and max pooling (used in the context of feature hierarchies [55] and later applied to CNNs [56, 57]). Reducing the dimensionality of hidden layers by pooling subsets of hidden unit activations has become well investigated beyond computer vision, and the operator has been interpreted as a way to learn piecewise linear activation functions – referred to as Maxout [58]. Maxout has been widely investigated for both fully-connected [59, 60, 61] and convolutional [62, 63] DNN-based acoustic models. Max pooling, although differentiable, performs a one-from-$K$ selection, and hence does not allow hidden unit outputs to be interpolated, or their combination to be learned, within a pool.
There have been a number of approaches to pooling with differentiable operators – differentiable pooling – a notion introduced by Zeiler and Fergus [64] in the context of constructing unsupervised feature extractors for support vector machines in computer vision tasks. There has been some interest in the use of $L_p$-norm pooling with CNN models [57, 65], in which the sufficient statistic is the $L_p$ norm of a group of (spatially related) hidden unit activations. Fixed-order $L_p$-norm pooling was recently applied within the context of a convolutional neural network acoustic model [66], where it did not reduce the WER over max pooling, and as an activation function in fully-connected DNNs [67], where it was found to improve over Maxout and ReLU.

III-A. $L_p$-norm (Diff-$L_p$) pooling
In this approach we pool a set of activations using an $L_p$ norm. A hidden-unit pool $R_k$ is formed by a set of affine projections $v_i = \mathbf{w}_i^\top \mathbf{u} + b_i$, $i \in R_k$, which form the input to the $k$-th pooling unit; we write these as an ordered set (vector) $\mathbf{v}_k$. The output $h_k$ of the $k$-th pooling unit is produced as an $L_p$ norm:

$h_k = \|\mathbf{v}_k\|_{p_k} = \Big( \sum_{i \in R_k} |v_i|^{p_k} \Big)^{1/p_k}$   (3)

where $p_k$ is the learnable norm order for the $k$-th unit, which can be jointly optimised with the other parameters in the model. To ensure that (3) satisfies the triangle inequality ($p_k \geq 1$; a necessary property of a norm), during optimisation $p_k$ is reparametrised as $p_k = 1 + \exp(\rho_k)$, where $\rho_k$ is the actual learned parameter. In the limit $p_k \to \infty$ we obtain the max-pooling operator [55]:

$\lim_{p_k \to \infty} \|\mathbf{v}_k\|_{p_k} = \max_{i \in R_k} |v_i|$   (4)

Similarly, if $p_k = 1$ we obtain absolute average pooling (assuming the pool is normalised by $1/|R_k|$). We refer to this model as Diff-$L_p$, and it is parametrised by $\theta = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}, \boldsymbol{\rho}^{(l)}\}_{l=1}^{L}$. Sermanet et al. [65] investigated fixed-order $L_p$ pooling for image classification, which was later applied to speaker-independent acoustic modelling [67]. Here we allow each unit in the model to have a learnable order [69], and we use the pooling parameters to perform model-based test-only acoustic adaptation.
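The limiting behaviour of (3)-(4) under the reparametrised order can be checked numerically (a small numpy sketch; the function name and test values are ours):

```python
import numpy as np

def lp_pool(v, rho):
    """L_p pooling of (3) with the reparametrisation p = 1 + exp(rho),
    which guarantees a valid norm order p >= 1."""
    p = 1.0 + np.exp(rho)
    return (np.abs(v) ** p).sum() ** (1.0 / p)

v = np.array([0.3, -1.2, 0.7])
h_max_like = lp_pool(v, rho=6.0)     # p ~ 400: approaches max pooling (4)
h_l1 = lp_pool(v, rho=-30.0)         # p ~ 1: sum of magnitudes
```

With a large rho the pool output approaches the largest magnitude in the pool, while rho far below zero recovers the unnormalised absolute average; intermediate orders interpolate smoothly between the two.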
III-B. Gaussian kernel (Diff-Gauss) pooling
The second pooling approach estimates the pooling coefficients using a Gaussian kernel. We generate the pooling inputs $\mathbf{v}_k$ at each layer as:

$v_i = a_k \, \phi\big(\mathbf{w}_i^\top \mathbf{u} + b_i\big), \quad i \in R_k$   (5)

where $\phi$ is a nonlinearity ($\tanh$ in this work) and $\mathbf{w}_i^\top \mathbf{u} + b_i$ is a set of affine projections as before. A nonlinearity is essential, as otherwise (contrary to $L_p$ pooling) we would produce a linear transformation through a linear combination of linear projections. $a_k$ is the pool amplitude; this parameter is tied and learned per pool, as this was found to give similar results to per-unit amplitudes (but with fewer parameters), and better results compared to setting $a_k$ to a fixed value [50].
Given the activations (5), the pooling operation is defined as a weighted average over a set of hidden units, where the $k$-th pooling unit is expressed as:

$h_k = \sum_{i \in R_k} w_i v_i$   (6)

The pooling contributions $w_i$ are normalised to sum to one within each pooling region (7), and each weight is coupled with the corresponding value of $v_i$ by a Gaussian kernel (8) (one per pooling unit) parameterised by a mean and precision, $\mu_k$ and $\lambda_k$:

$w_i = \frac{u_i}{\sum_{j \in R_k} u_j}$   (7)

$u_i = \exp\Big( -\frac{\lambda_k}{2} (v_i - \mu_k)^2 \Big)$   (8)

Similar to $L_p$-norm pooling, this formulation allows a generalised pooling to be learned – from average ($\lambda_k \to 0$) to max-like ($\lambda_k \to \infty$) – separately for each pooling unit within a model. The Diff-Gauss model is thus parametrised by $\theta = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}, \mathbf{a}^{(l)}, \boldsymbol{\mu}^{(l)}, \boldsymbol{\lambda}^{(l)}\}_{l=1}^{L}$.
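A numerical sketch of (6)-(8) (our variable names; the kernel follows the unnormalised Gaussian form assumed above) illustrates the interpolation between average and max-like pooling:

```python
import numpy as np

def gauss_pool(v, mu, lam):
    """Gaussian-kernel weighted pooling: kernel responses (8) are
    normalised into weights (7) and applied as a weighted average (6)."""
    u = np.exp(-0.5 * lam * (v - mu) ** 2)   # kernel responses
    w = u / u.sum()                           # weights sum to one per pool
    return (w * v).sum()

v = np.array([0.1, 0.9, 0.4])
h_avg = gauss_pool(v, mu=0.0, lam=0.0)       # lam -> 0: uniform weights, average pooling
h_sel = gauss_pool(v, mu=1.0, lam=1e4)       # large lam: selects the v_i closest to mu
```

With zero precision every input receives equal weight, recovering average pooling; a very large precision concentrates the weight on the input nearest the kernel mean, a max-like selection.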
IV. Learning Differentiable Poolers
We optimise the acoustic model parameters by minimising the negative log probability of the target HMM tied state given the acoustic observations, using gradient descent and error backpropagation [70]; the pooling parameters may be updated in a speaker-dependent manner to adapt the acoustic model to unseen data. In this section we give the necessary partial derivatives for Diff-$L_p$ and Diff-Gauss pooling.
IV-A. Learning and adapting Diff-$L_p$ pooling
In Diff-$L_p$ pooling we learn $p_k$, which we express in terms of $\rho_k$: $p_k = 1 + \exp(\rho_k)$. Error backpropagation requires the partial derivative of the pooling output with respect to $\rho_k$, which is given as:

$\frac{\partial h_k}{\partial \rho_k} = \exp(\rho_k) \, h_k \left( \frac{\sum_{i \in R_k} |v_i|^{p_k} \, \ell_i}{p_k \sum_{i \in R_k} |v_i|^{p_k}} - \frac{\log \sum_{i \in R_k} |v_i|^{p_k}}{p_k^2} \right)$   (9)

where $\ell_i = \log |v_i|$ when $v_i \neq 0$ and 0 otherwise. The backpropagation through the norm itself is implemented as:

$\frac{\partial h_k}{\partial v_i} = \mathrm{sgn}(v_i) \left( \frac{|v_i|}{h_k} \right)^{p_k - 1}$   (10)

where, in the vectorised form below, $\odot$ represents the element-wise Hadamard product, $\oslash$ element-wise division, and $\tilde{\mathbf{h}}_k$ is a vector of the activation $h_k$ repeated $|R_k|$ times, so the resulting operation can be fully vectorised:

$\frac{\partial h_k}{\partial \mathbf{v}_k} = \mathrm{sgn}(\mathbf{v}_k) \odot \big( |\mathbf{v}_k| \oslash \tilde{\mathbf{h}}_k \big)^{p_k - 1}$   (11)

Normalisation by $1/|R_k|$ in (3) is optional (see also Section VII-A) and the partial derivatives in (9) and (10) hold for the unnormalised case also: the effect of normalisation is taken into account in the forward activation $h_k$.

Since (9) and (10) are not continuous everywhere, they need to be stabilised when $v_i = 0$. When computing the logarithm in the numerator of (9) it is also necessary to ensure that each $|v_i| > 0$. In practice, we threshold each element to have at least a small value $\epsilon$ if $|v_i| < \epsilon$. Note that this numerical stabilisation applies only to Diff-$L_p$ units, not Diff-Gauss.
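The derivative in (9), together with the $\epsilon$-floor stabilisation, can be verified against a finite-difference approximation (a sketch based on our reconstruction of (9); the value of `EPS` is an illustrative choice, as the text's threshold is not recoverable here):

```python
import numpy as np

EPS = 1e-8  # illustrative floor on |v_i|, keeping the log in (9) finite

def lp_pool_grad_rho(v, rho):
    """Return the L_p pool output h and dh/drho, chaining dh/dp with
    dp/drho = exp(rho) for the reparametrisation p = 1 + exp(rho)."""
    p = 1.0 + np.exp(rho)
    a = np.maximum(np.abs(v), EPS)            # stabilisation: threshold |v_i|
    s = (a ** p).sum()
    h = s ** (1.0 / p)
    dh_dp = h * ((a ** p * np.log(a)).sum() / (p * s) - np.log(s) / p ** 2)
    return h, dh_dp * np.exp(rho)

v = np.array([0.3, -1.2, 0.0, 0.7])           # the zero entry exercises the floor
h, g = lp_pool_grad_rho(v, rho=0.5)
```

A forward-difference check confirms that the analytic gradient matches a numerical perturbation of rho even when the pool contains a zero activation.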
IV-B. Learning and adapting Diff-Gauss pooling regions
To learn the Diff-Gauss pooling parameters $\{\mu_k, \lambda_k\}$, we require the partial derivatives $\partial h_k / \partial \mu_k$ and $\partial h_k / \partial \lambda_k$ to update the pooling parameters, as well as $\partial h_k / \partial \mathbf{v}_k$ in order to backpropagate error signals to lower layers.

One can compute the partial derivative of (6) with respect to the input activations as:

$\frac{\partial h_k}{\partial \mathbf{v}_k} = \mathbf{w}_k + \big( \mathbf{J}^{w}_k \mathbf{J}^{u}_k \big)^{\top} \mathbf{v}_k$   (12)

where $\mathbf{J}^{w}_k$ is the Jacobian representing the partial derivative $\partial \mathbf{w}_k / \partial \mathbf{u}_k$:

$\mathbf{J}^{w}_k = \frac{\partial \mathbf{w}_k}{\partial \mathbf{u}_k}$   (13)

whose elements can be computed as:

$\frac{\partial w_i}{\partial u_i} = \frac{\sum_{j \in R_k, \, j \neq i} u_j}{\big( \sum_{j \in R_k} u_j \big)^2}$   (14)

$\frac{\partial w_i}{\partial u_j} = \frac{-u_i}{\big( \sum_{m \in R_k} u_m \big)^2}, \quad i \neq j$   (15)

Likewise, $\mathbf{J}^{u}_k$ represents the Jacobian of the kernel function in (8) with respect to $\mathbf{v}_k$:

$\mathbf{J}^{u}_k = \frac{\partial \mathbf{u}_k}{\partial \mathbf{v}_k}$   (16)

and the (diagonal) elements of $\mathbf{J}^{u}_k$ can be computed as:

$\frac{\partial u_i}{\partial v_i} = -\lambda_k (v_i - \mu_k) \, u_i$   (17)

Similarly, one can obtain the gradients with respect to the pooling parameters $\{\mu_k, \lambda_k\}$. In particular, for $\lambda_k$, the gradient is:

$\frac{\partial h_k}{\partial \lambda_k} = \mathbf{v}_k^\top \, \mathbf{J}^{w}_k \, \frac{\partial \mathbf{u}_k}{\partial \lambda_k}$   (18)

where the elements of $\partial \mathbf{u}_k / \partial \lambda_k$ are:

$\frac{\partial u_i}{\partial \lambda_k} = -\frac{1}{2} (v_i - \mu_k)^2 \, u_i$   (19)

The corresponding gradient for $\mu_k$ is obtained in (20)-(21). Notice that (17) and (21) are symmetric, hence $\partial u_i / \partial \mu_k = -\partial u_i / \partial v_i$, and to compute (20) one can reuse the corresponding term in (12), as follows:

$\frac{\partial h_k}{\partial \mu_k} = \mathbf{v}_k^\top \, \mathbf{J}^{w}_k \, \frac{\partial \mathbf{u}_k}{\partial \mu_k}$   (20)

$\frac{\partial u_i}{\partial \mu_k} = \lambda_k (v_i - \mu_k) \, u_i$   (21)
V. Representational efficiency of pooling units
The aim of model-based DNN adaptation is to alter the learned speaker independent representation in order to improve the classification accuracy for data from a possibly mismatched test distribution. Owing to the highly distributed representations that are characteristic of DNNs, it is rarely clear which parameters should be adapted in order to generalise well to a new speaker or acoustic condition.
Pooling enables decision boundaries to be altered, through the selection of relevant hidden features, while keeping the parameters of the feature extractors (the hidden units) fixed: this is similar to LHUC adaptation [68]. The pooling operators allow for a geometrical interpretation of the decision boundaries and how they will be affected by a constrained adaptation – the units within the pool are jointly optimised given the pooling parametrisation, and share some underlying relationship within the pool.
This is visualised for $L_p$ units in Fig. 2. Fig. 2 (a) illustrates the unit circles obtained by solving $\|\mathbf{v}\|_p = 1$ for different orders $p$, with a pool of two linear inputs. Such an $L_p$ unit is capable of producing closed-region decision boundaries, illustrated in Fig. 2 (b). The distance threshold is implicitly learned from data (through the affine projection parameters, given $p$), resulting in an efficient representation [69, 67] compared with representing such boundaries using sigmoid units or ReLUs, which would require more parameters. Figs. 2 (c) and (d) show how those boundaries are affected when $p = 1$ (average pooling) and $p \to \infty$ (max pooling), while keeping the remaining parameters fixed. As shown in Section VII, we found that updating $p$ is an efficient and relatively low-dimensional way to adjust decision boundaries such that the model's accuracy on the adaptation data distribution improves.
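The claim that an $L_p$ unit carves out a closed region whose shape depends on $p$ can be checked numerically (our sketch; a grid-based area estimate stands in for the plots of Fig. 2 (a)):

```python
import numpy as np

def in_lp_ball(pts, p, r=1.0):
    """Membership in the closed region ||v||_p <= r carved out by an
    L_p unit over a pool of two linear inputs."""
    return (np.abs(pts) ** p).sum(axis=1) ** (1.0 / p) <= r

# grid over [-1.1, 1.1]^2; the fraction of points inside approximates
# the area of the decision region for each order p
g = np.linspace(-1.1, 1.1, 221)
pts = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
areas = {p: in_lp_ball(pts, p).mean() for p in (1.0, 2.0, 30.0)}
```

The region grows monotonically from the $L_1$ diamond through the $L_2$ disc towards the $L_\infty$ square as the order increases, which is exactly the geometry that adapting $p$ exploits.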
It is also possible to update the biases (Fig. 2 (e), red contours) and the LHUC amplitudes (Fig. 2 (f), red contours). We experimentally investigate how each approach impacts adaptation WER in Section VII-B. Although models implementing Diff-Gauss units are theoretically less efficient in terms of speaker independent (SI) representations compared to $L_p$ units, and comparable to standard fully-connected models, the pooling mechanism still allows for more efficient speaker adaptation in terms of the number of speaker-dependent (SD) parameters.
VI. Experimental setups
We have carried out experiments on three corpora: the TED talks corpus [71], following the IWSLT evaluation protocol (www.iwslt.org); the Switchboard corpus of conversational telephone speech [72] (ldc.upenn.edu); and the AMI meetings corpus [73, 74] (corpus.amiproject.org). Unless explicitly stated otherwise, our baseline models share a similar structure across the tasks – DNNs with 6 hidden layers (2,048 units per layer) using a sigmoid nonlinearity. The output softmax layer models the distribution of context-dependent clustered tied states [75]. The features are presented in 11-frame (±5) context windows. All adaptation experiments, if not stated otherwise, were performed unsupervised, using adaptation targets obtained from a first-pass speaker-independent decoding of the corresponding SI system.

TED: The training data consisted of 143 hours of speech (813 talks) and the systems follow our previously described recipe [8]. However, compared to our previous work [8, 11, 50], our systems here make use of more accurate language models developed for our IWSLT-2014 systems [76]: in particular, the final reported results use a 4-gram language model estimated from 751 million words. The baseline TED acoustic models were trained on unadapted PLP features with first and second order time derivatives. We present results on four IWSLT test sets: dev2010, tst2010, tst2011 and tst2013, containing 8, 11, 8, and 28 talks respectively.
AMI: We follow a Kaldi GMM recipe [77] and use the individual headset microphone (IHM) recordings. On this corpus, we train the acoustic models using 40 mel-filterbank (FBANK) features. We decode with a pruned 3-gram language model estimated from 800k words of AMI training transcripts, interpolated with an LM trained on Fisher conversational telephone speech transcripts (1M words) [78].
Switchboard (SWBD): We follow a Kaldi GMM recipe [79, 80], using Switchboard-1 Release 2 (LDC97S62). (To stay compatible with our previous adaptation work on Switchboard [45, 68] we use the older set of Kaldi recipe scripts called s5b, and our baseline results are comparable with the corresponding baseline numbers previously reported. A newer set of improved scripts exists under s5c which, in comparison to s5b, offers about 1.5% absolute lower WER.) Our baseline unadapted acoustic models were trained on MFCC features, while the SAT-trained fMLLR variants utilise the usual Kaldi feature pre-processing pipeline, which is MFCC+LDA/MLLT+fMLLR (MFCC: mel-frequency cepstral coefficients; LDA: linear discriminant analysis; MLLT: maximum likelihood linear transform). The results are reported on the full Hub5'00 set (LDC2002S09) – eval2000. eval2000 contains two types of data: Switchboard, which is better matched to the training data, and CallHome (CHE) English. Our reported results use 3-gram LMs estimated from the Switchboard and Fisher Corpus transcripts.
VII. Results
VII-A. Baseline speaker independent models
The structures of the differentiable pooling models were selected such that the number of parameters was comparable to the corresponding baseline DNN models, described in detail in [68]. For the Diff-$L_p$ and Diff-$L_2$ types, the resulting models utilised non-overlapping pooling regions, with 900 $L_p$ units per layer. The Diff-Gauss models had pool sizes set to the value found to work best in our previous work [50], which (assuming non-overlapping regions) results in 1175 pooling units per layer.
Training speaker independent Diff-$L_p$ and Diff-$L_2$ models: For both Diff-$L_p$ and Diff-$L_2$ we trained with an initial learning rate of 0.008 (for MFCC, PLP and FBANK features) and 0.006 (for fMLLR features). The learning rate was adjusted using the newbob learning scheme [81], based on the validation frame error rate. We found that applying explicit pool normalisation (dividing by $|R_k|$ in (3)) gives consistently higher error rates (typically an absolute increase of 0.3% WER): hence we used unnormalised units in all experiments. We did not apply post-layer normalisation [67]. Instead, we used a max-norm approach – after each update we scaled the columns of the fully-connected weight matrices such that their norms remained below a given threshold [82]. For Diff-$L_p$ models we initialised the orders to $p = 2$. These parameters were optimised on TED and applied directly, without further tuning, to the other two corpora. In this work we have focussed on adaptation; Zhang et al. [67] have reported further speaker independent experiments for fixed-order $L_p$ units.
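The max-norm step described above can be sketched as a post-update projection of each weight column back onto an L2 ball (our function name; the threshold value here is illustrative, as the setting used in the text is not recoverable from this copy):

```python
import numpy as np

def max_norm_columns(W, c=1.0):
    """Max-norm constraint: after a gradient update, rescale any column
    of W whose L2 norm exceeds the threshold c back onto the ball of
    radius c; columns already inside the ball are left untouched."""
    norms = np.linalg.norm(W, axis=0)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 8))     # a weight matrix after an update
Wc = max_norm_columns(W, c=1.0)
```

Applying the projection after every update bounds the column norms throughout training without changing the direction of any weight vector.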
Training speaker independent Diff-Gauss models: The initial learning rate was set to 0.08 (regardless of the feature type), again adjusted using newbob. Initial pooling parameters were sampled randomly from normal distributions. Otherwise, the hyper-parameters were the same as for the baseline DNN models.

Baseline speaker independent results: Table I gives speaker independent results for each of the considered model types. The Diff-Gauss and Diff-$L_p$/Diff-$L_2$ models have comparable WERs, with a small preference towards Diff-$L_p$ in terms of the final WER on TED and AMI; all have lower average WER than the baseline DNN. The gap between the pooled models increases on AMI data, where Diff-$L_p$ has a substantially lower WER (3.2% relative) than the fixed-order Diff-$L_2$, which in turn has a lower WER than the other two models (Diff-Gauss and baseline DNN) by 2.1% relative.
Fig. 3 gives more insight into the Diff-$L_p$ models by showing how the final distributions of the learned orders $p$ differ across the AMI, TED and SWBD corpora. $p$ deviates more from its initialisation in the lower layers of the model; there is also a difference across corpora. This follows the intuition of how a multi-layer network builds its representation: lower layers are more dependent on acoustic variabilities, normalising for such effects, and hence the feature extractors may differ across datasets – in contrast to the upper layers, which rely on features abstracted away from the acoustic data. For these corpora, the order rarely exceeded 3, sometimes dropping below 2 – especially for layer 1 with SWBD data. However, most units, especially in higher layers, tend to have orders close to 2. This corresponds to previous work [67] in which fixed-order $L_2$ units tended to obtain lower WER. A similar analysis of Diff-Gauss pooling does not show large data-dependent differences in the learned pooling parameters.
Training speed: Table II shows the average training speeds for each of the considered models. Training pooling units is significantly more expensive than training baseline DNN models. This is to be expected, as the pooling operations cannot be easily and fully vectorised. In our implementation, training the Diff-Gauss or Diff-$L_p$ models is about 40% slower than training a baseline DNN. Not optimising the orders during training (i.e. not computing (9)) decreases the gap to about 20%. This indicates that training using fixed-order $L_p$ units, and then adapting the orders in a speaker-dependent manner, could be a good compromise.
VII-B. Adaptation experiments
We initially used the TED talks corpus to investigate how WERs are affected by adapting different layers in the model. The results indicated that adapting only the bottom layer brings the largest drop in WER; however, adapting more layers further improves the accuracy for both Diff-$L_p$ and Diff-Gauss models (Fig. 4 (a)). Since obtaining the gradients for the pooling parameters at each layer is inexpensive compared to the overall backpropagation, and adapting the bottom layer gives the largest gains, in the remainder of this work we adapt all pooling units. Similar trends hold when pooling adaptation is combined with LHUC adaptation, which on tst2010 improves the accuracies by 0.2-0.3% absolute.
Fig. 4 (b) shows WER vs. the number of adaptation iterations. The results indicate that one adaptation iteration is sufficient and, more importantly, that the model does not overfit when more iterations are used. This suggests that it is not necessary to regularise the model carefully (by a Kullback-Leibler divergence penalty [9], for instance), which is usually required when weights that directly transform the data are adapted. In the remainder, we adapt all models with a fixed learning rate for three iterations (optimised on dev2010).

Table III shows the effect of adapting different pooling parameters (including LHUC amplitudes) for Diff-$L_p$ units. Updating only the orders $p$, rather than any other standalone pooling parameter, gives a lower WER than LHUC adaptation with the same number of parameters (cf. Fig. 2); however, updating both brings further reductions in WER. Adapting the biases is more data-dependent, with a substantial increase in WER for SWBD; this also significantly increases the number of adapted parameters. Hence we adapted either $p$ alone, or $p$ together with LHUC, in the remaining experiments.
Table IV shows a similar analysis for the Diff-Gauss model. For Diff-Gauss, it is beneficial to update both $\mu$ and $\lambda$ (as in [50]), and LHUC was also found to be complementary. Notice that adapting with LHUC scalers is similar to altering the amplitudes $a_k$ in (5) (assuming $a_k$ is tied per pool, as mentioned in Section III-B). As such, new parameters need not be introduced to adapt Diff-Gauss with LHUC, as is the case for Diff-$L_p$ units. In fact, the last two rows of Table IV show that jointly updating $\mu$, $\lambda$ and $a$ gives lower WER than updating $\mu$ and $\lambda$ and applying LHUC after pooling (see Fig. 1).
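To make the test-only adaptation procedure concrete, here is a toy sketch in the spirit of Diff-$L_p$ adaptation: the projections stay frozen and only the order (via rho) is updated by gradient descent. The squared-error objective, data, and all constants are ours, standing in for the negative log-probability of first-pass targets used in the actual experiments:

```python
import numpy as np

def lp_pool(v, rho):
    p = 1.0 + np.exp(rho)
    a = np.maximum(np.abs(v), 1e-8)
    return (a ** p).sum() ** (1.0 / p)

def grad_rho(v, rho):
    # analytic dh/drho, following the derivative of the L_p pool in (9)
    p = 1.0 + np.exp(rho)
    a = np.maximum(np.abs(v), 1e-8)
    s = (a ** p).sum()
    h = s ** (1.0 / p)
    return h * ((a ** p * np.log(a)).sum() / (p * s) - np.log(s) / p ** 2) * np.exp(rho)

# Toy "adaptation data": pooled activations per frame, with targets that
# favour max pooling, so the adapted order should grow beyond its
# speaker-independent initialisation p = 2 (rho = 0).
rng = np.random.default_rng(2)
frames = rng.standard_normal((50, 3))
targets = np.abs(frames).max(axis=1)

def loss(rho):
    return np.mean([(lp_pool(v, rho) - t) ** 2
                    for v, t in zip(frames, targets)])

rho = 0.0                            # weights W, b stay frozen throughout
loss_before = loss(rho)
for _ in range(50):                  # a few gradient-descent iterations
    g = np.mean([2.0 * (lp_pool(v, rho) - t) * grad_rho(v, rho)
                 for v, t in zip(frames, targets)])
    rho = min(rho - 0.3 * g, 5.0)    # cap rho for numerical safety
loss_after = loss(rho)
```

Only the scalar rho changes during this loop, mirroring the low-dimensional speaker-dependent parameter set that pooling adaptation exploits.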
Analysis of Diff-$L_p$: Fig. 5 shows how the distribution of the orders $p$ changes after the Diff-$L_p$ model adapts to each of the 28 speakers of tst2013. We plot the speaker independent histograms as well as the contours of the mean bin frequencies for each layer. For the adapted models the distributions of $p$ become less dispersed, especially in higher layers, which can be interpreted as shrinking the decision regions of particular units (cf. Fig. 2). This follows the intuition that speaker adaptation involves reducing the variability that needs to be modelled, in contrast to the speaker independent model.
Taking into account the increased training time of Diff-$L_p$ models, one can also consider training fixed-order Diff-$L_2$ models [67] and adapting the orders using (9). The results in Fig. 5, as well as later results, cover this scenario. The adapted Diff-$L_2$ models display a similar trend in the distribution of $p$ to the Diff-$L_p$ models.
Analysis of Diff-Gauss: We performed a similar investigation on the learned Diff-Gauss pooling parameters (Fig. 6). In the bottom layers they are characterised by large negative means and positive precisions, which has the effect of turning off many units. After adaptation, some of them become more active, which can be seen in the shifted distributions of the adapted pooling parameters in Fig. 6. Adaptation with Diff-Gauss has a similar effect to the adaptation of slopes and amplitudes [44, 83], but adapts fewer parameters.
Amount of adaptation data and quality of targets: We investigated the effect of the amount of adaptation data by randomly selecting adaptation utterances from tst2010 to give totals of 10s, 30s, 60s, 120s, 300s and more speaker-specific adaptation data per talker (Fig. 7 (a)). The WERs are an average over three independent runs, each sampling a different set of adaptation utterances (we performed more such passes in our previous work [11, 50]; however, neither LHUC nor the differentiable pooling operators were sensitive to this aspect, resulting in small error bars between results obtained with different random utterances). The Diff-$L_p$ models offer lower WER and more rapid adaptation, with 10s of adaptation data resulting in a decrease in WER of 0.6% absolute (3.6% relative), which increases up to 2.1% absolute (14.4% relative) when using all the speaker's data in an unsupervised manner. Diff-Gauss is comparable in terms of WER to a DNN adapted with LHUC. In addition, both methods are complementary to LHUC adaptation, and to feature-space adaptation with fMLLR (Tables VI and VII).
In order to demonstrate the modelling capacities of the different model-based adaptation techniques, we carried out a supervised adaptation (oracle) experiment in which the adaptation targets were obtained by aligning the audio data with the reference transcripts (Fig. 7 (b)). We do not refine what the model knows about speech, nor the way it classifies it (the feature receptors and output layer are fixed during adaptation and remain speaker independent), but show that the recomposition and interpolation of these basis functions to approximate the unseen distribution of adaptation data is able to decrease the WER by 26.7% relative for the Diff-$L_p$ + LHUC scenario. The methods are also not very sensitive to the quality of the adaptation targets: they show very similar trends to LHUC, for which exact results for different qualities of adaptation targets (resulting from rescoring adaptation hypotheses with different language models) were reported in [68].
Summary: Results for the proposed techniques are summarised in Tables V, VI, and VII for AMI, TED, and SWBD, respectively. The overall observed trends are as follows: (I) speaker independent pooling models return lower WERs than the baseline DNNs, with Diff-$L_2$ and Diff-$L_p$ ahead of Diff-Gauss (although the ranking of the last two seems to be data-dependent); (II) the pooling models (Diff-Gauss, Diff-$L_2$ and Diff-$L_p$) are complementary to both fMLLR and LHUC adaptation – as expected, the final gain depends on the degree of data mismatch; (III) one can effectively train speaker independent Diff-$L_2$ models and later alter the orders in a speaker-dependent manner; (IV) the average relative improvements across all tasks with respect to the baseline unadapted DNN models were 6.8% for Diff-Gauss, 9.1% for Diff-$L_2$ and 10.4% for Diff-$L_p$; and (V) when comparing LHUC-adapted DNNs to LHUC-adapted differentiable pooling models, the relative reductions in WER for the pooling models were 2%, 3.4% and 4.8% for Diff-Gauss, Diff-$L_2$ and Diff-$L_p$, respectively.
VIII. Discussion and Conclusions
We have proposed the use of differentiable pooling operators within DNN acoustic models to perform unsupervised speaker adaptation. Differentiable pooling operators offer a relatively low-dimensional set of parameters which may be adapted in a speaker-dependent fashion.
We investigated the complementarity of differentiable pooling adaptation with two other approaches – model-based LHUC adaptation and feature-space fMLLR adaptation. We have not performed an explicit comparison with an i-vector approach to adaptation. However, some recent papers have compared i-vector adaptation with LHUC and/or fMLLR on similar data, which enables us to make some indirect comparisons. For example, Samarakoon and Sim [34] showed that speaker-adaptive training with i-vectors gives results comparable to test-only LHUC on TED data, and Miao et al. [33] suggested that LHUC is better than a standard use of i-vectors (as in Saon et al. [29]) on TED data, with more sophisticated i-vector post-processing needed to equal LHUC. Since the proposed Diff-$L_p$ and Diff-Gauss techniques resulted in WERs that were at least as good as LHUC (and were found to be complementary to fMLLR), we conclude that the proposed pooling-based adaptation techniques are competitive.
In the future, one could investigate extending the proposed techniques to speaker adaptive training (SAT) [84, 85], for example in a similar spirit to SAT-LHUC [45]. In addition, it would be interesting to investigate the suitability of adapting pooling regions in the framework of sequence-discriminative training [86, 87, 79]. Our experience with LHUC in this framework [68], together with the observation that the pooling models are not prone to overfitting with small amounts of adaptation data, suggests that adaptation based on differentiable pooling is a promising technique for sequence-trained models.
Acknowledgement
The NST research data collection may be accessed at http://datashare.is.ed.ac.uk/handle/10283/786. This research used a K40 GPGPU board donated by NVIDIA Corporation. The authors would like to thank the reviewers for insightful comments that helped to improve the manuscript.
References
 [1] G Hinton, L Deng, D Yu, GE Dahl, A Mohamed, N Jaitly, A Senior, V Vanhoucke, P Nguyen, TN Sainath, and B Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov 2012.
 [2] D Yu, M Seltzer, J Li, JT Huang, and F Seide, “Feature learning in deep neural networks – studies on speech recognition,” in Proc. ICLR, 2013.
 [3] J Neto, L Almeida, M Hochberg, C Martins, L Nunes, S Renals, and T Robinson, “Speaker adaptation for hybrid HMM–ANN continuous speech recognition system,” in Proc. Eurospeech, 1995, pp. 2171–2174.
 [4] B Li and KC Sim, “Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems,” in Proc. Interspeech, 2010.
 [5] J Trmal, J Zelinka, and L Müller, “On speaker adaptive training of artificial neural networks,” in Proc. Interspeech, 2010.
 [6] F Seide, X Chen, and D Yu, “Feature engineering in contextdependent deep neural networks for conversational speech transcription,” in Proc. IEEE ASRU, 2011.
 [7] K Yao, D Yu, F Seide, H Su, L Deng, and Y Gong, “Adaptation of contextdependent deep neural networks for automatic speech recognition.,” in Proc. IEEE SLT, 2012, pp. 366–369.
 [8] P Swietojanski, A Ghoshal, and S Renals, “Revisiting hybrid and GMMHMM system combination techniques,” in Proc. IEEE ICASSP, 2013, pp. 6744–6748.
 [9] D Yu, K Yao, H Su, G Li, and F Seide, “KLdivergence regularized deep neural network adaptation for improved large vocabulary speech recognition.,” in Proc. IEEE ICASSP, 2013, pp. 7893–7897.
 [10] O AbdelHamid and H Jiang, “Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition.,” in Proc. ICSA Interspeech, 2013, pp. 1248–1252.
 [11] P Swietojanski and S Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in Proc. IEEE SLT, 2014.
 [12] MJF Gales, “Maximum likelihood linear transformations for HMMbased speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98, April 1998.
 [13] SJ Young and PC Woodland, “State clustering in hidden Markov modelbased continuous speech recognition,” Computer Speech and Language, vol. 8, no. 4, pp. 369–383, 1994.
 [14] H Bourlard and N Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.
 [15] S Renals, N Morgan, H Bourlard, M Cohen, and H Franco, “Connectionist probability estimators in HMM speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 161–174, 1994.

 [16] JS Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing, F Fogelman Soulié and J Hérault, Eds., pp. 227–236. Springer, 1990.
 [17] V Nair and G Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 131–136.
 [18] H Hermansky, DPW Ellis, and S Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Proc. IEEE ICASSP, 2000, pp. 1635–1638.
 [19] F Grezl, M Karafiat, S Kontar, and J Cernocky, “Probabilistic and bottleneck features for LVCSR of meetings,” in Proc. IEEE ICASSP, 2007, pp. IV–757–IV–760.

 [20] A Mohamed, TN Sainath, G Dahl, B Ramabhadran, GE Hinton, and MA Picheny, “Deep belief networks using discriminative features for phone recognition,” in Proc. IEEE ICASSP, 2011, pp. 5060–5063.
 [21] T Hain, L Burget, J Dines, PN Garner, F Grézl, A El Hannani, M Karafíat, M Lincoln, and V Wan, “Transcribing meetings with the AMIDA systems,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 486–498, 2012.
 [22] TN Sainath, B Kingsbury, and B Ramabhadran, “Autoencoder bottleneck features using deep belief networks,” in Proc. IEEE ICASSP, 2012, pp. 4153–4156.
 [23] TN Sainath, A Mohamed, B Kingsbury, and B Ramabhadran, “Deep convolutional neural networks for LVCSR,” in Proc. IEEE ICASSP, 2013, pp. 8614–8618.
 [24] P Bell, P Swietojanski, and S Renals, “Multilevel adaptive networks in tandem and hybrid ASR systems,” in Proc. IEEE ICASSP, 2013.
 [25] T Yoshioka, A Ragni, and MJF Gales, “Investigation of unsupervised adaptation of DNN acoustic models with filter bank input,” in Proc. IEEE ICASSP, 2014, pp. 6344–6348.
 [26] V Abrash, H Franco, A Sankar, and M Cohen, “Connectionist speaker normalization and adaptation,” in Proc. Eurospeech, 1995, pp. 2183–2186.
 [27] N Dehak, PJ Kenny, R Dehak, P Dumouchel, and P Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans Audio, Speech and Language Processing, vol. 19, pp. 788–798, 2010.
 [28] M Karafiat, L Burget, P Matejka, O Glembek, and J Cernocky, “I-vector-based discriminative adaptation for automatic speech recognition,” in Proc. IEEE ASRU, 2011.
 [29] G Saon, H Soltau, D Nahamoo, and M Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. IEEE ASRU, 2013, pp. 55–59.
 [30] A Senior and I Lopez-Moreno, “Improving DNN speaker independence with i-vector inputs,” in Proc. IEEE ICASSP, 2014, pp. 225–229.
 [31] V Gupta, P Kenny, P Ouellet, and T Stafylakis, “I-vector based speaker adaptation of deep neural networks for French broadcast audio transcription,” in Proc. IEEE ICASSP, 2014.
 [32] P Karanasou, Y Wang, MJF Gales, and PC Woodland, “Adaptation of deep neural network acoustic models using factorised i-vectors,” in Proc. ISCA Interspeech, 2014, pp. 2180–2184.
 [33] Y Miao, H Zhang, and F Metze, “Speaker adaptive training of deep neural network acoustic models using i-vectors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1938–1949, Nov 2015.
 [34] L Samarakoon and KC Sim, “On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,” in Proc. IEEE ICASSP, 2016, pp. 5275–5279.
 [35] Y Liu, P Zhang, and T Hain, “Using neural network front-ends on far field multiple microphones based speech recognition,” in Proc. IEEE ICASSP, 2014, pp. 5542–5546.
 [36] JS Bridle and S Cox, “RecNorm: Simultaneous normalisation and classification applied to speech recognition,” in Advances in Neural Information Processing Systems, 1990.
 [37] O Abdel-Hamid and H Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in Proc. IEEE ICASSP, 2013, pp. 4277–4280.
 [38] S Xue, O Abdel-Hamid, J Hui, L Dai, and Q Liu, “Fast adaptation of deep neural network based on discriminant codes for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1713–1725, Dec 2014.
 [39] S Kundu, G Mantena, Y Qian, T Tan, M Delcroix, and KC Sim, “Joint acoustic factor learning for robust deep neural network based automatic speech recognition,” in Proc. IEEE ICASSP, March 2016, pp. 5025–5029.
 [40] H Liao, “Speaker adaptation of context dependent deep neural networks,” in Proc. IEEE ICASSP, 2013, pp. 7947–7951.
 [41] Y Huang and Y Gong, “Regularized sequence-level deep neural network model adaptation,” in Proc. ISCA Interspeech, 2015, pp. 1081–1085.
 [42] T Ochiai, S Matsuda, X Lu, C Hori, and S Katagiri, “Speaker adaptive training using deep neural networks,” in Proc. IEEE ICASSP, 2014, pp. 6349–6353.
 [43] SM Siniscalchi, J Li, and CH Lee, “Hermitian polynomial for speaker adaptation of connectionist speech recognition systems,” IEEE Trans Audio, Speech, and Language Processing, vol. 21, pp. 2152–2161, 2013.
 [44] Y Zhao, J Li, J Xue, and Y Gong, “Investigating online low-footprint speaker adaptation using generalized linear regression and click-through data,” in Proc. IEEE ICASSP, 2015, pp. 4310–4314.
 [45] P Swietojanski and S Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in Proc. IEEE ICASSP, 2016, pp. 5010–5014.
 [46] Z Huang, SM Siniscalchi, IF Chen, J Wu, and CH Lee, “Maximum a posteriori adaptation of network parameters in deep models,” arXiv preprint arXiv:1503.02108, 2015.
 [47] Z Huang, J Li, SM Siniscalchi, IF Chen, J Wu, and CH Lee, “Rapid adaptation for deep neural networks through multi-task learning,” in Proc. ISCA Interspeech, 2015, pp. 3625–3629.
 [48] P Swietojanski, P Bell, and S Renals, “Structured output layer with auxiliary targets for context-dependent acoustic modelling,” in Proc. ISCA Interspeech, 2015, pp. 3605–3609.
 [49] R Price, K Iso, and K Shinoda, “Speaker adaptation of deep neural networks using a hierarchy of output layers,” in Proc. IEEE SLT, 2014.
 [50] P Swietojanski and S Renals, “Differentiable pooling for unsupervised speaker adaptation,” in Proc. IEEE ICASSP, 2015, pp. 4305–4309.
 [51] D Hubel and T Wiesel, “Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex,” Journal of Physiology, vol. 160, pp. 106–154, 1962.
 [52] K Fukushima and S Miyake, “Neocognitron: A new algorithm for pattern recognition tolerant of deformations,” Pattern Recognition, vol. 15, pp. 455–469, 1982.

 [53] Y LeCun, B Boser, JS Denker, D Henderson, RE Howard, W Hubbard, and LD Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, pp. 541–551, 1989.
 [54] Y LeCun, L Bottou, Y Bengio, and P Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
 [55] M Riesenhuber and T Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999.

 [56] MA Ranzato, FJ Huang, YL Boureau, and Y LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Proc. IEEE CVPR, 2007.
 [57] YL Boureau, J Ponce, and Y LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proc. ICML, 2010.
 [58] IJ Goodfellow, D WardeFarley, M Mirza, A Courville, and Y Bengio, “Maxout networks,” in Proc. ICML, 2013, pp. 1319–1327.
 [59] Y Miao, F Metze, and S Rawat, “Deep maxout networks for lowresource speech recognition,” in Proc. IEEE ASRU, 2013.
 [60] M Cai, Y Shi, and J Liu, “Deep maxout neural networks for speech recognition,” in Proc. IEEE ASRU, 2013, pp. 291–296.
 [61] P Swietojanski, J Li, and JT Huang, “Investigation of maxout networks for speech recognition,” in Proc. IEEE ICASSP, 2014.
 [62] S Renals and P Swietojanski, “Neural networks for distant speech recognition,” in Proc. HSCMA, 2014.
 [63] L Toth, “Convolutional deep maxout networks for phone recognition,” in Proc. ISCA Interspeech, 2014.
 [64] MD Zeiler and R Fergus, “Differentiable pooling for hierarchical feature learning,” CoRR, vol. abs/1207.0151, 2012.
 [65] P Sermanet, S Chintala, and Y LeCun, “Convolutional neural networks applied to house numbers digit classification,” CoRR, vol. abs/1204.3968, 2012.
 [66] TN Sainath, B Kingsbury, A Mohamed, GE Dahl, G Saon, H Soltau, T Beran, AY Aravkin, and B Ramabhadran, “Improvements to deep convolutional neural networks for LVCSR,” in Proc. IEEE ASRU, 2013, pp. 315–320.
 [67] X Zhang, J Trmal, D Povey, and S Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in Proc. IEEE ICASSP, 2014.
 [68] P Swietojanski, J Li, and S Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.

 [69] C Gülçehre, K Cho, R Pascanu, and Y Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Proc. ECML and KDD, 2014, pp. 530–546, Springer-Verlag.
 [70] DE Rumelhart, GE Hinton, and RJ Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, 1986.
 [71] M Cettolo, C Girardi, and M Federico, “WIT3: Web inventory of transcribed and translated talks,” in Proc. EAMT, 2012, pp. 261–268.
 [72] JJ Godfrey, EC Holliman, and J McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in Proc. IEEE ICASSP. IEEE, 1992, pp. 517–520.
 [73] J Carletta, “Unleashing the killer corpus: Experiences in creating the multi-everything AMI meeting corpus,” Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007.
 [74] S Renals, T Hain, and H Bourlard, “Recognition and understanding of meetings: The AMI and AMIDA projects,” in Proc. IEEE ASRU, Kyoto, December 2007.
 [75] GE Dahl, D Yu, L Deng, and A Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
 [76] P Bell, P Swietojanski, J Driesen, M Sinclair, F McInnes, and S Renals, “The UEDIN system for the IWSLT 2014 evaluation,” in Proc. IWSLT, 2014, pp. 26–33.
 [77] P Swietojanski, A Ghoshal, and S Renals, “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” in Proc. IEEE ASRU, 2013.
 [78] C Cieri, D Miller, and K Walker, “The Fisher corpus: A resource for the next generations of speech-to-text,” in Proc. LREC, 2004.
 [79] K Vesely, A Ghoshal, L Burget, and D Povey, “Sequence-discriminative training of deep neural networks,” in Proc. ISCA Interspeech, 2013, pp. 2345–2349.
 [80] D Povey, A Ghoshal, G Boulianne, L Burget, O Glembek, N Goel, M Hannemann, P Motlíček, Y Qian, P Schwarz, J Silovský, G Stemmer, and K Veselý, “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, December 2011.
 [81] S Renals, N Morgan, M Cohen, and H Franco, “Connectionist probability estimation in the DECIPHER speech recognition system,” in Proc. IEEE ICASSP, 1992.

 [82] N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
 [83] C Zhang and PC Woodland, “Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling,” in Proc. ISCA Interspeech, 2015, pp. 3224–3228.
 [84] T Anastasakos, J McDonough, R Schwartz, and J Makhoul, “A compact model for speaker-adaptive training,” in Proc. ICSLP, 1996, pp. 1137–1140.
 [85] MJF Gales, “Cluster adaptive training of hidden Markov models,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 417–428, 2000.
 [86] D Povey, Discriminative training for large vocabulary speech recognition, Ph.D. thesis, University of Cambridge, 2003.
 [87] B Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in Proc. IEEE ICASSP, 2009, pp. 3761–3764.