I Introduction and Summary
Deep neural network (DNN) acoustic models have significantly extended the state-of-the-art in speech recognition and are known to be able to learn significant invariances through many layers of non-linear transformations. If the training and deployment conditions of the acoustic model are mismatched, then the runtime data distribution can differ from the training distribution, degrading accuracy; this may be addressed through explicit adaptation to the test conditions [3, 4, 5, 6, 7, 2, 8, 9].
In this paper we explore the use of parametrised and differentiable pooling operators for acoustic adaptation. We introduce the approach of differentiable pooling using speaker-dependent pooling operators, specifically $L_p$-norm pooling and weighted Gaussian pooling (Section III), showing how the pooling parameters may be optimised by minimising the negative log probability of the class given the input data (Section IV), and providing a justification for the use of pooling operators in adaptation (Section V). To evaluate this novel adaptation approach we performed experiments on three corpora – TED talks, Switchboard conversational telephone speech, and AMI meetings – presenting results on using differentiable pooling for speaker independent acoustic modelling, followed by unsupervised speaker adaptation experiments in which adaptation of the pooling operators is compared (and combined) with learning hidden unit contributions (LHUC) [10, 11] and constrained/feature-space maximum likelihood linear regression (fMLLR).
II DNN Acoustic Modelling and Adaptation
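DNN acoustic models typically estimate the posterior distribution $P(s|\mathbf{o})$ over a set of context-dependent tied states $s$ of a hidden Markov model (HMM) given an acoustic observation $\mathbf{o}$ [14, 15, 1]. The DNN is implemented as a nested function comprising $L$ processing layers (non-linear transformations):

$$P(s|\mathbf{o};\theta) = f^{L}\left(f^{L-1}\left(\cdots f^{1}(\mathbf{o};\theta^{1})\cdots;\theta^{L-1}\right);\theta^{L}\right) \tag{1}$$

The model is thus parametrised by a set of weights $\theta = \{\theta^{l}\}_{l=1}^{L}$ in which the $l$th layer consists of a weight matrix and bias vector, $\theta^{l} = \{\mathbf{W}^{l}, \mathbf{b}^{l}\}$, followed by a non-linear transformation $\phi^{l}$, acting on arbitrary input $\mathbf{x}$:

$$f^{l}(\mathbf{x};\theta^{l}) = \phi^{l}\left(\mathbf{W}^{l\top}\mathbf{x} + \mathbf{b}^{l}\right) \tag{2}$$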
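To form a probability distribution, the output layer employs a softmax transformation, $\phi^{L}(x_i) = \exp(x_i)/\sum_{j}\exp(x_j)$, whereas the hidden layer activation functions are typically chosen to be either sigmoid, $\phi^{l}(x) = 1/(1+\exp(-x))$, or rectified linear, $\phi^{l}(x) = \max(0, x)$.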
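The nested-layer structure described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer sizes, the random initialisation and the names `dnn_forward`, `weights` and `biases` are assumptions made for the example, not details taken from the systems described in this paper.

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid hidden activation.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def dnn_forward(o, weights, biases):
    # Apply the hidden layers in turn, then a softmax output layer.
    h = o
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

# Toy dimensions (hypothetical): 39-dim input, two hidden layers, 10 states.
rng = np.random.default_rng(0)
dims = [39, 64, 64, 10]
weights = [0.1 * rng.standard_normal((m, n)) for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

# Posteriors over tied states for a batch of 4 observations.
posteriors = dnn_forward(rng.standard_normal((4, 39)), weights, biases)
```

Each row of `posteriors` is a distribution over the (toy) set of tied states for one input frame.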
Yu et al. experimentally demonstrated that the invariance of the internal representations with respect to variabilities in the input space increases with depth (the number of layers) and that the DNN can interpolate well around training samples but fails to extrapolate if the data mismatch increases. Therefore one often explicitly compensates for unseen variabilities in the acoustic space.
Feature-space normalisation increases the invariance to unseen data by transforming the data such that it better matches the training data. In this approach the DNN learns an additional transform of the input features conditioned on the speaker or the environment. The transform, which is typically affine, is parametrised by an additional set of adaptation parameters. The most effective form of feature-space normalisation is constrained (feature-space) maximum-likelihood linear regression (MLLR), referred to as fMLLR, in which the linear transform parameters are estimated by maximising the likelihood of the adaptation data under a Gaussian Mixture Model (GMM) / HMM acoustic model. To use fMLLR with a DNN acoustic model it is necessary to estimate a single input transform per speaker (using a trained GMM), using the resultant transformed data to train a DNN in a speaker adaptive training (SAT) manner. At runtime another set of fMLLR parameters is estimated for each speaker and the data transformed accordingly. This technique has consistently and significantly reduced the word error rate (WER) across several different benchmarks for both hybrid [14, 1] and tandem [18, 19] approaches. There are many successful examples of fMLLR adaptation of DNN acoustic models [20, 6, 21, 1, 22, 23, 24, 8, 25]. One can also estimate the linear transform as an input layer of the DNN, often referred to as a linear input network (LIN) [3, 26, 4, 6]. LIN-based approaches have been mostly used in test-only adaptation schemes, whereas fMLLR requires SAT, but usually results in lower WERs.
Auxiliary feature approaches augment the acoustic feature vectors with additional speaker-dependent information computed for each speaker at both training and runtime stages – this is a form of SAT in which the model learns the distribution over tied states conditioned on some additional speaker-specific information. There has been considerable recent work exploring the use of i-vectors for this purpose. I-vectors, which can be regarded as basis vectors spanning a subspace of speaker variability, were first used for adaptation in a GMM framework by Karafiat et al., and were later successfully employed for DNN adaptation [29, 30, 31, 32, 33, 34]. Other examples of auxiliary features include the use of speaker-specific bottleneck features obtained from a speaker separation DNN, the use of out-of-domain tandem features, as well as speaker codes [36, 37, 38] in which a specific set of units for each speaker is optimised. Kundu et al. present an approach using auxiliary input features derived from the bottleneck layer of a DNN which is combined with i-vectors.
Model-based approaches adapt the DNN parameters using data from the target speaker. Liao investigated this approach in both supervised and unsupervised settings using a few minutes of adaptation data. On a large DNN, when all weights were adapted, up to 5% relative improvement was observed for unsupervised adaptation, using a speaker independent decoding to obtain DNN targets. Yu et al. have explored the use of regularisation for adapting the weights of a DNN, using the Kullback-Leibler (KL) divergence between the output distributions produced by the speaker-independent and the speaker-adapted models. This approach was also recently used to adapt parameters of sequence-trained models. LIN may also be regarded as a form of model-based adaptation, and related approaches include adaptation using a linear output network (LON) or linear hidden network (LHN) [4, 7, 42].
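KL-divergence regularisation of this kind is often implemented by interpolating the adaptation targets with the speaker-independent posteriors, since minimising cross-entropy against the interpolated targets is equivalent (up to a constant) to adding a KL penalty towards the speaker-independent output. The following is a minimal sketch under that formulation; the function name and the interpolation weight `rho` are hypothetical and the toy values are not from any experiment reported here.

```python
import numpy as np

def kld_regularised_targets(one_hot, p_si, rho):
    # Interpolate hard alignment targets with speaker-independent posteriors.
    # rho = 0 recovers plain cross-entropy adaptation; rho = 1 keeps the
    # adapted model's outputs tied to the speaker-independent model.
    return (1.0 - rho) * one_hot + rho * p_si

one_hot = np.array([[0.0, 1.0, 0.0]])  # target tied state for one frame
p_si = np.array([[0.2, 0.5, 0.3]])     # SI model posterior for that frame
targets = kld_regularised_targets(one_hot, p_si, rho=0.5)
```

The interpolated `targets` remain a valid distribution and are used in place of the hard targets in the cross-entropy objective.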
Directly adapting all the weights of a large DNN is computationally and data intensive, and results in large speaker-dependent parameter sets. Smaller subsets of the DNN weights may be modified, including biases and slopes of hidden units [43, 7, 44, 34]. Another recently developed approach relies on learning hidden unit contributions (LHUC) for test-only adaptation [10, 11] as well as in a SAT framework. One can also adapt the top layer using Bayesian methods resulting in a maximum a posteriori (MAP) approach, or address the sparsity of context-dependent tied states when few adaptation data points are available by modelling both monophones and context-dependent tied states using multi-task adaptation [47, 48] or a hierarchical output layer.
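As an illustration of how small such a speaker-dependent parameter set can be, LHUC rescales each hidden unit output by a speaker-dependent amplitude. The sketch below assumes the amplitude parametrisation $2\,\mathrm{sigmoid}(r)$ used in the LHUC work [10, 11]; the function names and toy activations are made up for the example.

```python
import numpy as np

def lhuc_amplitude(r):
    # Amplitude constrained to the range (0, 2); r = 0 gives amplitude 1.
    return 2.0 / (1.0 + np.exp(-r))

def lhuc_layer(h, r):
    # Rescale each hidden unit output by its speaker-dependent amplitude.
    return lhuc_amplitude(r) * h

h = np.array([[0.2, 0.9, 0.5]])   # hidden activations for one frame (toy values)
r = np.zeros(3)                   # one adaptation parameter per hidden unit
adapted = lhuc_layer(h, r)        # unchanged while r is at its initial value of 0
```

Only `r` is updated during adaptation, so the number of speaker-dependent parameters equals the number of hidden units rather than the number of weights.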
III Differentiable Pooling
Building on our initial work, we present an approach to adaptation by learning hidden layer pooling operators with parameters that can be learned and adapted in a similar way to the other model parameters. The idea of feature pooling originates from Hubel and Wiesel’s pioneering study on visual cortex in cats, and was first used in computer vision to combine spatially local features. Pooling in DNNs involves the combination of a set of hidden unit outputs into a summary statistic. Fixed poolings are typically used, such as average pooling (used in the original formulation of convolutional neural networks – CNNs) [53, 54] and max pooling (used in the context of feature hierarchies and later applied to CNNs [56, 57]).
Reducing the dimensionality of hidden layers by pooling some subsets of hidden unit activations has become well investigated beyond computer vision, and the max operator has been interpreted as a way to learn piecewise linear activation functions – referred to as Maxout. Maxout has been widely investigated for both fully-connected [59, 60, 61] and convolutional [62, 63] DNN-based acoustic models. Max pooling, although differentiable, performs a one-from-$K$ selection, and hence does not allow hidden unit outputs to be interpolated, or their combination to be learned within a pool.
There have been a number of approaches to pooling with differentiable operators – differentiable pooling – a notion introduced by Zeiler and Fergus. $L_p$-norm pooling has been used with CNN models [57, 65], in which the sufficient statistic is the $p$-norm of the group of (spatially-related) hidden unit activations. Fixed order $L_p$-norm pooling was recently applied within the context of a convolutional neural network acoustic model, where it did not reduce the WER over max-pooling, and as an activation function in fully-connected DNNs, where it was found to improve over maxout and ReLU.
III-A $L_p$-norm (Diff-$L_p$) pooling
In this approach we pool a set of activations using an $L_p$-norm. A hidden unit pool is formed by a set of $K$ affine projections which form the input to the $k$th pooling unit, which we write as an ordered set (vector) $\mathbf{a}_k = [a_{k,1}, \ldots, a_{k,K}]$. The output of the $k$th pooling unit is produced as an $L_p$ norm:

$$f_{L_p}(\mathbf{a}_k; p_k) = \left(\frac{1}{K}\sum_{i=1}^{K} |a_{k,i}|^{p_k}\right)^{1/p_k} \tag{3}$$

where $p_k$ is the learnable norm order for the $k$th unit, that can be jointly optimised with the other parameters in the model. To ensure that (3) satisfies a triangle inequality ($p_k \geq 1$; a necessary property of the norm), during optimisation $p_k$ is re-parametrised as $p_k = \zeta(\rho_k) = \max(1, \rho_k)$, where $\rho_k$ is the actual learned parameter. For the case when $p_k = \infty$ we obtain the max-pooling operator:

$$\lim_{p_k \to \infty} f_{L_p}(\mathbf{a}_k; p_k) = \max_{i} |a_{k,i}| \tag{4}$$

Similarly, if $p_k = 1$ we obtain absolute average pooling (assuming the pool is normalised by $K$). We refer to this model as Diff-$L_p$, and it is parametrised by the weights and biases of the affine projections together with the per-unit orders $\rho_k$. Sermanet et al. investigated fixed-order $L_p$ pooling for image classification, and this was later applied to speaker independent acoustic modelling. Here we allow each $L_p$ unit in the model to have a learnable order $p_k$, and we use the pooling parameters to perform model-based test-only acoustic adaptation.
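The forward computation of such a pooling layer can be sketched as follows. This is an illustrative sketch only: it assumes the $L_p$-norm form of the pooling with the order clipped from below at 1 (as required for the triangle inequality), and the names `lp_pool`, `rho` and `normalise` are hypothetical.

```python
import numpy as np

def lp_pool(a, rho, normalise=False):
    """L_p-norm pooling with one learnable order per pooling unit.

    a:   (num_pools, K) array of affine projections, one row per pool.
    rho: (num_pools,) unconstrained order parameters; the effective
         order is p = max(1, rho).
    """
    p = np.maximum(1.0, np.asarray(rho, dtype=float))[:, None]
    s = np.sum(np.abs(a) ** p, axis=1, keepdims=True)
    if normalise:
        s = s / a.shape[1]  # optional division by the pool size K
    return (s ** (1.0 / p))[:, 0]

a = np.array([[3.0, -4.0]])
euclid = lp_pool(a, [2.0])                     # p = 2: Euclidean norm of the pool
abs_avg = lp_pool(a, [1.0], normalise=True)    # p = 1, normalised: mean of |a|
near_max = lp_pool(a, [200.0])                 # large p approaches max |a|
clipped = lp_pool(a, [0.5])                    # rho < 1 is clipped to p = 1
```

With the toy pool `[3, -4]`, the order interpolates between the (normalised) absolute average and the largest absolute activation.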
III-B Gaussian kernel (Diff-Gauss) pooling
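The second pooling approach estimates the pooling coefficients using a Gaussian kernel. We generate the pooling inputs at each layer as:

$$z_{k,i} = \eta_k \, \phi(a_{k,i}) \tag{5}$$

where $\phi$ is a non-linearity ($\tanh$ in this work) and $\mathbf{a}_k$ is a set of affine projections as before. A non-linearity is essential as otherwise (contrary to $L_p$ pooling) we would produce a linear transformation through a linear combination of linear projections. $\eta_k$ is the pool amplitude; this parameter is tied and learned per-pool as this was found to give similar results to per-unit amplitudes (but with fewer parameters), and better results compared to setting $\eta_k$ to a fixed value.

Given the activation (5), the pooling operation is defined as a weighted average over a set of hidden units, where the $k$-th pooling unit is expressed as:

$$f_{G}(\mathbf{z}_k; \mu_k, \beta_k) = \sum_{i=1}^{K} u_{k,i}\, z_{k,i} \tag{6}$$

The pooling contributions $u_{k,i}$ are normalised to sum to one within each pooling region (7) and each weight is coupled with the corresponding value of $z_{k,i}$ by a Gaussian kernel (8) (one per pooling unit) parameterised by the mean and precision, $\{\mu_k, \beta_k\}$:

$$u_{k,i} = \frac{\mathcal{N}(z_{k,i}; \mu_k, \beta_k)}{\sum_{j=1}^{K} \mathcal{N}(z_{k,j}; \mu_k, \beta_k)} \tag{7}$$

$$\mathcal{N}(z_{k,i}; \mu_k, \beta_k) = \exp\left(-\frac{\beta_k}{2}\,(z_{k,i} - \mu_k)^2\right) \tag{8}$$

Similar to $L_p$-norm pooling, this formulation allows a generalised pooling to be learned – from average ($\beta_k = 0$, where all weights in the pool are equal) to a near one-from-$K$ selection of the input closest to $\mu_k$ (large $\beta_k$) – separately for each pooling unit within a model. The Diff-Gauss model is thus parametrised by the weights and biases of the affine projections together with the per-pool $\mu_k$, $\beta_k$ and $\eta_k$.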
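A forward pass of this kind of pooling can be sketched as below. The sketch makes several assumptions that should be read as such: the non-linearity is taken to be `tanh`, the kernel is taken to be $\exp(-\tfrac{\beta}{2}(z-\mu)^2)$ with $\beta$ acting as a precision, and the function and argument names are hypothetical.

```python
import numpy as np

def gauss_pool(a, mu, beta, eta):
    """Gaussian-kernel weighted average over each pool.

    a:              (num_pools, K) affine projections.
    mu, beta, eta:  (num_pools,) kernel mean, kernel precision, pool amplitude.
    Returns the pooled outputs and the normalised pooling weights.
    """
    z = eta[:, None] * np.tanh(a)                     # pooling inputs
    log_g = -0.5 * beta[:, None] * (z - mu[:, None]) ** 2
    log_g -= log_g.max(axis=1, keepdims=True)         # stabilise the exponent
    g = np.exp(log_g)
    u = g / g.sum(axis=1, keepdims=True)              # weights sum to one per pool
    return np.sum(u * z, axis=1), u

a = np.array([[-1.0, 0.0, 2.0]])
ones = np.ones(1)

# Zero precision: uniform weights, i.e. average pooling.
avg, u_avg = gauss_pool(a, mu=np.zeros(1), beta=np.zeros(1), eta=ones)

# Large precision with the mean at the top of the tanh range: the weight
# concentrates on the largest input, approximating max pooling.
peaked, u_peaked = gauss_pool(a, mu=ones, beta=np.full(1, 1e4), eta=ones)
```

Moving `mu` and `beta` therefore shifts each unit smoothly between averaging and selecting, which is the property exploited for adaptation.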
IV Learning Differentiable Poolers
We optimise the acoustic model parameters by minimising the negative log probability of the target HMM tied state given the acoustic observations using gradient descent and error back-propagation; the pooling parameters may be updated in a speaker-dependent manner, to adapt the acoustic model to unseen data. In this section we give the necessary partial derivatives for Diff-$L_p$ and Diff-Gauss pooling.
IV-A Learning and adapting Diff-$L_p$ pooling
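In Diff-$L_p$ pooling we learn the orders $p_k$, which we express in terms of $\rho_k$, $p_k = \zeta(\rho_k)$. Error back-propagation requires the partial derivative of the pooling region $f_{L_p}(\mathbf{a}_k; \rho_k)$ with respect to $\rho_k$, which is given as:

$$\frac{\partial f_{L_p}(\mathbf{a}_k; \rho_k)}{\partial \rho_k} = \frac{f_{L_p}(\mathbf{a}_k; p_k)}{p_k}\left(\frac{\sum_{i} \log(|a_{k,i}|)\,|a_{k,i}|^{p_k}}{\sum_{i} |a_{k,i}|^{p_k}} - \log f_{L_p}(\mathbf{a}_k; p_k)\right)\frac{\partial \zeta(\rho_k)}{\partial \rho_k} \tag{9}$$

where $\partial \zeta(\rho_k)/\partial \rho_k = 1$ when $\rho_k > 1$ and 0 otherwise. The back-propagation through the norm itself is implemented as:

$$\frac{\partial f_{L_p}(\mathbf{a}_k; p_k)}{\partial a_{k,i}} = \frac{a_{k,i}\,|a_{k,i}|^{p_k - 2}}{\sum_{j} |a_{k,j}|^{p_k}}\, f_{L_p}(\mathbf{a}_k; p_k) \tag{10}$$

Writing $\odot$ for the element-wise Hadamard product, and $\mathbf{G}_k$ for a vector of $f_{L_p}(\mathbf{a}_k; p_k)$ activations repeated $K$ times, the resulting operation can be fully vectorised:

$$\frac{\partial f_{L_p}(\mathbf{a}_k; p_k)}{\partial \mathbf{a}_k} = \frac{\mathbf{a}_k \odot |\mathbf{a}_k|^{p_k - 2}}{\sum_{j} |a_{k,j}|^{p_k}} \odot \mathbf{G}_k$$

Normalisation by $K$ in (3) is optional (see also Section VII-A) and the partial derivatives in (9) and (10) hold for the un-normalised case also: the effect of this is taken into account in the forward activation $f_{L_p}(\mathbf{a}_k; p_k)$.

Since (9) and (10) are not continuous everywhere, they need to be stabilised when $\sum_{i} |a_{k,i}|^{p_k} = 0$. When computing the logarithm in the numerator of (9) it is also necessary to ensure that each $|a_{k,i}| > 0$. In practice, we threshold each element to have at least a small value $\epsilon$ if $|a_{k,i}| < \epsilon$. Note that this numerical stabilisation of $|a_{k,i}|$ only applies to Diff-$L_p$ units, not Diff-Gauss.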
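The derivatives of an $L_p$-norm pool, together with the thresholding of small activations, can be sketched and checked numerically as follows. The expressions are derived here from the un-normalised norm $f = (\sum_i |a_i|^p)^{1/p}$ and may differ in form from the equations in the original article; the threshold value `EPS` and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-8  # hypothetical stabilisation threshold for |a_i|

def lp_forward(a, p):
    # Un-normalised L_p norm of one pool.
    return np.sum(np.abs(a) ** p) ** (1.0 / p)

def lp_grads(a, p, eps=EPS):
    # Clamp magnitudes away from zero so that log() and division are safe.
    abs_a = np.maximum(np.abs(a), eps)
    powed = abs_a ** p
    s = np.sum(powed)
    f = s ** (1.0 / p)
    # d f / d p = (f / p) * (sum |a|^p log|a| / sum |a|^p - log f)
    df_dp = (f / p) * (np.sum(powed * np.log(abs_a)) / s - np.log(f))
    # d f / d a_i = f * sign(a_i) |a_i|^(p-1) / sum |a|^p
    df_da = f * np.sign(a) * abs_a ** (p - 1.0) / s
    return f, df_dp, df_da

def lp_grad_wrt_rho(a, rho, eps=EPS):
    # Chain rule through p = max(1, rho): zero gradient once the order is clipped.
    _, df_dp, _ = lp_grads(a, max(1.0, rho), eps)
    return df_dp if rho > 1.0 else 0.0

a = np.array([0.5, -1.5, 2.0])
p = 2.5
f, df_dp, df_da = lp_grads(a, p)

# Central finite differences as an independent check of the analytic forms.
h = 1e-6
num_dp = (lp_forward(a, p + h) - lp_forward(a, p - h)) / (2 * h)
num_da = np.array([
    (lp_forward(a + h * e, p) - lp_forward(a - h * e, p)) / (2 * h)
    for e in np.eye(len(a))
])
```

The finite-difference values `num_dp` and `num_da` should agree closely with `df_dp` and `df_da`, and an all-zero pool yields finite gradients because of the clamp.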
IV-B Learning and adapting Diff-Gauss pooling regions
To learn the Diff-Gauss pooling parameters , we require the partial derivatives and to update pooling parameters, as well as in order to back-propagate error signals to lower layers.
One can compute the partial derivative of (6) with respect to the input activations as:
where is the Jacobian representing the partial derivative :
whose elements can be computed as:
Likewise, represents the Jacobian of the kernel function in (8) with respect to :
and the elements of can be computed as:
Similarly, one can obtain the gradients with respect to the pooling parameters . In particular, for , the gradient is:
where = and is:
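For a single pool, the gradients of a Gaussian-kernel weighted average can also be written in a compact scalar form and verified by finite differences. The sketch below is a derivation made for illustration, assuming the kernel $g_i = \exp(-\tfrac{\beta}{2}(z_i-\mu)^2)$, weights $u_i = g_i/\sum_j g_j$ and output $f = \sum_i u_i z_i$; it is not the Jacobian notation used in the article, and the function names are hypothetical.

```python
import numpy as np

def gauss_pool_1d(z, mu, beta):
    # Weighted average of one pool with Gaussian-kernel weights.
    g = np.exp(-0.5 * beta * (z - mu) ** 2)
    u = g / g.sum()
    return np.sum(u * z), u

def gauss_pool_grads(z, mu, beta):
    f, u = gauss_pool_1d(z, mu, beta)
    d2 = (z - mu) ** 2
    # d f / d mu   = beta * (sum_i u_i z_i^2 - f^2)
    df_dmu = beta * (np.sum(u * z * z) - f * f)
    # d f / d beta = -1/2 * sum_i u_i z_i * (d2_i - sum_j u_j d2_j)
    df_dbeta = -0.5 * np.sum(u * z * (d2 - np.sum(u * d2)))
    # d f / d z_i  = u_i * (1 - beta * (z_i - mu) * (z_i - f))
    df_dz = u * (1.0 - beta * (z - mu) * (z - f))
    return df_dmu, df_dbeta, df_dz

z = np.array([-0.5, 0.1, 0.8])
mu, beta = 0.3, 2.0
df_dmu, df_dbeta, df_dz = gauss_pool_grads(z, mu, beta)

# Central finite differences for comparison.
h = 1e-6
num_dmu = (gauss_pool_1d(z, mu + h, beta)[0] - gauss_pool_1d(z, mu - h, beta)[0]) / (2 * h)
num_dbeta = (gauss_pool_1d(z, mu, beta + h)[0] - gauss_pool_1d(z, mu, beta - h)[0]) / (2 * h)
num_dz = np.array([
    (gauss_pool_1d(z + h * e, mu, beta)[0] - gauss_pool_1d(z - h * e, mu, beta)[0]) / (2 * h)
    for e in np.eye(len(z))
])
```

The gradient with respect to the mean is the precision times the weighted variance of the pool inputs, so it vanishes when the weight is concentrated on a single input.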
V Representational efficiency of pooling units
The aim of model-based DNN adaptation is to alter the learned speaker independent representation in order to improve the classification accuracy for data from a possibly mismatched test distribution. Owing to the highly distributed representations that are characteristic of DNNs, it is rarely clear which parameters should be adapted in order to generalise well to a new speaker or acoustic condition.
Pooling enables decision boundaries to be altered, through the selection of relevant hidden features, while keeping the parameters of the feature extractors (the hidden units) fixed: this is similar to LHUC adaptation. The pooling operators allow for a geometrical interpretation of the decision boundaries and how they will be affected by a constrained adaptation – the units within the pool are jointly optimised given the pooling parametrisation, and share some underlying relationship within the pool.
This is visualised for $L_p$ units in Fig. 2. Fig. 2 (a) illustrates the unit circles obtained by solving $f_{L_p}(\mathbf{a}; p) = 1$ for different orders $p$ and a pool of linear inputs $\mathbf{a}$. Such an $L_p$ unit is capable of closed-region decision boundaries, illustrated in Fig. 2 (b). The distance threshold is implicitly learned from data (through the parameters of $\mathbf{a}$ given $p$), resulting in an efficient representation [69, 67] compared with representing such boundaries using sigmoid units or ReLUs, which would require more parameters. Figs. 2 (c) and (d) show how those boundaries are affected when $p = 1$ (average pooling) and $p = \infty$ (max pooling), while keeping $\mathbf{a}$ fixed. As shown in Section VII we found that updating $p$ is an efficient and relatively low-dimensional way to adjust decision boundaries such that the model’s accuracy on the adaptation data distribution improves.
It is also possible to update the biases (Fig. 2 (e), red contours) and the LHUC amplitudes (Fig. 2 (f), red contours). We experimentally investigate how each approach impacts adaptation WER in Section VII-B. Although models implementing Diff-Gauss units are theoretically less efficient in terms of SI representations compared to $L_p$ units, and comparable to standard fully-connected models, the pooling mechanism still allows for more efficient (in terms of number of SD parameters) speaker adaptation.
VI Experimental setups
We have carried out experiments on three corpora: the TED talks corpus following the IWSLT evaluation protocol (www.iwslt.org); the Switchboard corpus of conversational telephone speech (ldc.upenn.edu); and the AMI meetings corpus [73, 74] (corpus.amiproject.org). Unless explicitly stated otherwise, our baseline models share similar structure across the tasks – DNNs with 6 hidden layers (2,048 units per layer) using a sigmoid non-linearity. The output softmax layer models the distribution of context-dependent clustered tied states. The features are presented in 11 ($\pm 5$) frame long context windows. All the adaptation experiments, if not stated otherwise, were performed unsupervised using adaptation targets obtained from first-pass speaker-independent decoding of the corresponding SI system.
TED: The training data consisted of 143 hours of speech (813 talks) and the systems follow our previously described recipe. However, compared to our previous work [8, 11, 50], our systems here make use of more accurate language models developed for our IWSLT–2014 systems: in particular, the final reported results use a 4-gram language model estimated from 751 million words. The baseline TED acoustic models were trained on unadapted PLP features with first and second order time derivatives. We present results on four IWSLT test sets: dev2010, tst2010, tst2011 and tst2013 containing 8, 11, 8, and 28 talks respectively.
AMI: We follow a Kaldi GMM recipe and use the individual headset microphone (IHM) recordings. On this corpus, we train the acoustic models using 40 mel-filter-bank (FBANK) features. We decode with a pruned 3-gram language model estimated from 800k words of AMI training transcripts interpolated with an LM trained on Fisher conversational telephone speech transcripts (1M words).
Switchboard (SWBD): We follow a Kaldi GMM recipe [79, 80], using Switchboard–1 Release 2 (LDC97S62). (To stay compatible with our previous adaptation work on Switchboard [45, 68] we are using the older set of Kaldi recipe scripts called s5b, and our baseline results are comparable with the corresponding baseline numbers previously reported. A newer set of improved scripts exists under s5c which, in comparison to s5b, offer about 1.5% absolute lower WER.) Our baseline unadapted acoustic models were trained on MFCC features, while the SAT trained fMLLR variants utilise the usual Kaldi feature preprocessing pipeline, which is MFCC+LDA/MLLT+fMLLR (MFCC: Mel-frequency cepstral coefficients; LDA: linear discriminant analysis; MLLT: maximum likelihood linear transform). The results are reported on the full Hub5’00 set (LDC2002S09) – eval2000. eval2000 contains two types of data: Switchboard – which is better matched to the training data; and CallHome (CHE) English. Our reported results use 3-gram LMs estimated from the Switchboard and Fisher Corpus transcripts.
VII-A Baseline speaker independent models
The structures of the differentiable pooling models were selected such that the number of parameters was comparable to the corresponding baseline DNN models. For the Diff-$L_2$ and Diff-$L_p$ types, the resulting models utilised non-overlapping pooling regions of size $K$, with 900 $L_p$-norm units per layer. The Diff-Gauss models had the pool size set to the value found to work best in our previous work, which (assuming non-overlapping regions) results in 1175 pooling units per layer.
Training speaker independent Diff-$L_2$ and Diff-$L_p$ models: For both Diff-$L_2$ and Diff-$L_p$ we trained with an initial learning rate of 0.008 (for MFCC, PLP, FBANK features) and 0.006 (for fMLLR features). The learning rate was adjusted using the newbob learning scheme based on the validation frame error rate. We found that applying explicit pool normalisation (dividing by $K$ in (3)) gives consistently higher error rates (typically an absolute increase of 0.3% WER): hence we used un-normalised $L_p$ units in all experiments. We did not apply post-layer normalisation. Instead, we used a max-norm approach: after each update we scaled the columns of the fully connected weight matrices such that their norms were below a given, fixed threshold. For Diff-$L_p$ models we initialised . Those parameters were optimised on TED and directly applied without further tuning for the other two corpora. In this work we have focussed on adaptation; Zhang et al. have reported further speaker independent experiments for fixed order $L_p$ units.
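The max-norm constraint described above can be sketched as a projection applied after each parameter update. The sketch assumes the Euclidean norm of each column and uses a made-up threshold, since the value used in the experiments is not reproduced here; `max_norm_columns` is a hypothetical helper name.

```python
import numpy as np

def max_norm_columns(W, max_norm):
    # Rescale any column whose L2 norm exceeds max_norm back onto the
    # norm ball; columns already inside the ball are left unchanged.
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])
W_proj = max_norm_columns(W, max_norm=1.0)  # threshold of 1.0 is illustrative only
```

In the toy matrix the first column (norm 5) is scaled down to unit norm while the second column, whose norm is already below the threshold, is untouched.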
Training speaker independent Diff-Gauss models: The initial learning rate was set to 0.08 (regardless of the feature type), again adjusted using newbob. Initial pooling parameters were sampled randomly from normal distributions. Otherwise, the hyper-parameters were the same as for the baseline DNN models.
Baseline speaker independent results: Table I gives speaker independent results for each of the considered model types. The Diff-Gauss and Diff-$L_2$/Diff-$L_p$ models have comparable WERs, with a small preference towards Diff-$L_p$ in terms of the final WER on TED and AMI; all have lower average WER than the baseline DNN. The gap between the pooled models increases on AMI data where Diff-$L_p$ has a substantially lower WER (3.2% relative) than the fixed order Diff-$L_2$, which in turn has a lower WER than the other two models (Diff-Gauss and baseline DNN) by 2.1% relative.
Fig. 3 gives more insight into the Diff-$L_p$ models by showing how the final distributions of the learned order $p$ differ across AMI, TED and SWBD corpora. The order deviates more from its initialisation in the lower layers of the model; there is also a difference across corpora. This follows the intuition of how a multi-layer network builds its representation: lower layers are more dependent on acoustic variabilities, normalising for such effects, and hence feature extractors may differ across datasets – in contrast to the upper layers which rely on features abstracted away from the acoustic data. For these corpora, the order rarely exceeded 3, sometimes dropping below 2 – especially for layer 1 with SWBD data. However, most units, especially in higher layers, tend to have . This corresponds to previous work in which fixed units tended to obtain lower WER. A similar analysis of Diff-Gauss pooling does not show large data-dependent differences in the learned pooling parameters.
Training speed: Table II shows the average training speeds for each of the considered models. Training pooling units is significantly more expensive than training baseline DNN models. This is to be expected as the pooling operations cannot be easily and fully vectorised. In our implementation, training the Diff-Gauss or Diff-$L_p$ models is about 40% slower than training a baseline DNN. Not optimising $p$ during training (9) decreases the gap to about 20% slower. This indicates that training using fixed-order units, and then adapting the order in a speaker adaptive manner, could make a good compromise.
VII-B Adaptation experiments
We initially used the TED talks corpus to investigate how WERs are affected by adapting different layers in the model. The results indicated that adapting only the bottom layer brings the largest drop in WER; however, adapting more layers further improves the accuracy for both Diff- and Diff-Gauss models (Fig. 4 (a)). Since obtaining the gradients for the pooling parameters at each layer is inexpensive compared to the overall back-propagation, and adapting bottom layer gives largest gains, in the remainder of this work we adapt all pooling units. Similar trends hold when pooling adaptation is combined with LHUC adaptation, which on tst2010 improves the accuracies by 0.2-0.3% absolute.
Fig. 4 (b) shows WER vs. the number of adaptation iterations. The results indicate that one adaptation iteration is sufficient and, more importantly, the model does not overfit when more iterations are used. This suggests that it is not necessary to regularise the model carefully (by Kullback-Leibler divergence, for instance), which is usually required when weights that directly transform the data are adapted. In the remainder, we adapt all models for three iterations with a learning rate optimised on dev2010.
Table III shows the effect of adapting different pooling parameters (including LHUC amplitudes) for $L_p$-norm units. Updating only $p$, rather than any other stand-alone pooling parameter, gives a lower WER than LHUC adaptation with the same number of parameters (cf. Fig. 2); however, updating both brings further reductions in WER. Adapting the bias is more data-dependent, with a substantial increase in WER for SWBD; this also significantly increases the number of adapted parameters. Hence we adapted either $p$ alone, or $p$ with LHUC, in the remaining experiments.
Table IV shows a similar analysis for the Diff-Gauss model. For Diff-Gauss, it is beneficial to update both $\mu$ and $\beta$, and LHUC was also found to be complementary. Notice that adapting with LHUC scalers is similar to altering $\eta$ in eq. (5) (assuming $\eta$ is tied per pool, as mentioned in Section III-B). As such, new parameters need not be introduced to adapt Diff-Gauss with LHUC, as is the case for Diff-$L_p$ units. In fact, the last two rows of Table IV show that jointly updating $\mu$, $\beta$ and $\eta$ gives lower WER than updating $\mu$, $\beta$ and applying LHUC after pooling (see Fig. 1).
Analysis of Diff-$L_p$: Fig. 5 shows how the distribution of $p$ changes after the Diff-$L_p$ model adapts to each of the 28 speakers of tst2013. We plot the speaker independent histograms as well as the contours of the mean bin frequencies for each layer. For the adapted models the distributions of $p$ become less dispersed, especially in higher layers, which can be interpreted as shrinking the decision regions of particular $L_p$ units (cf. Fig. 2). This follows the intuition that speaker adaptation involves reducing the variability that needs to be modelled, in contrast to the speaker independent model.
Taking into account the increased training time of Diff-$L_p$ models, one can also consider training fixed order Diff-$L_2$ models and then adapting $p$ using (9). The results in Fig. 5, as well as later results, cover this scenario. The adapted Diff-$L_2$ models display a similar trend in the distribution of $p$ to the Diff-$L_p$ models.
Analysis of Diff-Gauss: We performed a similar investigation on the learned Diff-Gauss pooling parameters (Fig. 6). In the bottom layers they are characterised by large negative means and positive precisions, which has the effect of turning off many units. After adaptation, some of them become more active, as shown by the shifted distributions of the adapted pooling parameters in Fig. 6. Adaptation with Diff-Gauss has a similar effect to the adaptation of slopes and amplitudes [44, 83], but adapts $K$ times fewer parameters.
Amount of adaptation data and quality of targets: We investigated the effect of the amount of adaptation data by randomly selecting adaptation utterances from tst2010 to give totals of 10s, 30s, 60s, 120s, 300s and more speaker-specific adaptation data per talker (Fig. 7 (a)). The WERs are an average over three independent runs, each sampling a different set of adaptation utterances (we did more passes in our previous work [11, 50], however, both LHUC and differentiable pooling operators were not sensitive to this aspect, resulting in small error bars between different results obtained with different random utterances). The Diff- models offer lower WER and more rapid adaptation, with 10s of adaptation data resulting in a decrease in WER by 0.6% absolute (3.6% relative) which further increases up to 2.1% absolute (14.4% relative) when using all the speaker’s data in an unsupervised manner. Diff-Gauss is comparable in terms of WER to a DNN adapted with LHUC. In addition, both methods are complementary to LHUC adaptation, and to feature-space adaptation with fMLLR (Tables VI and VII).
In order to demonstrate the modelling capacities of the different model-based adaptation techniques, we carried out a supervised adaptation (oracle) experiment in which the adaptation targets were obtained by aligning the audio data with reference transcripts (Fig. 7 (b)). We do not refine what the model knows about speech, nor the way it classifies it (the feature receptors and output layer are fixed during adaptation and remain speaker independent), but show that the re-composition and interpolation of these basis functions to approximate the unseen distribution of adaptation data is able to decrease the WER by 26.7% relative for the Diff- + LHUC scenario.
The methods are also not very sensitive to the quality of adaptation targets: they show very similar trends to LHUC, for which exact results for different qualities of adaptation targets (obtained by re-scoring adaptation hypotheses with different language models) have been reported previously.
Summary: Results for the proposed techniques are summarised in Tables V, VI, and VII for AMI, TED, and SWBD, respectively. The overall observed trends are as follows: (I) speaker independent pooling models return lower WERs than the baseline DNNs (Diff-Gauss, Diff-$L_2$, Diff-$L_p$, although the last two seem to be data-dependent); (II) the pooling models (Diff-Gauss, Diff-$L_2$ and Diff-$L_p$) are complementary to both fMLLR and LHUC adaptation – as expected, the final gain depends on the degree of data mismatch; (III) one can effectively train speaker independent Diff-$L_2$ models and later alter $p$ in a speaker dependent manner; (IV) the average relative improvements across all tasks with respect to baseline unadapted DNN models were 6.8% for Diff-Gauss, 9.1% for Diff-$L_2$ and 10.4% for Diff-$L_p$; and (V) when comparing LHUC adapted DNN to LHUC adapted differentiable pooling models, the relative reductions in WER for the pooling models were 2%, 3.4% and 4.8% for Diff-Gauss, Diff-$L_2$ and Diff-$L_p$, respectively.
VIII Discussion and Conclusions
We have proposed the use of differentiable pooling operators with DNN acoustic models to perform unsupervised speaker adaptation. Differentiable pooling operators offer a relatively low-dimensional set of parameters which may be adapted in a speaker-dependent fashion.
We investigated the complementarity of differentiable pooling adaptation with two other approaches – model-based LHUC adaptation and feature-space fMLLR adaptation. We have not performed an explicit comparison with an i-vector approach to adaptation. However, some recent papers have compared i-vector adaptation with LHUC and/or fMLLR on similar data, which enables us to make some indirect comparisons. For example, Samarakoon and Sim showed that speaker-adaptive training with i-vectors gives comparable results to test-only LHUC using TED data, and Miao et al. suggested that LHUC is better than a standard use of i-vectors (as in Saon et al.) on TED data, with a more sophisticated i-vector post-processing needed to equal LHUC. Since the proposed Diff-$L_p$ and Diff-Gauss techniques resulted in WERs that were at least as good as LHUC (and were found to be complementary to fMLLR) we conclude that the proposed pooling-based adaptation techniques are competitive.
In the future, one could investigate extending the proposed techniques to speaker adaptive training (SAT) [84, 85], for example in a similar spirit as proposed in the context of SAT-LHUC. In addition it would be interesting to investigate the suitability of adapting pooling regions in the framework of sequence discriminative training [86, 87, 79]. Our experience of LHUC in this framework, together with the observation that the pooling models are not prone to over-fitting in the case of small amounts of adaptation data, suggests that adaptation based on differentiable pooling is a promising technique for sequence trained models.
The NST research data collection may be accessed at http://datashare.is.ed.ac.uk/handle/10283/786. This research utilised a K40 GPGPU board donated by NVIDIA Corporation. The authors would like to thank the reviewers for insightful comments that helped to improve the manuscript.
-  G Hinton, L Deng, D Yu, GE Dahl, A Mohamed, N Jaitly, A Senior, V Vanhoucke, P Nguyen, TN Sainath, and B Kingsbury, “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, Nov 2012.
-  D Yu, M Seltzer, J Li, J-T Huang, and F Seide, “Feature learning in deep neural networks - studies on speech recognition,” in Proc. ICLR, 2013.
-  J Neto, L Almeida, M Hochberg, C Martins, L Nunes, S Renals, and T Robinson, “Speaker adaptation for hybrid HMM–ANN continuous speech recognition system,” in Proc. Eurospeech, 1995, pp. 2171–2174.
-  B Li and KC Sim, “Comparison of discriminative input and output transformations for speaker adaptation in the hybrid NN/HMM systems,” in Proc. Interspeech, 2010.
-  J Trmal, J Zelinka, and L Müller, “On speaker adaptive training of artificial neural networks,” in Proc. Interspeech, 2010.
-  F Seide, X Chen, and D Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in Proc. IEEE ASRU, 2011.
-  K Yao, D Yu, F Seide, H Su, L Deng, and Y Gong, “Adaptation of context-dependent deep neural networks for automatic speech recognition,” in Proc. IEEE SLT, 2012, pp. 366–369.
-  P Swietojanski, A Ghoshal, and S Renals, “Revisiting hybrid and GMM-HMM system combination techniques,” in Proc. IEEE ICASSP, 2013, pp. 6744–6748.
-  D Yu, K Yao, H Su, G Li, and F Seide, “KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition,” in Proc. IEEE ICASSP, 2013, pp. 7893–7897.
-  O Abdel-Hamid and H Jiang, “Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition,” in Proc. ISCA Interspeech, 2013, pp. 1248–1252.
-  P Swietojanski and S Renals, “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in Proc. IEEE SLT, 2014.
-  MJF Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, pp. 75–98, April 1998.
-  SJ Young and PC Woodland, “State clustering in hidden Markov model-based continuous speech recognition,” Computer Speech and Language, vol. 8, no. 4, pp. 369–383, 1994.
-  H Bourlard and N Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.
-  S Renals, N Morgan, H Bourlard, M Cohen, and H Franco, “Connectionist probability estimators in HMM speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 161–174, 1994.
-  JS Bridle, “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition,” in Neurocomputing, F Fogelman Soulié and J Hérault, Eds., pp. 227–236. Springer, 1990.
-  V Nair and G Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 131–136.
-  H Hermansky, DPW Ellis, and S Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in Proc. IEEE ICASSP, 2000, pp. 1635–1638.
-  F Grezl, M Karafiat, S Kontar, and J Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” in Proc. IEEE ICASSP, 2007, pp. IV–757–IV–760.
-  A Mohamed, TN Sainath, G Dahl, B Ramabhadran, GE Hinton, and MA Picheny, “Deep belief networks using discriminative features for phone recognition,” in Proc. IEEE ICASSP, 2011, pp. 5060–5063.
-  T Hain, L Burget, J Dines, PN Garner, F Grézl, A El Hannani, M Karafíat, M Lincoln, and V Wan, “Transcribing meetings with the AMIDA systems,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, pp. 486–498, 2012.
-  TN Sainath, B Kingsbury, and B Ramabhadran, “Auto-encoder bottleneck features using deep belief networks,” in Proc. IEEE ICASSP, 2012, pp. 4153–4156.
-  TN Sainath, A Mohamed, B Kingsbury, and B Ramabhadran, “Deep convolutional neural networks for LVCSR,” in Proc. IEEE ICASSP, 2013, pp. 8614–8618.
-  P Bell, P Swietojanski, and S Renals, “Multi-level adaptive networks in tandem and hybrid ASR systems,” in Proc. IEEE ICASSP, 2013.
-  T Yoshioka, A Ragni, and MJF Gales, “Investigation of unsupervised adaptation of DNN acoustic models with filter bank input,” in Proc. IEEE ICASSP, 2014, pp. 6344–6348.
-  V Abrash, H Franco, A Sankar, and M Cohen, “Connectionist speaker normalization and adaptation,” in Proc. Eurospeech, 1995, pp. 2183–2186.
-  N Dehak, PJ Kenny, R Dehak, P Dumouchel, and P Ouellet, “Front end factor analysis for speaker verification,” IEEE Trans Audio, Speech and Language Processing, vol. 19, pp. 788–798, 2010.
-  M Karafiat, L Burget, P Matejka, O Glembek, and J Cernocky, “I-vector-based discriminative adaptation for automatic speech recognition,” in Proc. IEEE ASRU, 2011.
-  G Saon, H Soltau, D Nahamoo, and M Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. IEEE ASRU, 2013, pp. 55–59.
-  A Senior and I Lopez-Moreno, “Improving DNN speaker independence with i-vector inputs,” in Proc. IEEE ICASSP, 2014, pp. 225–229.
-  V Gupta, P Kenny, P Ouellet, and T Stafylakis, “I-vector based speaker adaptation of deep neural networks for French broadcast audio transcription,” in Proc. IEEE ICASSP, 2014.
-  P Karanasou, Y Wang, MJF Gales, and PC Woodland, “Adaptation of deep neural network acoustic models using factorised i-vectors,” in Proc. ISCA Interspeech, 2014, pp. 2180–2184.
-  Y Miao, H Zhang, and F Metze, “Speaker adaptive training of deep neural network acoustic models using i-vectors,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 11, pp. 1938–1949, Nov 2015.
-  L Samarakoon and KC Sim, “On combining i-vectors and discriminative adaptation methods for unsupervised speaker normalization in DNN acoustic models,” in Proc. IEEE ICASSP, 2016, pp. 5275–5279.
-  Y Liu, P Zhang, and T Hain, “Using neural network front-ends on far field multiple microphones based speech recognition,” in Proc. IEEE ICASSP, 2014, pp. 5542–5546.
-  JS Bridle and S Cox, “Recnorm: Simultaneous normalisation and classification applied to speech recognition,” in Advances in Neural Information Processing Systems, 1990.
-  O Abdel-Hamid and H Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code,” in Proc. IEEE ICASSP, 2013, pp. 4277–4280.
-  S Xue, O Abdel-Hamid, J Hui, L Dai, and Q Liu, “Fast adaptation of deep neural network based on discriminant codes for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1713–1725, Dec 2014.
-  S Kundu, G Mantena, Y Qian, T Tan, M Delcroix, and KC Sim, “Joint acoustic factor learning for robust deep neural network based automatic speech recognition,” in Proc. IEEE ICASSP, March 2016, pp. 5025–5029.
-  H Liao, “Speaker adaptation of context dependent deep neural networks,” in Proc. IEEE ICASSP, 2013, pp. 7947–7951.
-  Y Huang and Y Gong, “Regularized sequence-level deep neural network model adaptation,” in Proc. ISCA Interspeech, 2015, pp. 1081–1085.
-  T Ochiai, S Matsuda, X Lu, C Hori, and S Katagiri, “Speaker adaptive training using deep neural networks,” in Proc. IEEE ICASSP, 2014, pp. 6349–6353.
-  SM Siniscalchi, J Li, and CH Lee, “Hermitian polynomial for speaker adaptation of connectionist speech recognition systems,” IEEE Trans Audio, Speech, and Language Processing, vol. 21, pp. 2152–2161, 2013.
-  Y Zhao, J Li, J Xue, and Y Gong, “Investigating online low-footprint speaker adaptation using generalized linear regression and click-through data,” in Proc. IEEE ICASSP, 2015, pp. 4310–4314.
-  P Swietojanski and S Renals, “SAT-LHUC: Speaker adaptive training for learning hidden unit contributions,” in Proc. IEEE ICASSP, 2016, pp. 5010–5014.
-  Z Huang, SM Siniscalchi, I-F Chen, J Wu, and C-H Lee, “Maximum a-posteriori adaptation of network parameters in deep models,” arXiv preprint arXiv:1503.02108, 2015.
-  Z Huang, J Li, SM Siniscalchi, I-F Chen, J Wu, and C-H Lee, “Rapid adaptation for deep neural networks through multi-task learning,” in Proc. ISCA Interspeech, 2015, pp. 3625–3629.
-  P Swietojanski, P Bell, and S Renals, “Structured output layer with auxiliary targets for context-dependent acoustic modelling,” in Proc. ISCA Interspeech, 2015, pp. 3605–3609.
-  R Price, K Iso, and K Shinoda, “Speaker adaptation of deep neural networks using a hierarchy of output layers,” in Proc. IEEE SLT, 2014.
-  P Swietojanski and S Renals, “Differentiable pooling for unsupervised speaker adaptation,” in Proc. IEEE ICASSP, 2015, pp. 4305–4309.
-  D Hubel and T Wiesel, “Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex,” Journal of Physiology, vol. 160, pp. 106–154, 1962.
-  K Fukushima and S Miyake, “Neocognitron: A new algorithm for pattern recognition tolerant of deformations,” Pattern Recognition, vol. 15, pp. 455–469, 1982.
-  Y LeCun, B Boser, JS Denker, D Henderson, RE Howard, W Hubbard, and LD Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, pp. 541–551, 1989.
-  Y LeCun, L Bottou, Y Bengio, and P Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
-  M Riesenhuber and T Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999.
-  MA Ranzato, FJ Huang, Y-L Boureau, and Y LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Proc. IEEE CVPR, 2007.
-  Y-L Boureau, J Ponce, and Y LeCun, “A theoretical analysis of feature pooling in visual recognition,” in Proc. ICML, 2010.
-  IJ Goodfellow, D Warde-Farley, M Mirza, A Courville, and Y Bengio, “Maxout networks,” in Proc. ICML, 2013, pp. 1319–1327.
-  Y Miao, F Metze, and S Rawat, “Deep maxout networks for low-resource speech recognition,” in Proc. IEEE ASRU, 2013.
-  M Cai, Y Shi, and J Liu, “Deep maxout neural networks for speech recognition,” in Proc. IEEE ASRU, 2013, pp. 291–296.
-  P Swietojanski, J Li, and J-T Huang, “Investigation of maxout networks for speech recognition,” in Proc. IEEE ICASSP, 2014.
-  S Renals and P Swietojanski, “Neural networks for distant speech recognition,” in Proc. HSCMA, 2014.
-  L Toth, “Convolutional deep maxout networks for phone recognition,” in Proc. ISCA Interspeech, 2014.
-  MD Zeiler and R Fergus, “Differentiable pooling for hierarchical feature learning,” CoRR, vol. abs/1207.0151, 2012.
-  P Sermanet, S Chintala, and Y LeCun, “Convolutional neural networks applied to house numbers digit classification,” CoRR, vol. abs/1204.3968, 2012.
-  TN Sainath, B Kingsbury, A Mohamed, GE Dahl, G Saon, H Soltau, T Beran, AY Aravkin, and B Ramabhadran, “Improvements to deep convolutional neural networks for LVCSR,” in Proc. IEEE ASRU, 2013, pp. 315–320.
-  X Zhang, J Trmal, D Povey, and S Khudanpur, “Improving deep neural network acoustic models using generalized maxout networks,” in Proc. IEEE ICASSP, 2014.
-  P Swietojanski, J Li, and S Renals, “Learning hidden unit contributions for unsupervised acoustic model adaptation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 8, pp. 1450–1463, 2016.
-  C Gülçehre, K Cho, R Pascanu, and Y Bengio, “Learned-norm pooling for deep feedforward and recurrent neural networks,” in Proc. ECML and KDD, 2014, pp. 530–546, Springer-Verlag.
-  DE Rumelhart, GE Hinton, and RJ Williams, “Learning internal representations by error-propagation,” in Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, 1986.
-  M Cettolo, C Girardi, and M Federico, “Wit: Web inventory of transcribed and translated talks,” in Proc. EAMT, 2012, pp. 261–268.
-  JJ Godfrey, EC Holliman, and J McDaniel, “SWITCHBOARD: Telephone speech corpus for research and development,” in Proc. IEEE ICASSP. IEEE, 1992, pp. 517–520.
-  J Carletta, “Unleashing the killer corpus: Experiences in creating the multi-everything AMI meeting corpus,” Language Resources and Evaluation, vol. 41, no. 2, pp. 181–190, 2007.
-  S Renals, T Hain, and H Bourlard, “Recognition and understanding of meetings: The AMI and AMIDA projects,” in Proc. IEEE ASRU, Kyoto, 12 2007, IDIAP-RR 07-46.
-  GE Dahl, D Yu, L Deng, and A Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
-  P Bell, P Swietojanski, J Driesen, M Sinclair, F McInnes, and S Renals, “The UEDIN system for the IWSLT 2014 evaluation,” in Proc. IWSLT, 2014, pp. 26–33.
-  P Swietojanski, A Ghoshal, and S Renals, “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” in Proc. IEEE ASRU, 2013.
-  C Cieri, D Miller, and K Walker, “The Fisher corpus: a resource for the next generations of speech-to-text,” in Proc. LREC, 2004.
-  K Vesely, A Ghoshal, L Burget, and D Povey, “Sequence-discriminative training of deep neural networks,” in Proc. ISCA Interspeech, 2013, pp. 2345–2349.
-  D Povey, A Ghoshal, G Boulianne, L Burget, O Glembek, N Goel, M Hannemann, P Motlíček, Y Qian, P Schwarz, J Silovský, G Stemmer, and K Veselý, “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, December 2011.
-  S Renals, N Morgan, M Cohen, and H Franco, “Connectionist probability estimation in the DECIPHER speech recognition system,” in Proc. IEEE ICASSP, 1992.
-  N Srivastava, G Hinton, A Krizhevsky, I Sutskever, and R Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
-  C Zhang and PC Woodland, “Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling,” in Proc. ISCA Interspeech, 2015, pp. 3224–3228.
-  T Anastasakos, J McDonough, R Schwartz, and J Makhoul, “A compact model for speaker-adaptive training,” in Proc. ICSLP, 1996, pp. 1137–1140.
-  MJF Gales, “Cluster adaptive training of hidden Markov models,” IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 417–428, 2000.
-  D Povey, Discriminative training for large vocabulary speech recognition, Ph.D. thesis, University of Cambridge, 2003.
-  B Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in Proc. IEEE ICASSP, 2009, pp. 3761–3764.