Speech recognition is a typical sequence-to-sequence transduction problem: given a sequence of acoustic observations, the speech recognition engine decodes the corresponding sequence of words or phonemes. A key component in a speech recognition system is the acoustic model, which computes the conditional probability of the output sequence given the input sequence. However, directly computing this conditional probability is challenging due to many factors, including the variable lengths of the input and output sequences. The hidden Markov model (HMM) converts this sequence-level classification task into a frame-level classification problem, where each acoustic frame is classified into one of the hidden states, and each output sequence corresponds to a sequence of hidden states. To make it computationally tractable, HMMs usually rely on the conditional independence assumption and the first-order Markov rule, which are the well-known weaknesses of HMMs. Furthermore, the HMM-based pipeline is composed of a few relatively independent modules, which makes joint optimisation nontrivial.
There has been a consistent research effort to seek architectures to replace HMMs and overcome their limitations for acoustic modelling, e.g., [2, 3, 4, 5]; however, these approaches have not yet improved speech recognition accuracy over HMMs. In the past few years, several neural network based approaches have been proposed and have demonstrated promising results. In particular, the connectionist temporal classification (CTC) [6, 7, 8, 9] approach defines the loss function directly to maximise the conditional probability of the output sequence given the input sequence, and it usually uses a recurrent neural network to extract features. However, CTC simplifies the sequence-level error function into a product of the frame-level error functions (i.e., an independence assumption), which means it essentially still does frame-level classification. It also requires the lengths of the input and output sequences to be the same, which is inappropriate for speech recognition. CTC deals with this problem by replicating the output labels so that consecutive frames may correspond to the same output label or a blank token.
A key difference between the attention-based approach and HMMs or CTC is that it does not apply the conditional independence assumption to the input sequence. Instead, it maps the variable-length input sequence into a fixed-size vector representation at each decoding step by an attention-based scheme (see for further explanation). It then generates the output sequence using an RNN conditioned on the vector representation from the source sequence. The attentive scheme suits the machine translation task well, because there may be no clear alignment between the source and target sequences for many language pairs. However, this approach does not naturally apply to the speech recognition task, as each output token only corresponds to a small window of the acoustic spectrum.
In this paper, we study segmental RNNs [14] for acoustic modelling. This model is similar to CTC and attention-based RNNs in the sense that an RNN encoder is also used for feature extraction, but it differs in that the sequence-level conditional probability is defined using a segmental (semi-Markov) CRF [15], which is an extension of the standard CRF [16]. There have been numerous works on CRFs and their variants for speech recognition, e.g., [4, 5, 17] (see [18] for an overview). In particular, feed-forward neural networks have been used with segmental CRFs for speech recognition [19, 20]. However, segmental RNNs are different in that they are end-to-end models: they do not depend on external systems to provide segmentation boundaries and features. Instead, they are trained by marginalising out all possible segmentations, while the features are derived from the encoder RNNs, which are trained jointly with the segmental CRFs. Our experiments were performed on the TIMIT dataset, and we achieved a 17.3% phone error rate (PER) from first-pass decoding with a zeroth-order CRF and without using any language model, which is the best reported result using CRFs.
2 Segmental Recurrent Neural Networks
2.1 Segmental Conditional Random Fields
Given a sequence of acoustic frames $X = \{x_1, \ldots, x_T\}$ and its corresponding sequence of output labels $y = \{y_1, \ldots, y_J\}$, where $T \geq J$, the segmental (or semi-Markov) conditional random field defines the sequence-level conditional probability with the auxiliary segment labels $E = \{e_1, \ldots, e_J\}$ as
$$P(y, E \mid X) = \frac{1}{Z(X)} \prod_{j=1}^{J} \exp f(y_j, e_j, X), \tag{1}$$
where $e_j = \langle s_j, n_j \rangle$ is a tuple of the beginning ($s_j$) and the end ($n_j$) time tag for the segment of $y_j$, and $n_j > s_j$ while $n_j, s_j \in [1, T]$; $y_j \in \mathcal{Y}$ and $\mathcal{Y}$ denotes the vocabulary set; $Z(X)$ is the normaliser that sums over all the possible $(y, E)$ pairs, i.e.,
$$Z(X) = \sum_{y, E} \prod_{j=1}^{J} \exp f(y_j, e_j, X).$$
Here, we only consider the zeroth-order CRF, while the extension to higher-order models is straightforward. Similar to other CRF-based models, the function $f(\cdot)$ is defined as
$$f(y_j, e_j, X) = w^\top \Phi(y_j, e_j, X),$$
where $\Phi(\cdot)$ denotes the feature function, and $w$ is the weight vector. Previous works on CRF-based acoustic models mainly use heuristically handcrafted feature functions. They also usually rely on an external system to provide the segment labels. In this paper, we define $\Phi(\cdot)$ using neural networks, and the segmentation is marginalised out during training, which makes our model self-contained.
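To make the definition in Eq. (1) concrete, the following is a minimal brute-force sketch for very short sequences. It is not the training procedure of this paper: it enumerates every labelled segmentation explicitly, which is only feasible for toy inputs. The names `segmentations`, `log_partition_brute_force` and `log_prob` are hypothetical, the log-potential is passed in as an arbitrary function `score(y, k, t)`, and we assume the convention that a segment `(k, t)` covers frames `k+1` to `t`.

```python
import math
from itertools import combinations, product


def segmentations(T, J):
    """Yield every split of frames 1..T into J contiguous segments.

    Each segment is a pair (k, t) that covers frames k+1..t (an assumed convention).
    """
    for cuts in combinations(range(1, T), J - 1):
        bounds = (0,) + cuts + (T,)
        yield list(zip(bounds[:-1], bounds[1:]))


def log_partition_brute_force(T, vocab, score):
    """log Z(X): log-sum over all label sequences and all segmentations."""
    terms = []
    for J in range(1, T + 1):
        for segs in segmentations(T, J):
            for labels in product(vocab, repeat=J):
                terms.append(sum(score(y, k, t) for y, (k, t) in zip(labels, segs)))
    m = max(terms)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in terms))


def log_prob(labels, segs, T, vocab, score):
    """log P(y, E | X) for one labelled segmentation, following Eq. (1)."""
    numerator = sum(score(y, k, t) for y, (k, t) in zip(labels, segs))
    return numerator - log_partition_brute_force(T, vocab, score)
```

Summing `exp(log_prob(...))` over all labelled segmentations of a short input gives 1, which is a quick sanity check that the normaliser covers exactly the $(y, E)$ pairs of the model.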
2.2 Feature Representations
We use neural networks to define the feature function $\Phi(\cdot)$, which maps the acoustic segment and its corresponding label into a joint feature space. More specifically, $y_j$ is first represented as a one-hot vector $v_j$, and it is then mapped into a continuous space by a linear embedding matrix $M$ as
$$u_j = M v_j.$$
Given the segment label $e_j$, we use an RNN to map the acoustic segment to a fixed-dimensional vector representation, i.e.,
$$\begin{aligned} h^j_1 &= r(h_0, x_{s_j}), \\ h^j_2 &= r(h^j_1, x_{s_j+1}), \\ &\;\;\vdots \\ h^j_{d_j} &= r(h^j_{d_j-1}, x_{n_j}), \end{aligned}$$
where $h_0$ denotes the initial hidden state, $d_j$ denotes the duration of the segment and $r(\cdot)$ is a non-linear function. We take the final hidden state $h^j_{d_j}$ as the segment embedding vector; $\Phi(\cdot)$ can then be represented as
$$\Phi(y_j, e_j, X) = g(u_j, h^j_{d_j}),$$
where $g(\cdot)$ corresponds to one layer or multiple layers of linear or non-linear transformation. It is also straightforward to include other relevant features as additional inputs to the function $g(\cdot)$, e.g., the duration feature, which can be obtained by converting $d_j$ into another embedding vector. In practice, multiple RNN layers can be used to transform the acoustic signal before extracting the segment embedding vector, as shown in Figure 1.
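The feature computation described above can be sketched in a few lines of plain Python. This is an illustration under several assumptions rather than the configuration used in our experiments: a single-layer tanh RNN stands in for the non-linear function of the segment encoder, the joint transformation is one tanh layer over the concatenated label and segment embeddings, all dimensions and the random initialisation are arbitrary, and the helper names (`rnn_step`, `segment_embedding`, `segment_score`, `make_params`) are hypothetical. A segment `(k, t)` is taken to cover the frames `X[k:t]`.

```python
import math
import random


def rnn_step(h, x, W_h, W_x):
    """One step of a plain tanh RNN (a stand-in for the segment encoder)."""
    return [
        math.tanh(sum(a * b for a, b in zip(W_h[i], h))
                  + sum(a * b for a, b in zip(W_x[i], x)))
        for i in range(len(W_h))
    ]


def segment_embedding(X, k, t, W_h, W_x):
    """Run the RNN over frames X[k:t] from a zero initial state; return the final state."""
    h = [0.0] * len(W_h)
    for x in X[k:t]:
        h = rnn_step(h, x, W_h, W_x)
    return h


def segment_score(label_id, X, k, t, params):
    """Segment log-potential: weight vector times a one-layer tanh joint feature."""
    u = params["M"][label_id]  # multiplying by a one-hot vector selects one embedding
    h = segment_embedding(X, k, t, params["W_h"], params["W_x"])
    joint = u + h              # list concatenation of label and segment embeddings
    phi = [math.tanh(sum(a * b for a, b in zip(row, joint))) for row in params["G"]]
    return sum(a * b for a, b in zip(params["w"], phi))


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_params(n_labels, frame_dim, label_dim=3, hidden_dim=4, feat_dim=5, seed=0):
    """Small random parameters; every dimension here is an illustrative choice."""
    rng = random.Random(seed)
    return {
        "M": random_matrix(rng, n_labels, label_dim),
        "W_h": random_matrix(rng, hidden_dim, hidden_dim),
        "W_x": random_matrix(rng, hidden_dim, frame_dim),
        "G": random_matrix(rng, feat_dim, label_dim + hidden_dim),
        "w": [rng.uniform(-0.5, 0.5) for _ in range(feat_dim)],
    }
```

For example, with `params = make_params(n_labels=3, frame_dim=2)` and a list `X` of two-dimensional frames, `segment_score(2, X, 1, 3, params)` returns the score of assigning label 2 to the segment made of the second and third frames.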
2.3 Conditional Maximum Likelihood Training
For speech recognition, the segmentation labels $E$ are usually unknown, so training the model by maximising the conditional probability in Eq. (1) is not practical. The problem can be addressed by defining the loss function as the negative marginal log-likelihood:
$$\mathcal{L}(\theta) = -\log P(y \mid X) = -\log \sum_{E} P(y, E \mid X) = -\log Z(X, y) + \log Z(X),$$
where $\theta$ denotes the set of model parameters, and $Z(X, y) = \sum_{E} \prod_{j=1}^{J} \exp f(y_j, e_j, X)$ denotes the summation over all the possible segmentations when only $y$ is observed. To simplify notation, the objective function is defined with only one training utterance.
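However, the number of possible segmentations is exponential in the length of $X$, which makes the naive computation of both $Z(X, y)$ and $Z(X)$ impractical. Fortunately, this can be addressed by using the following dynamic programming algorithm as proposed in [15]:
$$\alpha_0 = 1,$$
$$\alpha_t = \sum_{0 \leq k < t} \alpha_k \times \sum_{y \in \mathcal{Y}} \exp w^\top \Phi(y, \langle k, t \rangle, X), \tag{11}$$
$$Z(X) = \alpha_T.$$
In Eq. (11), the first summation is over all the possible segmentations up to timestep $t$, and the second summation is over all the possible labels from the vocabulary. The computational cost of this algorithm is $O(T^2 \cdot |\mathcal{Y}|)$, where $|\mathcal{Y}|$ is the size of the vocabulary. The cost can be further reduced by introducing an upper bound on the segment length, in which case Eq. (11) can be rewritten as
$$\alpha_t = \sum_{\max(0,\, t - L) \leq k < t} \alpha_k \times \sum_{y \in \mathcal{Y}} \exp w^\top \Phi(y, \langle k, t \rangle, X), \tag{13}$$
where $L$ denotes the maximum value of the segment length. The cost is then reduced to $O(T \cdot L \cdot |\mathcal{Y}|)$, and for long sequences like speech signals where $T \gg L$, the computational savings are substantial.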
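The forward recursion with a bounded segment length can be sketched as follows. This is a minimal illustration, not our implementation: it works in the log domain for numerical stability (an implementation choice not discussed in the text), the segment log-potential is supplied as an arbitrary function `score(y, k, t)`, and the names `logsumexp`, `log_partition` and `max_len` are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(T, vocab, score, max_len=None):
    """log Z(X) by the forward recursion; log_alpha[t] holds the log of alpha_t.

    score(y, k, t) is the log-potential of label y on the segment covering
    frames k+1..t; max_len is the optional upper bound on the segment length.
    """
    L = T if max_len is None else max_len
    log_alpha = [0.0] + [None] * T  # alpha_0 = 1, so its log is 0
    for t in range(1, T + 1):
        terms = []
        for k in range(max(0, t - L), t):
            # inner sum over all labels for the segment (k, t)
            seg = logsumexp([score(y, k, t) for y in vocab])
            terms.append(log_alpha[k] + seg)
        log_alpha[t] = logsumexp(terms)
    return log_alpha[T]
```

With a constant log-potential of zero and a vocabulary of size $V$, every labelled segmentation has weight one, so the partition function simply counts them and equals $V (1 + V)^{T-1}$; this gives a convenient closed form for checking the recursion on small inputs.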
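The term $Z(X, y)$ can be computed similarly. In this case, since the label sequence $y$ is now observed, the summation over all the possible labels in Eq. (11) is not necessary, i.e.,
$$\beta_{0,0} = 1,$$
$$\beta_{t,j} = \sum_{0 \leq k < t} \beta_{k,j-1} \times \exp w^\top \Phi(y_j, \langle k, t \rangle, X),$$
$$Z(X, y) = \beta_{T,J},$$
where $\beta_{t,j}$ sums over all the segmentations of the first $t$ frames into the first $j$ labels. Again, we can limit the length of the possible segments as in Eq. (13). Given $Z(X)$ and $Z(X, y)$, the loss function can be minimised using the stochastic gradient descent (SGD) algorithm, as in training other neural network models. Other losses, for example the hinge loss, can be considered in future work.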
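As an illustration of how the two terms combine into the training loss, the sketch below computes the label-constrained sum with a two-dimensional table indexed by frame position and label position, and subtracts it from the full partition function. It is a sketch under assumptions: the computation is done in the log domain, the log-potential is an arbitrary function `score(y, k, t)`, and the names `log_numerator`, `log_partition` and `segmental_loss` are hypothetical. In practice the gradient would be obtained by automatic differentiation through these recursions, which is not shown here.

```python
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(T, vocab, score, max_len=None):
    """log Z(X): sum over all labels and all segmentations."""
    L = T if max_len is None else max_len
    log_alpha = [0.0] + [None] * T
    for t in range(1, T + 1):
        log_alpha[t] = logsumexp([
            log_alpha[k] + logsumexp([score(y, k, t) for y in vocab])
            for k in range(max(0, t - L), t)
        ])
    return log_alpha[T]


def log_numerator(T, labels, score, max_len=None):
    """log Z(X, y): sum over segmentations only, with the label sequence fixed."""
    L = T if max_len is None else max_len
    J = len(labels)
    NEG_INF = -math.inf
    # log_beta[t][j]: first t frames segmented into the first j labels
    log_beta = [[NEG_INF] * (J + 1) for _ in range(T + 1)]
    log_beta[0][0] = 0.0
    for t in range(1, T + 1):
        for j in range(1, min(t, J) + 1):
            terms = [
                log_beta[k][j - 1] + score(labels[j - 1], k, t)
                for k in range(max(0, t - L), t)
                if log_beta[k][j - 1] > NEG_INF  # skip unreachable states
            ]
            if terms:
                log_beta[t][j] = logsumexp(terms)
    return log_beta[T][J]


def segmental_loss(T, labels, vocab, score, max_len=None):
    """Negative marginal log-likelihood: log Z(X) - log Z(X, y)."""
    return log_partition(T, vocab, score, max_len) - log_numerator(T, labels, score, max_len)
```

Because the constrained sum contains a subset of the terms in the full partition function, the loss is non-negative, and summing the constrained term over every possible label sequence recovers the full partition function.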
2.4 Viterbi Decoding
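During decoding, we need to search for the target label sequence $y$ that yields the highest posterior probability given $X$ by marginalising out all the possible segmentations:
$$y^* = \arg\max_{y} \log \sum_{E} P(y, E \mid X).$$
This involves a minor modification of the recursive algorithm in Eq. (11): instead of summing over all the possible labels, the Viterbi path up to timestep $t$ is computed as
$$\alpha^*_t = \sum_{0 \leq k < t} \alpha^*_k \times \max_{y \in \mathcal{Y}} \exp w^\top \Phi(y, \langle k, t \rangle, X).$$
However, marginalising out all the possible segmentations is still expensive. The computational cost can be further reduced by greedily searching for the most likely segmentation, i.e.,
$$\alpha^*_t = \max_{0 \leq k < t} \alpha^*_k \times \max_{y \in \mathcal{Y}} \exp w^\top \Phi(y, \langle k, t \rangle, X),$$
which corresponds to the decoding objective
$$y^*, E^* = \arg\max_{y, E} \log P(y, E \mid X).$$
This joint maximisation algorithm may yield a high search error, because it only considers one segmentation. In the future, we shall investigate the beam search algorithm, which may yield a lower search error.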
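The joint maximisation over labels and segmentations can be sketched with back-pointers as follows. This is a minimal illustration under assumptions, not the decoder used in our experiments: the log-potential is an arbitrary function `score(y, k, t)`, maximisation is carried out in the log domain, and the names `viterbi_decode` and `max_len` are hypothetical.

```python
import math


def viterbi_decode(T, vocab, score, max_len=None):
    """Jointly maximise over labels and segmentations.

    Returns (labels, segments, best_log_score), where each segment (k, t)
    covers frames k+1..t and score(y, k, t) is its log-potential for label y.
    """
    L = T if max_len is None else max_len
    best = [0.0] + [-math.inf] * T  # best[t]: best log-score of any path ending at t
    back = [None] * (T + 1)         # back[t]: (segment start, label) of the last segment
    for t in range(1, T + 1):
        for k in range(max(0, t - L), t):
            # best label for the segment (k, t)
            y_best = max(vocab, key=lambda y: score(y, k, t))
            candidate = best[k] + score(y_best, k, t)
            if candidate > best[t]:
                best[t] = candidate
                back[t] = (k, y_best)
    # follow the back-pointers from the last frame to the first
    labels, segments = [], []
    t = T
    while t > 0:
        k, y = back[t]
        labels.append(y)
        segments.append((k, t))
        t = k
    return labels[::-1], segments[::-1], best[T]
```

As a usage example with a hypothetical toy potential that rewards frames matching the label and charges a fixed cost per segment, decoding the frame string `"aabbb"` yields the labels `["a", "b"]` with the boundary after the second frame.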
2.5 Further Speedup
It is computationally expensive for RNNs to model long sequences, and the number of possible segmentations is exponential in the length of the input sequence, as mentioned before. The computational cost can be significantly reduced by using the hierarchical subsampling RNN [21] to shorten the input sequences, where the subsampling layer takes a window of hidden states from the lower layer as input, as shown in Figure 2. In this work, we consider three variants: a) concatenate: the hidden states in the subsampling window are concatenated before being fed into the next layer; b) add: the hidden states are added into one vector for the next layer; c) skip: only the last hidden state in the window is kept and all the others are skipped. The last two schemes are computationally cheaper as they do not introduce extra model parameters.
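The three subsampling variants can be illustrated on plain Python lists. This sketch only shows how a window of hidden states is reduced to one input for the next layer; the recurrent layers themselves are omitted. The function name `subsample` and the choice to drop trailing states that do not fill a complete window are assumptions made for the illustration.

```python
def subsample(hidden_states, window=2, mode="skip"):
    """Reduce each non-overlapping window of hidden states to a single vector.

    hidden_states is a list of equal-length lists of floats, one per timestep.
    Trailing states that do not fill a whole window are dropped (an assumption).
    """
    outputs = []
    for start in range(0, len(hidden_states) - window + 1, window):
        chunk = hidden_states[start:start + window]
        if mode == "concatenate":
            # input to the next layer grows to window * hidden_dim
            outputs.append([value for state in chunk for value in state])
        elif mode == "add":
            # element-wise sum keeps the dimension unchanged
            outputs.append([sum(values) for values in zip(*chunk)])
        elif mode == "skip":
            # keep only the last hidden state of the window
            outputs.append(list(chunk[-1]))
        else:
            raise ValueError("unknown mode: " + mode)
    return outputs
```

With a window of 2, each call halves the sequence length, so stacking two such layers reduces the time resolution by a factor of 4. Only the concatenate variant changes the input dimension of the next layer, which is why it is the only one that adds model parameters.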
3 Experiments
3.1 System Setup
We used the TIMIT dataset to evaluate the segmental RNN acoustic models. This dataset was preferred for the rapid evaluation of different system settings, and for the comparison to other CRF and end-to-end systems. We followed the standard protocol of the TIMIT dataset, and our experiments were based on the Kaldi recipe [22]. We used the core test set as our evaluation set, which has 192 utterances. We used 24-dimensional log filterbanks (FBANKs) with delta and double-delta coefficients, yielding 72-dimensional feature vectors. Our models were trained with 48 phonemes, and their predictions were converted to 39 phonemes before scoring. The dimension of $u_j$ was fixed to be 32. For all our experiments, we used long short-term memory (LSTM) networks [23] as the implementation of RNNs, and the networks were always bi-directional. We set the initial SGD learning rate to 0.1, and we decayed the learning rate by a factor of 2 whenever the validation error stopped decreasing. Our models were trained with dropout regularisation [24], using a specific implementation for recurrent networks [25]. The dropout rate was 0.2 unless specified otherwise. Our models were randomly initialised with the same random seed.
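The learning rate schedule described above can be sketched as follows. The exact stopping criterion is not specified in the text, so this sketch assumes that the rate is halved whenever the validation error of an epoch fails to improve on the best value seen so far; the function name `decayed_learning_rates` and its parameters are hypothetical.

```python
import math


def decayed_learning_rates(validation_errors, initial_lr=0.1, factor=2.0):
    """Return the learning rate in effect after each validation measurement.

    The rate is divided by `factor` whenever the validation error does not
    improve on the best error seen so far (an assumed criterion).
    """
    lr = initial_lr
    best = math.inf
    rates = []
    for error in validation_errors:
        if error < best:
            best = error   # still improving: keep the current rate
        else:
            lr /= factor   # no improvement: decay the rate
        rates.append(lr)
    return rates
```

For instance, for the hypothetical validation errors 30, 25, 26, 24 and 24, the rate stays at 0.1 for the first two epochs, is halved to 0.05 after the third, and is halved again to 0.025 after the fifth.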
3.2 Results of Hierarchical Subsampling
We first present the results of the hierarchical subsampling recurrent network, which is the key to speeding up our experiments. We set the size of the subsampling window to 2, so each subsampling layer reduced the time resolution by a factor of 2. We set the maximum segment length $L$ in Eq. (13) to 300 milliseconds, which corresponded to 30 frames of FBANKs (one frame every 10 milliseconds). With two layers of subsampling recurrent networks, the time resolution was reduced by a factor of 4, and the value of $L$ was reduced to 8, yielding around 10 times speedup as shown in Table 1.
Table 2 compares the three implementations of the recurrent subsampling network detailed in Section 2.5. We observed that concatenating all the hidden states in the subsampling window did not yield a lower phone error rate (PER) than the simple skipping approach, which may be because the TIMIT dataset is small and favours a smaller model. On the other hand, adding the hidden states in the subsampling window together worked even worse, possibly because the sequential information in the subsampling window was flattened. In the following experiments, we kept the skipping method and used two subsampling layers.
Table 3: Results of tuning the hyperparameters.
3.3 Hyperparameters and Different Features
We then evaluated the model by tuning the hyperparameters, and the results are given in Table 3. We tuned the number of LSTM layers and the dimension of the LSTM cells, as well as the dimensions of $w$ and the segment vector $h^j_{d_j}$. In general, larger models with dropout regularisation yielded higher recognition accuracy. Our best result was obtained using 6 layers of 250-dimensional LSTMs. However, without dropout regularisation, the model easily overfits due to the small size of the training set. In the future, we shall evaluate this model on a larger dataset.
We then evaluated two other types of features using the same system configuration that achieved the best result in Table 3. We increased the number of FBANKs from 24 to 40, which yielded a slightly lower PER. We also evaluated the standard Kaldi features: 39-dimensional MFCCs spliced by a context window of 7, followed by LDA and MLLT transforms and feature-space speaker-dependent MLLR, which were the same features used in the HMM-DNN baseline in Table 5. These well-engineered features improved the accuracy of our system by more than 1% absolute.
| System | | PER (%) |
|---|---|---|
| First-pass SCRF | | 33.1 |
| Boundary-factored SCRF | | 26.5 |
| Deep Segmental NN | | 21.9 |
| Discriminative segmental cascade | | 21.7 |
| + 2nd pass with various features | | 19.9 |
| RNN transducer | – | 17.7 |
| Attention-based RNN | – | 17.6 |
3.4 Comparison to Related Works
In Table 5, we compare our result to other reported results using segmental CRFs as well as recent end-to-end systems. The previous state-of-the-art result using segmental CRFs on the TIMIT dataset is reported in [28], where first-pass decoding was used to prune the search space, and a second pass was used to re-score the hypotheses using various features, including neural network features. Besides, the ground-truth segmentation was used in . We achieved considerably lower PER with first-pass decoding, despite the fact that our CRF was zeroth-order and we did not use any language model. Furthermore, our results are also comparable to those from the CTC and attention-based RNN end-to-end systems. The accuracy of segmental RNNs may be further improved by using higher-order CRFs, by incorporating a language model into the decoding step, and by using beam search to reduce the search error.
4 Conclusions
In this paper, we present the segmental RNN, a novel acoustic model that combines the segmental CRF with an encoder RNN for end-to-end speech recognition. We discuss the practical training and decoding algorithms of this model for speech recognition, and the subsampling network used to reduce the computational cost. Our experiments were performed on the TIMIT dataset, and we achieved strong recognition accuracy using a zeroth-order CRF and without using any language model. In the future, we shall investigate discriminative training criteria and the incorporation of a language model into the decoding step. Future work also includes implementing a weighted finite state transducer (WFST) based decoder and scaling this model to large vocabulary datasets.
References
[1] D. Gillick, L. Gillick, and S. Wegmann, “Don’t multiply lightly: Quantifying problems with the acoustic model assumptions in speech recognition,” in Proc. ASRU. IEEE, 2011, pp. 71–76.
[2] M. Ostendorf, V. Digalakis, and O. Kimball, “From HMM’s to segment models: A unified view of stochastic modeling for speech recognition,” IEEE Transactions on Speech and Audio Processing, pp. 360–378, 1996.
[3] N. Smith and M. Gales, “Speech recognition using SVMs,” in Advances in Neural Information Processing Systems, 2001, pp. 1197–1204.
[4] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, “Hidden conditional random fields for phone classification,” in Proc. INTERSPEECH, 2005, pp. 1117–1120.
[5] Y. Hifny and S. Renals, “Speech recognition using augmented conditional random fields,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 354–365, 2009.
[6] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. ICML, 2014, pp. 1764–1772.
[7] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger et al., “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[8] H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” in Proc. INTERSPEECH, 2015.
[9] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in Proc. ASRU, 2015.
[10] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. ICLR, 2015.
[11] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems, 2015, pp. 577–585.
[12] L. Lu, X. Zhang, K. Cho, and S. Renals, “A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition,” in Proc. INTERSPEECH, 2015.
[13] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
[14] L. Kong, C. Dyer, and N. A. Smith, “Segmental recurrent neural networks,” arXiv preprint arXiv:1511.06018, 2015.
[15] S. Sarawagi and W. W. Cohen, “Semi-Markov conditional random fields for information extraction,” in Advances in Neural Information Processing Systems, 2004, pp. 1185–1192.
[16] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. ICML, 2001, pp. 282–289.
[17] G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark et al., “Speech recognition with segmental conditional random fields: A summary of the JHU CLSP 2010 summer workshop,” in Proc. ICASSP. IEEE, 2011, pp. 5044–5047.
[18] E. Fosler-Lussier, Y. He, P. Jyothi, and R. Prabhavalkar, “Conditional random fields in speech, audio, and language processing,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1054–1075, 2013.
[19] O. Abdel-Hamid, L. Deng, D. Yu, and H. Jiang, “Deep segmental neural networks for speech recognition,” in Proc. INTERSPEECH, 2013, pp. 1849–1853.
[20] Y. He and E. Fosler-Lussier, “Segmental conditional random fields with deep neural networks as acoustic models for first-pass word recognition,” in Proc. INTERSPEECH, 2015.
[21] A. Graves, “Hierarchical subsampling networks,” in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 109–131.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[25] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[26] G. Zweig, “Classification and recognition with direct segment models,” in Proc. ICASSP. IEEE, 2012, pp. 4161–4164.
[27] Y. He and E. Fosler-Lussier, “Efficient segmental conditional random fields for phone recognition,” in Proc. INTERSPEECH, 2012, pp. 1898–1901.
[28] H. Tang, W. Wang, K. Gimpel, and K. Livescu, “Discriminative segmental cascades for feature-rich phone recognition,” in Proc. ASRU, 2015.
[29] A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. ICASSP. IEEE, 2013, pp. 6645–6649.