1 Introduction
Making full use of future contextual information is typically beneficial for acoustic modeling. In the literature, there are a variety of methods to realize this idea for different model architectures. For feedforward neural networks (FFNN), this context is usually provided by splicing a fixed set of future frames in the input representation [1]. There are also approaches that modify the FFNN model structure. The authors in [2, 3] proposed a model called feedforward sequential memory networks (FSMN), which is a standard FFNN equipped with learnable memory blocks in the hidden layers to encode long context information into a fixed-size representation. The time delay neural network (TDNN) [4, 5] is another FFNN architecture which has been shown to be effective in modeling long-range dependencies through temporal convolution over context.
For unidirectional recurrent neural networks (RNN), future context is usually exploited through a delayed prediction of the output labels [6]. However, this method provides only limited modeling power of future context, as shown in [7]. For bidirectional RNNs, it is accomplished by processing the data in the backward direction using a separate RNN layer [8, 9, 10]. Although the bidirectional versions have been shown to outperform the unidirectional ones by a large margin [11, 12], the latency of bidirectional models is significantly larger, making them unsuitable for online speech recognition. To overcome this limitation, chunk-based training and decoding schemes such as context-sensitive-chunk (CSC) [13, 14] and latency-controlled (LC) BLSTM [11, 15] have been investigated. However, the model latency is still quite high, since in all these online variants inference is restricted to chunk-level increments to amortize the computation cost of the backward RNN. For example, the decoding latency of LC-BLSTM in [15] is about 600 ms, which is the sum of the chunk size and the future context frames. To overcome the shortcomings of the chunk-based methods, Peddinti et al. [7] proposed the use of temporal convolution, in the form of TDNN layers, for modeling the future temporal context while affording inference with frame-level increments. The proposed model is called TDNN-LSTM, and is designed by interleaving temporal convolution (TDNN layers) with unidirectional long short-term memory (LSTM) [16, 17, 18, 19] layers. This model was shown to outperform bidirectional LSTM in two automatic speech recognition (ASR) tasks, while enabling online decoding with a maximum latency of 200 ms [7].
However, TDNN-LSTM’s ability to model the future context comes from the TDNN part, whereas the LSTM itself is incapable of utilizing the future information effectively. In this paper, we attempt to design an RNN acoustic model that can model the future context effectively and directly, without depending on extra layers such as TDNN layers. In addition, the model latency and computation cost should be as low as possible.
With this purpose, we choose the minimal gated recurrent unit (mGRU) [20] as our base RNN model in this work. mGRU is a revised version of GRU [21, 22] and contains only one multiplicative gate, making its computational cost much smaller than that of GRU and vanilla LSTM [19]. Based on mGRU, we propose to insert a linear input projection layer into mGRU, obtaining a model called mGRUIP. The inserted linear projection layer compresses the input vector and hidden state vector simultaneously. Since the size of this layer is much smaller than the cell number, mGRUIP contains far fewer parameters than mGRU. In addition, the input projection layer has two other advantages. The first is that inserting this layer is beneficial to ASR performance. Our experiments on a 309-hour Switchboard task show that mGRUIP outperforms mGRU significantly. This finding is consistent with that for LSTM with an input projection layer (LSTMIP) [23].
The second (and most important) advantage is that this input projection forms a bottleneck in the recurrent layer, making it possible to design a module on it that utilizes the future context information effectively without significantly increasing the model size. In this work, we design two kinds of context modules specifically for mGRUIP, making it capable of modeling future temporal context effectively and directly. The first module is referred to as temporal encoding, in which one mGRUIP layer is equipped with a context block to encode the future context information into a fixed-size representation, similar to FSMN. Temporal encoding is performed at the input projection layer, so the increase in computation cost is quite small. The second module borrows the idea from TDNN, and is called temporal convolution as the transforms in it are tied across time steps. In temporal convolution, future context information from several frames is spliced together and compressed by the input projection layer. Thanks to the small dimensionality of the projection, temporal convolution adds only a limited number of parameters. In this work, these two context modules are shown to be quite effective on two ASR tasks, while maintaining low-latency (170 ms) online decoding. It is shown that, compared with LSTM and mGRU, mGRUIP with temporal convolution provides more than 13% relative WER reduction on the full Switchboard Hub5’00 test set, while on our 1400-hour internal Mandarin ASR task the relative gain is 13% to 24% for different test sets. What’s more, the proposed model outperforms TDNN-LSTM with smaller decoding latency and roughly half the parameters.
This paper is organized as follows. Section 2 describes the model architecture of GRU and its variants, including the proposed mGRUIP and the two context modules. The related work is introduced in Section 3. We report our experimental results on two ASR tasks in Section 4 and conclude this work in Section 5.
2 Model Architectures
In this section, we will first make a brief introduction to the model structure of GRU and mGRU. Then the proposed mGRUIP and two context modules will be introduced in detail.
2.1 GRU
The GRU model is defined by the following equations (the layer index has been omitted for simplicity):
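$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \tag{1}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \tag{2}$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (h_{t-1} \odot r_t) + b_h) \tag{3}$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \tag{4}$$
In particular, $z_t$ and $r_t$ are vectors corresponding to the update and reset gates respectively, and $\odot$ denotes element-wise multiplication. The activations of both gates are element-wise logistic sigmoid functions $\sigma(\cdot)$, constraining the values of $z_t$ and $r_t$ to the range from 0 to 1. $h_t$ represents the output state vector for the current time frame $t$, while $\tilde{h}_t$ is the candidate state obtained with a hyperbolic tangent. The network is fed by the current input vector $x_t$ (speech features or the output vector of the previous layer), and the parameters of the model are $W_z$, $W_r$, $W_h$ (the feedforward connections), $U_z$, $U_r$, $U_h$ (the recurrent weights), and the bias vectors $b_z$, $b_r$, $b_h$.
2.2 mGRU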
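mGRU, short for minimal GRU, is a revised version of the GRU described above. It was proposed in [20] and contains two modifications: removing the reset gate and replacing the hyperbolic tangent function with the ReLU activation. This leads to the following update equations:
$$z_t = \sigma(\mathrm{BN}(W_z x_t) + U_z h_{t-1}) \tag{5}$$
$$\tilde{h}_t = \mathrm{ReLU}(\mathrm{BN}(W_h x_t) + U_h h_{t-1}) \tag{6}$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \tag{7}$$
where BN means batch normalization.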
2.3 mGRUIP
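In this work, a novel model called mGRUIP is proposed by inserting a linear input projection layer into mGRU. In mGRUIP, the output state vector $h_t^l$ of layer $l$ is calculated from the input vector $x_t^l$ by the following equations:
$$v_t^l = W_v^l [x_t^l; h_{t-1}^l] \tag{8}$$
$$z_t^l = \sigma(\mathrm{BN}(W_z^l v_t^l)) \tag{9}$$
$$\tilde{h}_t^l = \mathrm{ReLU}(\mathrm{BN}(W_h^l v_t^l)) \tag{10}$$
$$h_t^l = z_t^l \odot h_{t-1}^l + (1 - z_t^l) \odot \tilde{h}_t^l \tag{11}$$
In mGRUIP, the current input vector $x_t^l$ and the previous output state vector $h_{t-1}^l$ are concatenated and compressed into a lower-dimensional projected vector $v_t^l$ by the weight matrix $W_v^l$. Then the update gate activation $z_t^l$ and the candidate state vector $\tilde{h}_t^l$ are calculated from the projected vector $v_t^l$.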
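The recurrence described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: batch normalization is omitted, the weights are random, and the names `mgruip_step`, `W_v`, `W_z`, `W_h` and the toy dimensions are chosen here for illustration rather than taken from the paper's implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mgruip_step(x_t, h_prev, W_v, W_z, W_h):
    """One mGRUIP time step (batch normalization omitted for brevity)."""
    # Concatenate input and previous state, then compress to the projection size.
    v_t = W_v @ np.concatenate([x_t, h_prev])
    z_t = sigmoid(W_z @ v_t)               # update gate, values in (0, 1)
    h_cand = np.maximum(0.0, W_h @ v_t)    # ReLU candidate state
    # Interpolate between the previous state and the candidate state.
    return z_t * h_prev + (1.0 - z_t) * h_cand

# Toy dimensions (hypothetical): input size, cell number, projection size.
n_i, n_c, n_p = 40, 64, 16
rng = np.random.default_rng(0)
W_v = rng.standard_normal((n_p, n_i + n_c)) * 0.1
W_z = rng.standard_normal((n_c, n_p)) * 0.1
W_h = rng.standard_normal((n_c, n_p)) * 0.1

h = np.zeros(n_c)
for x in rng.standard_normal((5, n_i)):    # run five frames through the layer
    h = mgruip_step(x, h, W_v, W_z, W_h)
print(h.shape)
```

Note that both the gate and the candidate state see only the projected vector, which is why the two large matrices of mGRU shrink to the projection width.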
mGRUIP can reduce the parameters of mGRU significantly. The total number of parameters in a standard mGRU network, ignoring the biases, can be computed as follows:
$$N_{\mathrm{mGRU}} = 2 n_c n_i + 2 n_c^2$$
where $n_c$ is the number of hidden neurons, $n_i$ the number of input units, and $N_{\mathrm{mGRU}}$ is the total parameter number of mGRU. For mGRUIP, this value becomes:
$$N_{\mathrm{mGRUIP}} = (n_i + n_c) n_p + 2 n_c n_p$$
where $n_p$ is the number of units in the input projection layer. Assuming $n_i$ is equal to $n_c$, the ratio of these two numbers is:
$$\frac{N_{\mathrm{mGRUIP}}}{N_{\mathrm{mGRU}}} = \frac{4 n_c n_p}{4 n_c^2} = \frac{n_p}{n_c}$$
In a typical configuration we can set $n_c = 1024$ and $n_p = 512$, hence mGRUIP has just half the parameters of mGRU, making the computation quite efficient. Despite this, our experiments on the Switchboard task show that mGRUIP outperforms mGRU with the same number of neurons, i.e., $n_c = 1024$. What’s more, increasing $n_c$ while decreasing $n_p$ can further enlarge the gains.
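The parameter comparison above is easy to check numerically. The following sketch counts weights for a single layer, ignoring biases; the function names are hypothetical, and the counts assume two feedforward and two recurrent matrices for mGRU and one projection matrix plus two projected-input matrices for mGRUIP.

```python
def mgru_params(n_i, n_c):
    # Two feedforward matrices (n_c x n_i) and two recurrent matrices (n_c x n_c).
    return 2 * n_c * n_i + 2 * n_c * n_c

def mgruip_params(n_i, n_c, n_p):
    # One projection matrix (n_p x (n_i + n_c)) plus gate and candidate
    # matrices (n_c x n_p each).
    return n_p * (n_i + n_c) + 2 * n_c * n_p

n_c, n_p = 1024, 512
n_i = n_c  # assumption used in the text: input size equals cell number
ratio = mgruip_params(n_i, n_c, n_p) / mgru_params(n_i, n_c)
print(ratio)  # 0.5
```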
2.4 mGRUIP with Context Module
The input projection layer forms a bottleneck in mGRUIP, making it easier to utilize the future context effectively while keeping the increase in model size acceptable. In this paper, two kinds of context modules, namely temporal encoding and temporal convolution, are specifically designed for mGRUIP.
2.4.1 mGRUIP with Temporal Encoding
In temporal encoding, context information from several future frames is encoded into a fixed-size representation at the input projection layer. Thus equation (8) in a standard mGRUIP now becomes:
$$v_t^l = W_v^l [x_t^l; h_{t-1}^l] + \sum_{i=1}^{K} f(v_{t+s \times i}^{l-1}) \tag{12}$$
where the last summation part in equation (12) stands for temporal encoding. In particular, $v_{t+s \times i}^{l-1}$ is the input projection vector of layer $l-1$ from the $(t + s \times i)$-th frame, $s$ is the step stride and $K$ is the order of future context. $f(\cdot)$ denotes the transform function applied to $v_{t+s \times i}^{l-1}$. In this work, we tried three forms: identity ($f(v) = v$), scale and affine transform. Preliminary results show that the identity function gives slightly better performance than the other two forms. Thus we choose the identity function for the rest of this paper. It should be noted that in this case, temporal encoding brings no additional parameters for mGRUIP.
2.4.2 mGRUIP with Temporal Convolution
Temporal encoding uses the projection vector of the lower layer ($v^{l-1}$) to represent the future context, while in temporal convolution, the future information is extracted from the output state vector of the lower layer and then compressed by the input projection. Equation (8) now becomes:
$$v_t^l = W_v^l [x_t^l; h_{t-1}^l] + W_p^l [h_{t+s}^{l-1}; \ldots; h_{t+s \times K}^{l-1}] \tag{13}$$
where the last part represents temporal convolution. In particular, $h_{t+s \times i}^{l-1}$ is the output state vector of layer $l-1$ on the $(t + s \times i)$-th frame. As in temporal encoding, $s$ is the step stride and $K$ is the context order. According to this equation, the output state vectors from $K$ future frames are spliced together and projected to a lower-dimensional space by the matrix $W_p^l$. Assuming the number of hidden neurons in layer $l-1$ is $n_c$ and the projection dimension is $n_p$, temporal convolution brings $K \cdot n_c \cdot n_p$ additional parameters. However, since the value of $n_p$ is usually quite small and we generally splice no more than two frames ($K \le 2$), the increase in model size is limited and acceptable.
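To make the two context modules concrete, the sketch below computes the extra term that each one adds to the input projection of a layer. It is a minimal illustration under assumptions: the identity transform is used for temporal encoding, the names `temporal_encoding`, `temporal_convolution` and `W_p` are hypothetical, the data are random, and frames near the end of an utterance (where the future index would run past the last frame) are not handled.

```python
import numpy as np

def temporal_encoding(v_lower, t, stride, order):
    """Sum lower-layer projection vectors from `order` future frames (identity transform)."""
    return sum(v_lower[t + stride * i] for i in range(1, order + 1))

def temporal_convolution(h_lower, t, stride, order, W_p):
    """Splice lower-layer output states from `order` future frames and project them."""
    spliced = np.concatenate([h_lower[t + stride * i] for i in range(1, order + 1)])
    return W_p @ spliced

# Toy sizes (hypothetical): frames, cell number, projection size, stride, order.
T, n_c, n_p, stride, order = 10, 8, 4, 3, 1
rng = np.random.default_rng(0)
v_lower = rng.standard_normal((T, n_p))   # projection vectors of the lower layer
h_lower = rng.standard_normal((T, n_c))   # output states of the lower layer
W_p = rng.standard_normal((n_p, order * n_c))

enc = temporal_encoding(v_lower, t=2, stride=stride, order=order)
conv = temporal_convolution(h_lower, t=2, stride=stride, order=order, W_p=W_p)
extra_params = W_p.size                   # order * n_c * n_p weights for convolution
print(enc.shape, conv.shape, extra_params)
```

Both terms have the projection dimension, so either can be added directly to the projected vector of the current layer; only the convolution variant introduces new weights.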
3 Related Work
The authors in [23] proposed to insert an input projection layer into vanilla LSTM to reduce the computation cost. In this work, we apply this idea to mGRU [20], obtaining a model called mGRUIP, which is shown to be more effective and more efficient than mGRU.
TDNN-LSTM [7] is one of the most powerful acoustic models that can utilize future context effectively while having relatively low model latency. However, its ability to model the future temporal context comes from the TDNN layers and has nothing to do with the LSTM layers. In this work, thanks to the input projection layer, we enable mGRUIP to model the future context effectively and directly, by equipping it with one of the two proposed context modules, temporal encoding and temporal convolution. These two modules borrow ideas from FSMN [2, 3] and TDNN [4, 5] respectively. The difference is that FSMN and TDNN are FFNNs, so both of them need to model the future context as well as the past information to capture long-term dependencies. In contrast, the two proposed context modules are placed in an RNN layer, and they only need to focus on the future context, leaving the history to be modeled by the recurrent connections.
Row convolution [24], which encodes future context by applying a context-independent weight matrix, is another method to model the future context for RNNs. The idea is similar to the two proposed context modules. However, row convolution in [24] is only placed above all recurrent layers, while in this work we place context modules in all hidden layers (except the first one). This layer-wise context expansion gives the higher layers the ability to learn wider temporal relationships than the lower layers. What’s more, the objective function is also different: connectionist temporal classification (CTC) [25] in [24] versus lattice-free MMI (LF-MMI) [26] in this work.
4 Experiments
In this section, we evaluate the effectiveness and efficiency of the proposed mGRUIP on two ASR tasks. The first one is the 309-hour Switchboard conversational telephone speech task, and the second one is an internal Mandarin voice input task with 1400 hours of training data. All the models in this paper are trained with the LF-MMI objective function computed on 33 Hz outputs [26].
4.1 Switchboard ASR Task
The training data set consists of the 309-hour Switchboard-I training data. Evaluation is performed in terms of word error rate (WER) on the full Switchboard Hub5’00 test set, consisting of two subsets: Switchboard (SWB) and CallHome (CHM). The experimental setup follows [26]. We use the speed-perturbation technique [28] for 3-fold data augmentation, and iVectors to perform instantaneous adaptation of the neural network [29]. WER results are reported after 4-gram LM rescoring of lattices generated using a trigram LM. For details about the model training, the reader is directed to [26].
4.1.1 Baseline Models
Two baseline models, LSTM and mGRU, are trained for this task. Both of them contain 5 hidden layers, and the cell number for each layer is 1024. For LSTM, we add a recurrent projection layer on top of the memory blocks with a dimension of 512, compressing the cell output from 1024 to 512 dimensions. For mGRU, to reduce the parameters of the softmax output matrix, we insert a 512-dimensional linear bottleneck layer between the last hidden layer and the softmax layer. Both models are trained with an output delay of 50 ms. The input feature to both models at time step $t$ is a spliced version of the frames around $t$. Together with the output delay, the future frames in this splice give both models a latency of 70 ms. Following [7], we use a mixed frame rate (MFR) across layers. In particular, the first hidden layer is operated at a 100 Hz frame rate while the higher layers use a frame rate of 33 Hz.
4.1.2 mGRUIP
To evaluate the effectiveness of the proposed mGRUIP, we train two models containing 5 layers, mGRUIP-A and mGRUIP-B, with different architectures. In mGRUIP-A, each hidden layer consists of 1024 cells (the same as the baseline models), and the input projection layer has 512 units. For mGRUIP-B, the cell number is 2560 and the projection dimension is 256. The training configurations are kept the same as for the baseline models.
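Table 1: Performance of the two mGRUIP models and the two baseline models on Hub5’00.

| Model | #Param (M) | WER (%) SWB | WER (%) CHM | WER (%) Total |
|---|---|---|---|---|
| LSTM | 19.7 | 10.3 | 20.7 | 15.6 |
| mGRU | 22.1 | 10.2 | 20.6 | 15.5 |
| mGRUIP-A | 13.1 | 9.8 | 19.0 | 14.5 |
| mGRUIP-B | 16.2 | 9.7 | 18.8 | 14.3 |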
The performance of the two mGRUIP models and the two baseline models is shown in Table 1. We can see that, of the two baseline models, mGRU has more parameters and performs slightly better than LSTM. The proposed model mGRUIP-A contains far fewer parameters than the baseline mGRU (13.1M vs. 22.1M), but performs significantly better on the full test set (14.5 vs. 15.5). This means that the input projection layer not only reduces the parameters of mGRU, but is also beneficial to the performance. It is also shown that mGRUIP-B outperforms mGRUIP-A, meaning that we can improve the ASR performance by increasing the cell number, while keeping the model size from growing significantly by reducing the projection dimension. Compared with mGRU, mGRUIP-B provides 7.7% relative WER reduction on the full test set while using 5.9M fewer parameters. In the following experiments, we use the mGRUIP-B configuration (2560 cells and a 256-dimensional projection) for the mGRUIP-related models.
4.1.3 mGRUIP with Context Modules
Temporal encoding and temporal convolution can utilize more future context information by increasing the context order and the step stride in equations (12) and (13). However, this increases the model latency and, for temporal convolution, the number of model parameters. In this work, we ran many experiments and found that the most cost-effective settings for these two context modules are as follows:
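Table 2: Context module settings for each mGRUIP layer.

| Layer | Context order | Step stride |
|---|---|---|
| 2 | 1 | 1 |
| 3 | 1 | 3 |
| 4 | 1 | 3 |
| 5 | 1 | 3 |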
As shown in Table 2, all four higher mGRUIP layers (that is, all except the first one) are equipped with context modules. The context order for all of them is 1, and the step stride is 3 for the highest three layers and 1 for the second hidden layer, making the operating frame rates the same as in the baselines. With context modules in this setting, the latency of mGRUIP increases from 70 ms to 170 ms. Table 3 shows the performance of mGRUIP with these two context modules. We also train a TDNN-LSTM model following [7], and the results are shown in the second line of Table 3.
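The latency figure can be reproduced with a short calculation. The sketch below assumes that step strides are counted in 10 ms frames (the 100 Hz rate of the first layer) and that each layer's look-ahead simply adds to the 70 ms of the model without context modules; both points are an interpretation of the description above, not something stated explicitly.

```python
BASE_LATENCY_MS = 70   # latency of mGRUIP without context modules (from the text)
FRAME_SHIFT_MS = 10    # assumption: strides are counted in 100 Hz (10 ms) frames

# (layer index, context order, step stride) for the four higher layers
context_config = [(2, 1, 1), (3, 1, 3), (4, 1, 3), (5, 1, 3)]

# Each layer looks order * stride frames into the future of the layer below.
extra_ms = sum(order * stride * FRAME_SHIFT_MS for _, order, stride in context_config)
total_ms = BASE_LATENCY_MS + extra_ms
print(extra_ms, total_ms)  # 100 170
```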
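Table 3: Performance of mGRUIP with the two context modules on Hub5’00.

| Model | #Param (M) | Latency (ms) | WER (%) SWB | WER (%) CHM | WER (%) Total |
|---|---|---|---|---|---|
| LSTM | 19.7 | 70 | 10.3 | 20.7 | 15.6 |
| TDNN-LSTM | 34.8 | 200 | 9.0 | 19.7 | 14.4 |
| mGRUIP-B | 16.2 | 70 | 9.7 | 18.8 | 14.3 |
| +Ctx Encd | 16.2 | 170 | 9.5 | 18.0 | 13.8 |
| +Ctx Conv | 18.7 | 170 | 9.2 | 17.8 | 13.5 |
| MFR-BLSTM [7] | | 2020 | 9.0 | | 13.6 |
| TDNN-BLSTM-C [7] | | 2130 | 9.0 | | 13.8 |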
Several observations can be made from Table 3. First, both context modules improve the ASR performance of mGRUIP. Temporal convolution is more powerful than temporal encoding, but brings some additional parameters. Second, compared to LSTM, mGRUIP-B equipped with temporal convolution provides 13.5% relative WER reduction, at the cost of only 100 ms of additional model latency. Third, mGRUIP-B with temporal convolution is more effective than TDNN-LSTM on the full test set (13.5 vs. 14.4), with smaller model latency and far fewer parameters (18.7M vs. 34.8M). What’s more, compared with the two most powerful models in [7] (the last two lines of Table 3), the proposed model outperforms them on the full set with much smaller model latency (170 ms vs. 2000 ms).
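The relative reductions quoted for the Switchboard task follow directly from the total WER values reported in the tables. The helper below is a small sketch of that arithmetic; the function name `relative_reduction` and the dictionary keys are chosen here for illustration.

```python
def relative_reduction(baseline, proposed):
    """Relative error-rate reduction of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline

# Total WER (%) on the full Hub5'00 test set, as reported in the tables.
wer = {
    "LSTM": 15.6,
    "mGRU": 15.5,
    "mGRUIP-B": 14.3,
    "mGRUIP-B + Ctx Conv": 13.5,
}

vs_lstm = relative_reduction(wer["LSTM"], wer["mGRUIP-B + Ctx Conv"])
vs_mgru = relative_reduction(wer["mGRU"], wer["mGRUIP-B"])
print(round(vs_lstm, 1), round(vs_mgru, 1))  # 13.5 7.7
```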
4.2 Internal Mandarin ASR Task
The second task is an internal Mandarin ASR task, of which the training set contains 1400 hours of mobile recording data. The performance is evaluated on five publicly available test sets, including three clean and two noisy ones. The three clean sets are:

AiShell_dev and AiShell_test: drawn from the AISHELL-1 corpus [30].

THCHS30_Clean: drawn from the THCHS-30 corpus [31].

The two noisy test sets are:

THCHS30_Car: the version of THCHS30_Clean corrupted by car noise; the noise level is 0 dB.

THCHS30_Cafe: the version of THCHS30_Clean corrupted by cafeteria noise; the noise level is 0 dB.
Three ASR systems are built for this task: LSTM, TDNN-LSTM and mGRUIP-B with temporal convolution. The model architectures and the training configurations are all the same as for the Switchboard task. Results are shown in Table 4.
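Table 4: Results on the internal Mandarin ASR task.

| Test | CER (%) LSTM | CER (%) TDNN-LSTM | CER (%) mGRUIP | CERR |
|---|---|---|---|---|
| AiShell_dev | 5.39 | 4.81 | 4.66 | 13.5% |
| AiShell_test | 6.62 | 5.98 | 5.71 | 13.8% |
| THCHS30_Clean | 11.93 | 10.97 | 10.38 | 13.0% |
| THCHS30_Car | 12.69 | 11.38 | 10.77 | 15.1% |
| THCHS30_Cafe | 53.19 | 44.20 | 40.26 | 24.3% |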
The CERR column in Table 4 gives the relative CER reduction of mGRUIP over LSTM. mGRUIP performs much better than the baseline LSTM model on this task. On the three clean test sets, the CERR is about 13%, and the gain is even larger on the two very noisy sets, from 15% to 24%.
5 Conclusions
The aim of this paper is to design an RNN acoustic model that is capable of utilizing the future context effectively and directly, with the model latency and computation cost as low as possible. To achieve this goal, we choose the minimal GRU as our base model and propose to insert an input projection layer into it to further reduce the parameters. To model the future context effectively, we design two kinds of context modules, temporal encoding and temporal convolution, specifically for this architecture. Experimental results on the Switchboard task and an internal Mandarin ASR task show that the proposed model performs much better than the LSTM and mGRU models, while enabling online decoding with a latency of 170 ms. This model even outperforms a very strong baseline, TDNN-LSTM, with smaller model latency and roughly half the parameters.
References
[1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio Speech & Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
[2] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, “Feedforward sequential memory networks: A new structure to learn long-term dependency,” Computer Science, 2015.
[3] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. R. Dai, “Compact feedforward sequential memory networks for large vocabulary continuous speech recognition,” in INTERSPEECH, 2016, pp. 3389–3393.
[4] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” Readings in Speech Recognition, vol. 1, no. 2, pp. 393–404, 1990.
[5] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH, 2015.
[6] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” Computer Science, pp. 338–342, 2014.
[7] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, “Low latency acoustic modeling using temporal convolution and LSTMs,” IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1–1, 2017.
[8] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks. IEEE Press, 1997.
[9] A. Graves, S. Fernández, and J. Schmidhuber, Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. Springer Berlin Heidelberg, 2005.
[10] A. Graves, N. Jaitly, and A. R. Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in Automatic Speech Recognition and Understanding, 2014, pp. 273–278.
[11] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, “Highway long short-term memory RNNs for distant speech recognition,” Computer Science, pp. 5755–5759, 2015.
[12] A. Zeyer, R. Schlüter, and H. Ney, “Towards online-recognition with deep bidirectional LSTM acoustic models,” in INTERSPEECH, 2016, pp. 3424–3428.
[13] K. Chen and Q. Huo, Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach. IEEE Press, 2016.
[14] K. Chen, Z. J. Yan, and Q. Huo, “A context-sensitive-chunk BPTT approach to training deep LSTM/BLSTM recurrent neural networks for offline handwriting recognition,” in International Conference on Document Analysis and Recognition, 2016, pp. 411–415.
[15] S. Xue and Z. Yan, “Improving latency-controlled BLSTM acoustic models for online speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 5340–5344.
[16] S. Hochreiter and J. Schmidhuber, Long short-term memory. Springer Berlin Heidelberg, 1997.
[17] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to Forget: Continual Prediction with LSTM. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, 1999.
[18] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in IEEE-INNS-ENNS International Joint Conference on Neural Networks, 2000, pp. 189–194 vol.3.
[19] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Netw, vol. 18, no. 5-6, p. 602, 2005.
[20] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Improving speech recognition by revising gated recurrent units,” INTERSPEECH, pp. 1308–1312, 2017.

[21] K. Cho, B. V. Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” Computer Science, 2014.
[22] J. Chung, C. Gulcehre, K. H. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” Eprint Arxiv, 2014.
[23] T. Masuko, “Computational cost reduction of long short-term memory based on simultaneous compression of input and hidden state,” in Automatic Speech Recognition and Understanding, 2017.
[24] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, and G. Diamos, “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in ICML, 2015.
[25] A. Graves and F. Gomez, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning, 2006, pp. 369–376.
[26] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in INTERSPEECH, 2016, pp. 2751–2755.
[27] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” Proc Interspeech, 2013.
 [28] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” Proc Interspeech, 2015.
[29] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Automatic Speech Recognition and Understanding, 2014, pp. 55–59.
[30] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” 2017.
[31] D. Wang, X. Zhang, and Z. Zhang, “THCHS-30: A free Chinese speech corpus,” 2015. [Online]. Available: http://arxiv.org/abs/1512.01882