On using 2D sequence-to-sequence models for speech recognition

11/20/2019 ∙ by Parnia Bahar, et al. ∙ RWTH Aachen University

Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. In these architectures, one-dimensional input and output sequences are related by an attention mechanism, replacing the more explicit alignment processes of classical HMM-based modeling. In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture to directly model the relation between audio feature-vector sequences and word sequences. Instead of using any type of attention component, the proposed model applies a 2DLSTM layer to assimilate context from both the input observations and the output transcriptions. Experimental evaluation on the Switchboard 300h automatic speech recognition task shows that the 2DLSTM model achieves word error rates competitive with an end-to-end attention-based model.




1 Introduction

Conventional automatic speech recognition (ASR) systems using Gaussian mixture models (GMM) and/or hybrid deep neural network (DNN) hidden Markov models (HMM) consist of several components that are trained separately, depend on pretrained alignments and require a complex search [11, 18, 4, 29]. Unlike the conventional approaches, attention-based sequence-to-sequence models offer a standalone, single neural network that is trained end-to-end, needs no explicit alignments or context-dependent phonetic labels as in HMMs, and simplifies inference. In these models, an implicit probabilistic notion of alignment is learned as part of the neural network; however, it does not behave in the same way as the alignment models of the conventional methods.

The widely used attention-based sequence-to-sequence systems are based on an encoder-decoder architecture, where one or more long short-term memory (LSTM) layers read the observation sequence and another LSTM decodes it into a variable-length output sequence of characters or words. In such architectures, both input and output sequences are separately handled as one-dimensional sequences over time. An attention mechanism is then added to the architecture to combine the encoder and the decoder by allowing the decoder to selectively focus on individual parts of the encoder state sequence [23, 2, 6, 3, 30].
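For reference, the additive attention scoring used in such encoder-decoder baselines [2] can be sketched in a few lines of NumPy. All names and dimensions here are illustrative, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(enc_states, dec_state, W_enc, W_dec, v):
    """Additive (Bahdanau-style) attention: score every encoder state
    against the current decoder state, normalize the scores, and return
    the weighted sum of encoder states as the context vector."""
    # enc_states: (T, H_enc); dec_state: (H_dec,)
    scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v  # (T,)
    alpha = softmax(scores)          # attention weights over time
    context = alpha @ enc_states     # weighted sum, shape (H_enc,)
    return context, alpha

rng = np.random.default_rng(0)
enc = rng.normal(size=(5, 4))        # T=5 encoder states of size 4
dec = rng.normal(size=3)             # current decoder state of size 3
W_enc = rng.normal(size=(4, 6))      # encoder projection to scoring space
W_dec = rng.normal(size=(3, 6))      # decoder projection to scoring space
v = rng.normal(size=6)               # scoring vector
ctx, alpha = additive_attention(enc, dec, W_enc, W_dec, v)
assert np.isclose(alpha.sum(), 1.0)  # weights form a distribution over T
```

This is the component that the 2D model described below replaces entirely.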

The LSTM [10] is well suited for sequence modeling, where the sequence is strongly correlated along a one-dimensional time axis. Handling dynamic lengths, encoding positional information, the ability to make use of previous context, and tracking long-term dependencies through its gating strategy are some of the properties that make the LSTM appropriate for sequence-to-sequence modeling. Although an LSTM is essentially one-dimensional, it can be extended to process multi-dimensional data such as images or videos [9].

In this work, we investigate the use of the two-dimensional LSTM (2DLSTM) [9, 8] in sequence-to-sequence modeling as an alternative to the attention component. In this architecture, we apply a 2DLSTM on top of a deep bidirectional encoder to relate input and output representations in a 2D space. One dimension of the 2DLSTM processes the input sequence, and the other predicts the output (sub)words. In contrast to the attention-based sequence-to-sequence model, where the encoder states are never updated and the model cannot re-interpret them while decoding, this model computes the encoding of the observation sequence as a function of the previously generated transcribed words. Our model is similar to an architecture used in machine translation [1]. We believe that the 2DLSTM is able to capture the necessary monotonic alignment as well as realize a coverage concept internally via its cell states. Experimental results on the 300h Switchboard task show competitive performance compared to an attention-based sequence-to-sequence system.

Figure 1: The internal architectures of (a) the standard LSTM and (b) the 2DLSTM. The additional connections are marked in blue [1].

2 Related Works

A way of building multidimensional context into recurrent networks is provided by networks with tree-structured update graphs. In handwriting recognition (HWR), the 2DLSTM has shown successful results in the automatic extraction of features from raw 2D images, outperforming convolutional neural networks (CNNs) [14]. To enable deeper and larger 2DLSTM models, an algorithm exploiting GPU power has been implemented [26].

Different neural networks have been proposed in ASR to model 2D correlations in the input signal. One of them is a 2DLSTM layer that scans the input jointly over time and frequency for spatio-temporal modeling, thereby capturing more variation [17]. Moreover, various architectures for modeling time-frequency patterns based on DNN, CNN, RNN and 2DLSTM layers have been compared for large-vocabulary ASR [19].

As an alternative to the concept of the 2DLSTM, a network of one-dimensional LSTM cells arranged in a multidimensional grid has been introduced [12]. In this topology, the LSTM cells communicate not only along the time sequence but also between layers. The grid LSTM network has also been applied to the endpoint detection task in ASR to model both spectral and temporal variations [16]. A 2D attention matrix is also applied in a neural pitch accent recognition model [5], in which graphemes are encoded in one dimension and audio frames in the other.

Recently, the 2DLSTM layer also has been used for sequence-to-sequence modeling in machine translation [1] where it implicitly updates the source representation conditioned on the generated target words. In a similar direction, a 2D CNN-based network has been proposed where the positions of the source and the target words define the 2D grid for translation modeling [7].

Similar to [1], we apply a 2DLSTM layer to combine the acoustic model (the LSTM encoder) and the language model (the decoder) without any attention component. The 2DLSTM reconciles the context from both the input and the output sequences and re-interprets the encoder states whenever a new word is predicted. Compared to [1], our model is much deeper, and we use max-pooling to select the most relevant encoder state, whereas [1] uses the last horizontal state of the 2DLSTM. Furthermore, we utilize the pretraining scheme explained in [30] during training and a faster decoding.

3 2D Long Short-Term Memory

The 2DLSTM is characterized as a general form of the standard LSTM [9, 15]. It has been proposed to process inherently two-dimensional data of arbitrary lengths J and I, and therefore uses both a horizontal and a vertical recurrence. The building blocks of both the LSTM and the 2DLSTM are shown in Figure 1. At step (j, i), the unit receives an input x_{j,i}, and its computation relies on both the vertical hidden state s_{j-1,i} and the horizontal hidden state s_{j,i-1}. Besides the input, forget and output gates, which are similar to those of the LSTM, the 2DLSTM employs an additional lambda gate whose activation is computed analogously to the other gates [1, 9]:

g_{j,i} = tanh( W_1 x_{j,i} + U_1 s_{j,i-1} + V_1 s_{j-1,i} )
i_{j,i} = σ( W_2 x_{j,i} + U_2 s_{j,i-1} + V_2 s_{j-1,i} )
f_{j,i} = σ( W_3 x_{j,i} + U_3 s_{j,i-1} + V_3 s_{j-1,i} )
o_{j,i} = σ( W_4 x_{j,i} + U_4 s_{j,i-1} + V_4 s_{j-1,i} )
λ_{j,i} = σ( W_5 x_{j,i} + U_5 s_{j,i-1} + V_5 s_{j-1,i} )
c_{j,i} = f_{j,i} ⊙ ( λ_{j,i} ⊙ c_{j,i-1} + (1 − λ_{j,i}) ⊙ c_{j-1,i} ) + g_{j,i} ⊙ i_{j,i}
s_{j,i} = o_{j,i} ⊙ tanh( c_{j,i} )

The internal cell state c_{j,i} is computed from the sum of the two previous cell states c_{j,i-1} and c_{j-1,i}, weighted by the lambda gate λ_{j,i} and its complement. Similar to the LSTM, the internal cell state is combined with the output gate to yield the hidden state. tanh and σ denote the hyperbolic tangent and the sigmoid function, and W_1, …, W_5, U_1, …, U_5 and V_1, …, V_5 are the weight matrices. For notational simplicity, we omit the bias vectors.

We process the 2D data in a forward pass from step (1, 1) to (J, I), and the gradient is passed backwards in the opposite direction, from (J, I) to (1, 1). Training a 2DLSTM unit thus involves back-propagation through two dimensions. For more details, we refer the reader to [9, 15].
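As a concrete illustration, a single 2DLSTM update consistent with the gating described above can be sketched in NumPy. The packing of the five gates into stacked weight matrices and the omission of biases are our simplifications, not the paper's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def twod_lstm_step(x, s_left, s_below, c_left, c_below, W, U, V):
    """One 2DLSTM cell update at grid position (j, i).
    s_left/c_left come from (j, i-1) (horizontal recurrence),
    s_below/c_below from (j-1, i) (vertical recurrence).
    W, U, V stack the weights of the 5 gates: g, i, f, o, lambda."""
    H = s_left.shape[0]
    pre = W @ x + U @ s_left + V @ s_below          # (5H,)
    g   = np.tanh(pre[0:H])                         # cell candidate
    i_g = sigmoid(pre[H:2*H])                       # input gate
    f_g = sigmoid(pre[2*H:3*H])                     # forget gate
    o_g = sigmoid(pre[3*H:4*H])                     # output gate
    lam = sigmoid(pre[4*H:5*H])                     # lambda gate
    # cell state: lambda-weighted sum of the two previous cell states
    c = f_g * (lam * c_left + (1.0 - lam) * c_below) + i_g * g
    s = o_g * np.tanh(c)                            # hidden state
    return s, c

rng = np.random.default_rng(0)
H, D = 3, 2
s, c = twod_lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                      np.zeros(H), np.zeros(H),
                      rng.normal(size=(5*H, D)),
                      rng.normal(size=(5*H, H)),
                      rng.normal(size=(5*H, H)))
assert s.shape == (H,) and c.shape == (H,)
```

Filling the full grid amounts to iterating this update over all (j, i), which is exactly what makes the forward pass quadratic in the two sequence lengths.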

4 2D Sequence-to-Sequence Model

Bayes decision rule requires maximization of the class posterior given an input observation. In ASR, the classes are discrete label sequences of unknown length N (e.g. word, subword or character sequences), denoted as w_1^N. Given an input observation sequence x_1^T of variable length T, where usually T > N, the posterior probability of a label sequence is defined as p(w_1^N | x_1^T). This conditional distribution usually covers the alignment between the input observation sequence and the output word sequence either implicitly or explicitly.

In the attention-based sequence-to-sequence approach, the attention weights serve as the implicit probabilistic notion of alignments aligning output labels to encoder states. The freedom of the attention model to focus on the entire input sequence might contradict monotonicity in ASR. In this work, we remove the attention component and intend to investigate whether the 2D sequence-to-sequence modeling is able to properly capture the input-output monotonic relation.

As shown in Figure 2, we apply a deep bidirectional LSTM encoder (Enc) to scan the observation sequence x_1^T. On top of each bidirectional LSTM layer, we conduct max-pooling over the time dimension to reduce the observation length. Hence, the encoder states are formulated as:

h_1^{T'} = Enc(x_1^T)

where T' is the length reduced by a total reduction factor. Similar to [1], we then equip the network with a 2DLSTM layer to relate the encoder and the decoder states. At step (t, n), the 2DLSTM receives both the encoder state h_t and the embedding vector of the last target word, ŵ_{n-1}, as inputs. One dimension of the 2DLSTM (the horizontal axis in the figure) sequentially reads the encoder states, and the other (the vertical axis) plays the role of the decoder; hence, there is no additional decoder LSTM. Unlike the attention-based sequence-to-sequence model, where the encoder states are computed once at the beginning, our model repeatedly updates the encoder representations h_1^{T'} while generating a new output word w_n. We note that this model does not use any attention component. The state of the 2DLSTM is derived as:

s_{t,n} = 2DLSTM( h_t, ŵ_{n-1}; s_{t-1,n}, s_{t,n-1} )
Note that the 2DLSTM state at a label/word step n depends only on the preceding word sequence w_1^{n-1}, while it takes into account the whole temporal context of the input observation sequence.
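The encoder's time reduction by max-pooling can be sketched in NumPy. The BiLSTM layers themselves are elided here, and repeating a factor-2 pool three times for a total reduction of 8 is our illustrative choice; only the downsampling mechanics are shown:

```python
import numpy as np

def max_pool_time(h, factor=2):
    """Downsample a (T, H) feature sequence along the time axis by
    taking the element-wise max over each group of `factor` frames."""
    T, H = h.shape
    T_red = T // factor
    # drop a possible remainder, then pool each group of `factor` frames
    return h[:T_red * factor].reshape(T_red, factor, H).max(axis=1)

# three pooling steps of factor 2 yield the total reduction factor of 8
h = np.random.default_rng(0).normal(size=(80, 4))
for _ in range(3):
    h = max_pool_time(h, 2)
assert h.shape == (10, 4)   # T' = 80 / 8
```

In the actual model one such pooling follows each bidirectional LSTM layer, so the 2DLSTM only has to scan the shortened sequence h_1^{T'}.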

At each decoder step n, once the whole input sequence has been processed from t = 1 to T', we apply max-pooling over all horizontal states s_{1,n}, …, s_{T',n} to obtain the context vector c_n. We have also tried average-pooling and taking the last horizontal state instead of max-pooling, but neither performed better in this case. To generate the next output word w_n, a transformation followed by a softmax operation is applied:

p(w_n | w_1^{n-1}, x_1^T) = softmax( W_o c_n )
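The max-pooled context and softmax output over one decoder row can be sketched as follows; `W_out` and all dimensions are illustrative placeholders:

```python
import numpy as np

def output_distribution(row_states, W_out):
    """Given the 2DLSTM states of one decoder row (shape (T', H)),
    max-pool element-wise over the horizontal (time) axis and project
    the resulting context vector to the output vocabulary."""
    context = row_states.max(axis=0)      # element-wise max over t'
    logits = W_out @ context
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()                    # distribution over (sub)words

rng = np.random.default_rng(1)
row = rng.normal(size=(10, 8))            # T'=10 states of size 8
p = output_distribution(row, rng.normal(size=(1000, 8)))
assert p.shape == (1000,) and np.isclose(p.sum(), 1.0)
```

Replacing `max(axis=0)` with `mean(axis=0)` or `row_states[-1]` gives the average-pooling and last-state variants mentioned above.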

Figure 2: The 2D seq2seq architecture using the 2DLSTM layer on top of a deep bidirectional LSTM encoder. Neither attention components nor an explicit LSTM decoder are used. Inspired by [1].

5 Experiments

We have conducted experiments on the Switchboard 300h task. We apply 40-dimensional Gammatone features [20] using the RASR feature extractor [27]. We use the full Hub5’00 including Switchboard (SWB) and Callhome (CH) as the development set and the Hub5’01 as a test set. In order to enable an open-vocabulary system, we use byte-pair-encoding (BPE) [21] with 1k merge operations.
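Training on BPE units means decoded subwords must later be merged back into words. A minimal sketch of that merge step, assuming the common "@@" continuation marker of the subword-nmt toolkit [21] (the marker convention is our assumption; the paper does not specify it):

```python
def merge_bpe(subwords, marker="@@"):
    """Merge BPE subword tokens back into words, assuming non-final
    subword pieces carry a trailing continuation marker."""
    words, current = [], ""
    for tok in subwords:
        if tok.endswith(marker):
            current += tok[:-len(marker)]   # continue the current word
        else:
            words.append(current + tok)     # word boundary reached
            current = ""
    if current:                             # trailing unfinished piece
        words.append(current)
    return words

assert merge_bpe(["switch@@", "board", "task"]) == ["switchboard", "task"]
```

The inverse step, splitting rare words into BPE units, is learned from the 1k merge operations mentioned above.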

As our baseline, we utilize an attention-based sequence-to-sequence architecture similar to that described in [30], with the exact same pretraining scheme and the same reduction factor. The baseline model includes a one-layer LSTM decoder with additive attention equipped with fertility feedback.

The feature vectors are passed into a stack of 6 bidirectional LSTM layers of size 1000 in each direction, followed by max-pooling operations. We downsample the input sequence by a total factor of 8 as described in [30]. The 2DLSTM layer has 1000 nodes, and the output subwords are projected into a 620-dimensional embedding space. The models are trained end-to-end using the Adam optimizer [13], dropout [22], label smoothing [24] and a learning-rate warmup. We reduce the learning rate by a factor of 0.7 following a variant of the Newbob scheme, based on the perplexity on the development set over a few checkpoints.

In our training, we use layer-wise pretraining for the encoder: we start with two encoder layers and a single max-pool in between, with the same multiple-step reduction factor as in [30]. Decoding is performed using beam search, and the subwords are merged back into words. We do not utilize any language model (LM), neither in the baseline system nor in the 2D sequence-to-sequence model. The model is built using our in-house CUDA implementation of the 2DLSTM [26] with the speedups available in RETURNN [28]. The code is open source and the configurations of the setups are available online.
Table 1 compares the total number of parameters, the perplexity and the frame error rate (FER) on the development set for our model and the attention baseline. Both models have the same vocabulary size of almost 1k. Our model has 3M more parameters; the perplexity and the FER are comparable. We also compare our model against prior works based on the WER, listed in Table 2. As a simple measure of significance, the reported WERs are averaged over 3 runs. Although our 2D sequence-to-sequence model still lags behind hybrid methods, it achieves competitive results against the attention baseline, and it outperforms the baseline on the Hub5'01 set. Including a separate LM in the search, we would expect further improvements.

model | # params | perplexity | FER [%]
baseline [30] | 157M | 1.56 | 10.9
this work | 160M | 1.53 | 10.6
Table 1: Total number of parameters, perplexity and frame error rate (FER) on the development set.
model | LM | SWB | CH | Hub5'01
prior works
    hybrid | LSTM | 8.3 | 17.3 | 12.9
    CTC [31] | RNN | 14.0 | 25.3 | -
    attention [25] | - | 23.1 | 40.8 | -
    attention [30] | - | 13.0 | 26.2 | 19.4
this work | - | 12.9 | 26.4 | 19.0

Table 2: WER [%] on the Switchboard 300h task (Hub5'00 SWB and CH subsets, and Hub5'01). Our results are the average of 3 runs.

We also compare our model with the attention-based sequence-to-sequence model in terms of decoding speed. Since the whole output label sequence is known during training, the entire grid of 2DLSTM states can be computed at once, and at each output step one row of it is read. This is not possible during search, where the output sequence has to be predicted; therefore, during decoding we compute the states of the 2DLSTM row by row, which slows down the search procedure. This algorithm is still faster than that of [1], which at each output step recomputes all previous 2DLSTM states from scratch, although they are not required. Table 3 lists the decoding speed of the models on the entire development set using a single GPU. Overall, the decoding of our model is about 6 times slower than that of a standard attention-based model.
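The row-wise computation during search can be sketched abstractly; here `step_fn` stands in for the 2DLSTM cell update, and the point is only the number of cell evaluations required:

```python
def decode_rows(T_red, N, step_fn):
    """Compute 2DLSTM states row by row during search, caching the
    previous row so each new output step costs only T' cell updates
    instead of recomputing the whole grid from scratch."""
    prev_row = [None] * T_red          # vertical predecessors (row n-1)
    evals = 0
    for n in range(N):                 # output (vertical) steps
        row, left = [], None
        for t in range(T_red):         # input (horizontal) steps
            # each update uses the left state (t-1, n) and the
            # cached state from the row below (t, n-1)
            left = step_fn(left, prev_row[t])
            row.append(left)
            evals += 1
        prev_row = row
    return evals

# T'=10 encoder frames, N=5 output steps: 50 cell updates in total,
# versus sum_{n=1..5} n*10 = 150 if every row were recomputed each step
assert decode_rows(10, 5, lambda left, below: 0) == 50
```

Caching the previous row makes the per-step cost linear in T', which is where the speedup over the recompute-from-scratch scheme of [1] comes from.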

model decoding speed (mins)
baseline [30] 4
this work 26
Table 3: Decoding speed measured in minutes on the entire development set.

6 Conclusion

We have applied a simple 2D sequence-to-sequence model as an alternative to the attention-based model. In our model, a 2DLSTM layer jointly combines the input and the output representations. It processes the observation sequence along the horizontal dimension and generates the output (sub)word sequence along the vertical axis. It has no additional LSTM decoder and does not use any attention component. Contrary to the attention-based sequence-to-sequence model, it repeatedly re-encodes the encoder representation when a new output (sub)word is generated. The experimental results are competitive with the baseline on the 300h Switchboard Hub5'00 set and show improvements on Hub5'01. Our future goal is to develop a bidirectional 2DLSTM to model the task entirely without standard LSTM layers, as well as to run experiments on further speech tasks.

7 Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694537, project “SEQCLAS”) and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains.


  • [1] P. Bahar, C. Brix, and H. Ney (2018) Towards two-dimensional sequence to sequence model in neural machine translation. arXiv preprint arXiv:1810.03975. Cited by: Figure 1, §1, §2, §2, §3, Figure 2, §4, §5.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. External Links: Link Cited by: §1.
  • [3] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio (2016) End-to-end attention-based large vocabulary speech recognition. In IEEE Inter. Conf. ICASSP, Shanghai, China, Mar. 20-25, 2016, pp. 4945–4949. External Links: Link, Document Cited by: §1.
  • [4] H. A. Bourlard and N. Morgan (2012) Connectionist speech recognition: a hybrid approach. Vol. 247, Springer Science & Business Media. Cited by: §1.
  • [5] A. Bruguier, H. Zen, and A. Arkhangorodsky (2018) Sequence-to-sequence neural network model with 2d attention for learning japanese pitch accents. In 19th Annual Conf. of Interspeech, Hyderabad, India, 2-6 Sep. 2018., pp. 1284–1287. External Links: Link, Document Cited by: §2.
  • [6] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Annual Conf. NIPS, Dec. 7-12, 2015, Montreal, Quebec, Canada, pp. 577–585. External Links: Link Cited by: §1.
  • [7] M. Elbayad, L. Besacier, and J. Verbeek (2018) Pervasive attention: 2d convolutional neural networks for sequence-to-sequence prediction. CoRR abs/1808.03867. External Links: Link, 1808.03867 Cited by: §2.
  • [8] A. Graves, S. Fernández, and J. Schmidhuber (2007) Multi-dimensional recurrent neural networks. CoRR abs/0705.2011. External Links: Link, 0705.2011 Cited by: §1.
  • [9] A. Graves (2008) Supervised sequence labelling with recurrent neural networks. Ph.D. Thesis, Technical University Munich. External Links: Link Cited by: §1, §1, §3, §3.
  • [10] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Link, Document Cited by: §1.
  • [11] H. Hutter (1995) Comparison of a new hybrid connectionist schmm approach with other hybrid approaches for speech recognition. In IEEE Inter. Conf. ICASSP, Detroit, Michigan, USA, May 08-12, pp. 3311–3314. External Links: Link, Document Cited by: §1.
  • [12] N. Kalchbrenner, I. Danihelka, and A. Graves (2015) Grid long short-term memory. CoRR abs/1507.01526. External Links: Link, 1507.01526 Cited by: §2.
  • [13] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. External Links: Link Cited by: §5.
  • [14] G. Leifert, T. Strauß, T. Grüning, and R. Labahn (2016) CITlab ARGUS for historical handwritten documents. CoRR abs/1605.08412. External Links: Link, 1605.08412 Cited by: §2.
  • [15] G. Leifert, T. Strauß, T. Grüning, W. Wustlich, and R. Labahn (2016) Cells in multidimensional recurrent neural networks. The Journal of Machine Learning Research 17, pp. 3313–3349. Cited by: §3, §3.
  • [16] B. Li, C. Parada, G. Simko, S. Chang, and T. Sainath (2017) Endpoint detection using grid long short-term memory networks for streaming speech recognition. In Proc. Interspeech 2017. Cited by: §2.
  • [17] J. Li, A. Mohamed, G. Zweig, and Y. Gong (2016) Exploring multidimensional lstms for large vocabulary ASR. In 2016 IEEE Intern. Conf. ICASSP, Shanghai, China, Mar. 20-25, 2016, pp. 4940–4944. External Links: Link, Document Cited by: §2.
  • [18] A. J. Robinson (1994) An application of recurrent nets to phone probability estimation. IEEE Trans. Neural Networks 5 (2), pp. 298–305. External Links: Link, Document Cited by: §1.
  • [19] T. N. Sainath and B. Li (2016) Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In 17th Annual Conf. of the International Speech, Interspeech, San Francisco, CA, USA, Sep. 8-12, 2016, pp. 813–817. External Links: Link, Document Cited by: §2.
  • [20] R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney (2007) Gammatone features and feature combination for large vocabulary speech recognition. In IEEE Inter. Conf. ICASSP, Honolulu, Hawaii, USA, Apr. 15-20, pp. 649–652. External Links: Link, Document Cited by: §5.
  • [21] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proc. of the 54th ACL, Aug. 7-12, Berlin, Germany, Volume 1, External Links: Link Cited by: §5.
  • [22] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. External Links: Link Cited by: §5.
  • [23] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Annual Conf. NIPS, Dec. 8-13, Montreal, Quebec, Canada, pp. 3104–3112. External Links: Link Cited by: §1.
  • [24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In IEEE Conf. CVPR, Las Vegas, NV, USA, Jun. 27-30, 2016, pp. 2818–2826. External Links: Link, Document Cited by: §5.
  • [25] S. Toshniwal, H. Tang, L. Lu, and K. Livescu (2017) Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In 18th Annual Conf. of Interspeech, Stockholm, Sweden, Aug. 20-24, 2017, pp. 3532–3536. External Links: Link Cited by: Table 2.
  • [26] P. Voigtlaender, P. Doetsch, and H. Ney (2016) Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 15th Intern. Conf. ICFHR, Shenzhen, China, Oct. 23-26, pp. 228–233. External Links: Link, Document Cited by: §2, §5.
  • [27] S. Wiesler, A. Richard, P. Golik, R. Schlüter, and H. Ney (2014) RASR/NN: the RWTH neural network toolkit for speech recognition. In IEEE Inter. Conf. ICASSP, Florence, Italy, May 4-9, 2014, pp. 3281–3285. External Links: Link, Document Cited by: §5.
  • [28] A. Zeyer, T. Alkhouli, and H. Ney (2018) RETURNN as a generic flexible neural toolkit with application to translation and speech recognition. In Proc. of ACL, Melbourne, Australia, Jul. 15-20, 2018, System Demonstrations, pp. 128–133. External Links: Link Cited by: §5.
  • [29] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter, and H. Ney (2017) A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition. In IEEE Inter. Conf. ICASSP, New Orleans, LA, USA, Mar. 5-9, 2017, pp. 2462–2466. External Links: Link, Document Cited by: §1.
  • [30] A. Zeyer, K. Irie, R. Schlüter, and H. Ney (2018) Improved training of end-to-end attention models for speech recognition. In 19th Annual Conf. Interspeech, Hyderabad, India, 2-6 Sep. 2018., pp. 7–11. External Links: Link, Document Cited by: §1, §2, Table 1, Table 2, Table 3, §5, §5, §5.
  • [31] G. Zweig, C. Yu, J. Droppo, and A. Stolcke (2017) Advances in all-neural speech recognition. In IEEE Inter. Conf. ICASSP, New Orleans, LA, USA, Mar. 5-9, pp. 4805–4809. External Links: Link, Document Cited by: Table 2.