Conventional automatic speech recognition (ASR) systems based on Gaussian mixture models (GMMs) or hybrid deep neural network (DNN) hidden Markov models (HMMs) consist of several components that are trained separately, depend on pretrained alignments and require a complex search [11, 18, 4, 29]. Unlike these conventional approaches, attention-based sequence-to-sequence models constitute a single standalone neural network that is trained end-to-end, needs no explicit alignments or context-dependent phonetic labels as in HMMs, and simplifies inference. In these models, an implicit probabilistic notion of alignment is used as part of the neural network; however, it does not behave the same way as the alignment models of the conventional methods.
The widely used attention-based sequence-to-sequence systems are based on an encoder-decoder architecture, where one or more long short-term memory (LSTM) layers read the observation sequence and another LSTM decodes it to a variable length output sequence of characters or words. In such architectures, both input and output sequences are separately handled as a one-dimensional sequence over time. An attention mechanism is then added into the architecture to combine the encoder and the decoder by allowing the decoder to selectively focus on individual parts of the encoder state sequences [23, 2, 6, 3, 30].
The LSTM is well suited for sequence modeling, where the sequence is strongly correlated along a one-dimensional time axis. Handling dynamic lengths, encoding positional information, exploiting the previous context and tracking long-term dependencies via its gating strategy are some of the properties that make the LSTM appropriate for sequence-to-sequence modeling. Although an LSTM is essentially a one-dimensional model, it can be extended to process multi-dimensional data such as images or videos.
In this work, we investigate the use of the two-dimensional LSTM (2DLSTM) [9, 8] in sequence-to-sequence modeling as an alternative to the attention component. In this architecture, we apply a 2DLSTM on top of a deep bidirectional encoder to relate input and output representations in a 2D space. One dimension of the 2DLSTM processes the input sequence, and the other predicts the output (sub)words. In contrast to the attention-based sequence-to-sequence model, where the encoder states are fixed and cannot be re-interpreted while decoding, this model recomputes the encoding of the observation sequence as a function of the previously generated transcribed words. Our model is similar to an architecture used in machine translation described in . We believe that the 2DLSTM is able to capture the necessary monotonic alignments as well as maintain a notion of coverage internally in its cell states. Experimental results on the 300h-Switchboard task show competitive performance compared to an attention-based sequence-to-sequence system.
2 Related Work
A way of building multidimensional context into recurrent networks is provided by a strategy based on networks with tree-structured update graphs. In handwriting recognition (HWR), the 2DLSTM has shown successful results over convolutional neural networks (CNNs) in automatically extracting features from raw 2D images. In order to investigate deeper and larger 2DLSTM models, an algorithm exploiting GPU power has been implemented.
Different neural network architectures have been proposed in ASR to model 2D correlations in the input signal. One of them is a 2DLSTM layer that scans the input jointly over time and frequency for spatio-temporal modeling, aggregating more variations. Moreover, various architectures to model time-frequency patterns based on deep DNN, CNN, RNN and 2DLSTM layers have been compared for large vocabulary ASR.
As an alternative to the 2DLSTM concept, a network of one-dimensional LSTM cells arranged in a multidimensional grid has been introduced. In this topology, the LSTM cells communicate not only along the time sequence but also between the layers. The grid LSTM network has also been applied to the endpoint detection task in ASR to model both spectral and temporal variations. A 2D attention matrix has likewise been applied in a neural pitch accent recognition model, in which graphemes are encoded in one dimension and audio frames in the other.
Recently, the 2DLSTM layer has also been used for sequence-to-sequence modeling in machine translation, where it implicitly updates the source representation conditioned on the generated target words. In a similar direction, a 2D CNN-based network has been proposed in which the positions of the source and target words define a 2D grid for translation modeling.
Similar to , we apply a 2DLSTM layer to combine the acoustic model (the LSTM encoder) and the language model (the decoder) without any attention component. The 2DLSTM reconciles the context from both the input and the output sequences and re-interprets the encoder states whenever a new word is predicted. Compared to , our model is much deeper. We use max-pooling to select the most relevant encoder state, whereas that work uses the last horizontal state of the 2DLSTM. Furthermore, we utilize the pretraining scheme explained in  during training, and a faster decoding.
3 2D Long Short-Term Memory
The 2DLSTM is characterized as a general form of the standard LSTM [9, 15]. It has been proposed to process inherently two-dimensional data of arbitrary lengths; therefore, it uses both horizontal and vertical recurrences. The building blocks of both the LSTM and the 2DLSTM are shown in Figure 1. At grid position $(i, j)$, the 2DLSTM receives an input $x_{i,j}$, and its computation relies on both the vertical hidden state $s_{i-1,j}$ and the horizontal hidden state $s_{i,j-1}$. Besides the input, forget and output gates that are similar to those of the LSTM, the 2DLSTM employs an additional lambda gate. As written in Equation 5, its activation is computed analogously to the other gates [1, 9].
The internal cell state $c_{i,j}$ is computed from the sum of the two previous cell states $c_{i,j-1}$ and $c_{i-1,j}$, weighted by the lambda gate and its complement (see Equation 3). As in the LSTM, the internal cell state is combined with the output gate to yield the hidden state. $\tanh$ and $\sigma$ denote the hyperbolic tangent and sigmoid functions, and $W$, $U$ and $V$ are the weight matrices. For notational simplicity, we omit the bias vectors.
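As a concrete illustration, the following NumPy sketch implements one 2DLSTM cell step under the notation above. The dictionary-of-matrices layout and the assignment of the lambda gate to the horizontal predecessor cell are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def twodlstm_cell(x, s_horiz, s_vert, c_horiz, c_vert, W, U, V):
    """One 2DLSTM step at grid position (i, j).

    x               : input vector at (i, j)
    s_horiz, c_horiz: hidden/cell state of the horizontal neighbor (i, j-1)
    s_vert,  c_vert : hidden/cell state of the vertical neighbor   (i-1, j)
    W, U, V         : dicts of weight matrices for the candidate 'g' and
                      the gates 'i', 'f', 'o', 'l' (lambda); biases are
                      omitted, as in the text.
    """
    def affine(k):
        return W[k] @ x + U[k] @ s_horiz + V[k] @ s_vert

    g   = np.tanh(affine("g"))   # candidate cell input
    i   = sigmoid(affine("i"))   # input gate
    f   = sigmoid(affine("f"))   # forget gate
    o   = sigmoid(affine("o"))   # output gate
    lam = sigmoid(affine("l"))   # lambda gate, computed like the others (Eq. 5)

    # Eq. 3: the two predecessor cell states are mixed by the lambda gate
    # and its complement before the usual forget/input update.
    c = f * (lam * c_horiz + (1.0 - lam) * c_vert) + i * g
    s = o * np.tanh(c)           # output gate yields the hidden state
    return s, c
```

This keeps the one-dimensional LSTM as a special case: with the vertical recurrence and the lambda mixing removed, the update reduces to the standard cell.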
4 2D Sequence-to-Sequence Model
Bayes decision rule requires maximization of the class posterior given an input observation. In ASR, the classes are discrete label sequences of unknown length (e.g. word, subword or character sequences), denoted as $w_1^N$. Given an input observation sequence $x_1^T$ of variable length, where usually $T \gg N$, the posterior probability of a label sequence is defined as $p(w_1^N \mid x_1^T)$. This conditional distribution usually covers the alignment between the input observation sequence and the output word sequence either implicitly or explicitly.
In the attention-based sequence-to-sequence approach, the attention weights serve as an implicit probabilistic notion of alignment between output labels and encoder states. The freedom of the attention model to focus on the entire input sequence might contradict the monotonicity of alignments in ASR. In this work, we remove the attention component and investigate whether 2D sequence-to-sequence modeling is able to properly capture the monotonic input-output relation.
As shown in Figure 2, we apply a deep bidirectional LSTM encoder to scan an observation sequence. On top of each bidirectional LSTM layer, we conduct max-pooling over the time dimension to reduce the observation length, yielding encoder states $h_1^{T'}$, where $T'$ is the length reduced by a total reduction factor. Similar to , we then equip the network with a 2DLSTM layer to relate the encoder and the decoder states. At time step $t$, the 2DLSTM receives both the encoder state $h_t$ and the embedding vector of the last target word as inputs. One dimension of the 2DLSTM (the horizontal axis in the figure) sequentially reads the encoder states, and the other (the vertical axis) plays the role of the decoder; therefore, there is no additional decoder LSTM. Unlike the attention-based sequence-to-sequence model, where the encoder states are computed once at the beginning, our model repeatedly updates the encoder representations while generating a new output word. We note that this model does not use any attention component. The 2DLSTM state is derived from the current encoder state, the previous target embedding, and its horizontal and vertical predecessor states.
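The time reduction between encoder layers can be sketched as follows; truncating the remainder is an assumption here (the toolkit used may pad instead), and repeating factor-2 pooling three times composes into the overall reduction factor of 8 used in the experiments.

```python
import numpy as np

def max_pool_time(h, r):
    """Max-pool a (T, D) state sequence over time by factor r.

    T is truncated to a multiple of r; the exact handling of the
    remainder is an implementation detail left open by the text.
    """
    T, D = h.shape
    T_red = T // r
    return h[: T_red * r].reshape(T_red, r, D).max(axis=1)

# Three pooling steps of factor 2 after successive bidirectional LSTM
# layers give an overall time reduction of 8.
T, D = 80, 4
h = np.random.randn(T, D)
for _ in range(3):
    h = max_pool_time(h, 2)
print(h.shape)  # (10, 4)
```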
Note that the 2DLSTM state at a label/word step $i$ depends only on the preceding word sequence $w_1^{i-1}$, while it takes into account the whole temporal context of the input observation sequence.
At each decoder step, once the whole input sequence has been processed from $1$ to $T'$, we apply max-pooling over all horizontal states to obtain the context vector. We have also tried average-pooling and taking the last horizontal state instead of max-pooling, but neither performs better in this case. In order to generate the next output word, a transformation followed by a softmax operation is applied to the context vector.
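This output step can be sketched minimally as below; the single projection matrix `W_out` is a hypothetical stand-in for the transformation, which the text does not specify further.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def output_distribution(S_row, W_out):
    """Map one decoder row of 2DLSTM states to an output distribution.

    S_row : (T_red, D) horizontal states of the current decoder row
    W_out : (V, D) hypothetical output projection

    Max-pooling over the horizontal (time) axis yields the context
    vector, which is projected and normalized with a softmax.
    """
    context = S_row.max(axis=0)      # max-pooling over horizontal states
    return softmax(W_out @ context)
```

Average-pooling or taking `S_row[-1]` would be the one-line variants the text reports as not performing better.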
5 Experimental Results
We have conducted experiments on the Switchboard 300h task. We apply 40-dimensional Gammatone features using the RASR feature extractor. We use the full Hub5’00, including Switchboard (SWB) and CallHome (CH), as the development set and Hub5’01 as a test set. In order to enable an open-vocabulary system, we use byte-pair encoding (BPE) with 1k merge operations.
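The BPE merge-learning procedure of Sennrich et al. can be sketched in a few lines; the toy corpus and the end-of-word marker `</w>` follow the common formulation of the algorithm, not the exact tooling used in these experiments.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations: repeatedly merge the most frequent
    adjacent symbol pair. The paper applies 1k such merges to the
    training transcriptions; here a toy corpus illustrates the idea."""
    # each word as a tuple of characters with an end-of-word marker
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges
```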
As our baseline, we utilize an attention-based sequence-to-sequence architecture similar to that described in , with the exact same pretraining scheme and the same reduction factor. The baseline model includes a one-layer LSTM decoder with additive attention equipped with fertility feedback.
The feature vectors are passed into a stack of 6 bidirectional LSTM layers of size 1000 in each direction, followed by max-pooling operations. We downsample the input sequence by a total factor of 8, as described in . The 2DLSTM layer is equipped with 1000 nodes, and the output subwords are projected into a 620-dimensional embedding space. The models are trained end-to-end using the Adam optimizer, dropout, label smoothing and a warmup technique. We reduce the learning rate by a factor of 0.7 following a variant of the Newbob scheme, based on the perplexity on the development set over a few checkpoints.
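A Newbob-style decay of this kind can be sketched as follows; the `patience` window and the improvement threshold are illustrative assumptions, since the exact criterion is not spelled out here.

```python
def newbob_decay(lr, dev_ppls, factor=0.7, patience=3, rel_improvement=0.0):
    """Multiply the learning rate by `factor` when the development-set
    perplexity has not improved over the last `patience` checkpoints.

    dev_ppls: list of dev perplexities, one per checkpoint, oldest first.
    """
    if len(dev_ppls) <= patience:
        return lr  # not enough history to judge
    best_recent = min(dev_ppls[-patience:])
    best_before = min(dev_ppls[:-patience])
    # no (sufficient) improvement in the recent window -> decay
    if best_recent >= best_before * (1.0 - rel_improvement):
        return lr * factor
    return lr
```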
In our training, we use layer-wise pretraining for the encoder, where we start with two encoder layers and a single max-pool in between, with the same multiple-step reduction factor as in . Decoding is performed using beam search, and the subwords are merged into words. We do not utilize any language model (LM), neither in the baseline system nor in the 2D sequence-to-sequence model. The model is built using our in-house CUDA implementation of the 2DLSTM, utilizing optimized speedups in RETURNN. The code is open source and the configurations of the setups are available online at https://github.com/rwth-i6/returnn.
Table 1 compares the total number of parameters, perplexity and frame error rate (FER) on the development set between our model and the attention baseline. Both models have the same vocabulary size of almost 1K. Our model has 3M more parameters; the perplexity and the FER are comparable. We also compare our model against prior works based on the WER, listed in Table 2. As a simple significance test, the reported WERs are averaged over 3 runs. Although our 2D sequence-to-sequence model still lags behind the hybrid methods, it achieves competitive results against the attention baseline, and it outperforms the baseline on the Hub5’01 subset. We expect further improvements from including a separate LM in the search.
We also compare our model and the attention-based sequence-to-sequence model in terms of decoding speed. Since the whole output label sequence is known during training, the entire grid of 2DLSTM states can be computed at once, and at each step one row of it is taken. This cannot be done as a single operation during search, since the output sequence has yet to be predicted; therefore, during decoding, we compute the 2DLSTM states row by row, which slows down the search procedure. This algorithm is still faster than that of , where at each output step all previous 2DLSTM states are recomputed from scratch, which is not required. Table 3 lists the decoding speed of the models on the entire development set using a single GPU. In general, the decoding speed of our model is about 6 times slower than that of a standard attention-based model.
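The row-wise computation described above can be sketched as follows; `step_fn` and `emit_fn` are hypothetical stand-ins for the 2DLSTM cell and the output projection, and greedy generation replaces the actual beam search.

```python
import numpy as np

def decode_rowwise(encoder_states, step_fn, emit_fn, y0, num_steps):
    """Row-wise 2DLSTM decoding sketch.

    At each output step, only one new row of the 2DLSTM grid is
    computed, reusing the cached vertical states of the previous row
    rather than recomputing the whole grid from scratch.
    """
    T, D = encoder_states.shape
    s_below = np.zeros((T, D))   # vertical hidden states from the previous row
    c_below = np.zeros((T, D))   # vertical cell states from the previous row
    y = y0
    outputs = []
    for _ in range(num_steps):
        s_left = np.zeros(D)
        c_left = np.zeros(D)
        s_row = np.zeros((T, D))
        c_row = np.zeros((T, D))
        for j in range(T):       # sweep the horizontal (time) axis
            x = np.concatenate([encoder_states[j], y])
            s_left, c_left = step_fn(x, s_left, s_below[j], c_left, c_below[j])
            s_row[j], c_row[j] = s_left, c_left
        s_below, c_below = s_row, c_row    # cache this row for the next one
        context = s_row.max(axis=0)        # max-pool over horizontal states
        y = emit_fn(context)
        outputs.append(y)
    return outputs
```

Each output step thus costs one horizontal sweep of length T, instead of the i sweeps a from-scratch recomputation would need at step i.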
[Table 3: model vs. decoding speed (mins)]
6 Conclusion
We have applied a simple 2D sequence-to-sequence model as an alternative to the attention-based model. In our model, a 2DLSTM layer jointly combines the input and the output representations: it processes the observation sequence along the horizontal dimension and generates the output (sub)word sequence along the vertical axis. It has no additional LSTM decoder and does not rely on any attention component. Contrary to the attention-based sequence-to-sequence model, it repeatedly re-encodes the encoder representation whenever a new output (sub)word is generated. The experimental results are competitive with the baseline on the 300h-Switchboard Hub5’00 and show improvements on Hub5’01. Our future goal is to develop a bidirectional 2DLSTM so that the model is completely independent of standard LSTM layers, as well as to run more experiments on various speech tasks.
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694537, project ”SEQCLAS”) and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains.
References
- (2018) Towards two-dimensional sequence to sequence model in neural machine translation. arXiv preprint arXiv:1810.03975.
- (2015) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- (2016) End-to-end attention-based large vocabulary speech recognition. In IEEE Inter. Conf. ICASSP, Shanghai, China, Mar. 20-25, 2016, pp. 4945–4949.
- (2012) Connectionist speech recognition: a hybrid approach. Vol. 247, Springer Science & Business Media.
- (2018) Sequence-to-sequence neural network model with 2D attention for learning Japanese pitch accents. In 19th Annual Conf. of Interspeech, Hyderabad, India, Sep. 2-6, 2018, pp. 1284–1287.
- (2015) Attention-based models for speech recognition. In Annual Conf. NIPS, Dec. 7-12, 2015, Montreal, Quebec, Canada, pp. 577–585.
- (2018) Pervasive attention: 2D convolutional neural networks for sequence-to-sequence prediction. CoRR abs/1808.03867.
- (2007) Multi-dimensional recurrent neural networks. CoRR abs/0705.2011.
- (2008) Supervised sequence labelling with recurrent neural networks. Ph.D. Thesis, Technical University Munich.
- (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
- (1995) Comparison of a new hybrid connectionist SCHMM approach with other hybrid approaches for speech recognition. In IEEE Inter. Conf. ICASSP, Detroit, Michigan, USA, May 8-12, pp. 3311–3314.
- (2015) Grid long short-term memory. CoRR abs/1507.01526.
- (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
- (2016) CITlab ARGUS for historical handwritten documents. CoRR abs/1605.08412.
- (2016) Cells in multidimensional recurrent neural networks. The Journal of Machine Learning Research 17, pp. 3313–3349.
- (2017) Endpoint detection using grid long short-term memory networks for streaming speech recognition. In Proc. Interspeech 2017.
- (2016) Exploring multidimensional LSTMs for large vocabulary ASR. In IEEE Inter. Conf. ICASSP, Shanghai, China, Mar. 20-25, 2016, pp. 4940–4944.
- (1994) An application of recurrent nets to phone probability estimation. IEEE Trans. Neural Networks 5 (2), pp. 298–305.
- (2016) Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In 17th Annual Conf. of Interspeech, San Francisco, CA, USA, Sep. 8-12, 2016, pp. 813–817.
- (2007) Gammatone features and feature combination for large vocabulary speech recognition. In IEEE Inter. Conf. ICASSP, Honolulu, Hawaii, USA, Apr. 15-20, pp. 649–652.
- (2016) Neural machine translation of rare words with subword units. In Proc. of the 54th ACL, Berlin, Germany, Aug. 7-12, Volume 1.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- (2014) Sequence to sequence learning with neural networks. In Annual Conf. NIPS, Dec. 8-13, Montreal, Quebec, Canada, pp. 3104–3112.
- (2016) Rethinking the inception architecture for computer vision. In IEEE Conf. CVPR, Las Vegas, NV, USA, Jun. 27-30, 2016, pp. 2818–2826.
- (2017) Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In 18th Annual Conf. of Interspeech, Stockholm, Sweden, Aug. 20-24, 2017, pp. 3532–3536.
- (2016) Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 15th Intern. Conf. ICFHR, Shenzhen, China, Oct. 23-26, pp. 228–233.
- (2014) RASR/NN: the RWTH neural network toolkit for speech recognition. In IEEE Inter. Conf. ICASSP, Florence, Italy, May 4-9, 2014, pp. 3281–3285.
- (2018) RETURNN as a generic flexible neural toolkit with application to translation and speech recognition. In Proc. of ACL, Melbourne, Australia, Jul. 15-20, 2018, System Demonstrations, pp. 128–133.
- (2017) A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition. In IEEE Inter. Conf. ICASSP, New Orleans, LA, USA, Mar. 5-9, 2017, pp. 2462–2466.
- (2018) Improved training of end-to-end attention models for speech recognition. In 19th Annual Conf. Interspeech, Hyderabad, India, Sep. 2-6, 2018, pp. 7–11.
- (2017) Advances in all-neural speech recognition. In IEEE Inter. Conf. ICASSP, New Orleans, LA, USA, Mar. 5-9, pp. 4805–4809.