Transformer encoder-decoder models have become popular in natural language processing. The Transformer architecture makes it possible to successfully train a deep stack of self-attention layers [2, 3, 4] via residual connections [5] and layer normalization. The positional encodings [1, 7], typically based on sinusoidal functions, are used to provide the self-attention with the sequence order information. Across various applications, systematic improvements have been reported over the standard, multi-layer long short-term memory (LSTM) recurrent neural network based models. While originally designed as an encoder-decoder architecture for machine translation, the encoder and the decoder components are also used separately, depending on whether the whole sequence is available for prediction or not.
Decoder-based models of this kind have been investigated for text generation. Recent works on training larger and deeper models [12, 14, 15] have shown further potential of the Transformer in language modeling. On the other hand, an obvious limitation of Transformers is that their memory requirement increases linearly with the number of tokens in the sequence, which makes it necessary to work with a limited context window (basically an n-gram model where the typical value of n is 512) for tasks dealing with long sequences such as character-level language modeling. Dai et al. introduced a segment-level recurrence and relative positional encoding in the Transformer language model to be able to potentially handle unlimited context.
In this work, we investigate deep autoregressive Transformers for language modeling in speech recognition. To be specific, we focus on two aspects. First, we revisit the parameter configurations of Transformers, originally engineered for the sequence-to-sequence problem, specifically for language modeling. We conduct experiments on the LibriSpeech automatic speech recognition (ASR) task for both word-level conventional speech recognition and byte-pair encoding (BPE) level end-to-end speech recognition [18, 19]. We apply our word-level models to hybrid speech recognition by lattice rescoring, and the BPE-level models to end-to-end models by shallow fusion [21, 22]. We show that well-configured Transformer language models outperform models based on a simple stack of LSTM RNN layers in terms of both perplexity and word error rate (WER).
Second, we experimentally show that the positional encoding is not needed for multi-layer autoregressive self-attention models. The visualization of the attention weights shows that when the sinusoidal positional encoding is provided with the input, the first layer of the Transformer learns to extract n-gram features (therefore making use of positional information). However, in the autoregressive problem, where a new token is provided to the model at each time step, the amount of information the model has access to strictly increases from left to right at the lowest level of the network, which should provide some positional information on its own. We observe that deep Transformer language models without positional encoding automatically make use of such information, and even give slight improvements over models with positional encodings.
2 Related Work
The first part of our work follows the spirit of the work of Al-Rfou et al. and Radford et al. [14, 15] in investigating larger and deeper Transformers for language modeling. We show that deep Transformer language models can be successfully applied to speech recognition and give good performance. The second part of this work concerns the positional encoding, which is a crucial component in the original Transformer. There are active investigations on positional encoding variants to improve self-attention (e.g., [23, 11]). Previous works on Transformer language models systematically use positional encoding, either a jointly learned one or the sinusoidal one (both are reported to give similar performance). We show that deep autoregressive self-attention models do not require any explicit model for encoding positions to give the best performance.
3 Autoregressive Self-Attention
The language model we consider is based on the decoder component of the Transformer architecture. Similar to previous work [10, 14, 12, 11, 15], we define a layer as a stack of two modules: self-attention and feed-forward modules (the latter is typically called the position-wise feed-forward module; here we omit “position-wise” as it is obvious for autoregressive models).
The autoregressive self-attention module in each layer transforms its input at each position using the query, key, and value projection matrices, layer normalization, the scaled multi-head dot product self-attention, and a projection matrix for the residual connection. The output is then fed to the feed-forward module.
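The two modules described above can be written out as the following LaTeX sketch. All symbol names ($z$, $x$, $q$, $k$, $v$, $h$, $y$, $m$, $Q$, $K$, $V$, $W_0$, $W_1$, $W_2$) are assumptions chosen to match the verbal description, and the placement of layer normalization at the input of each module is likewise an assumption rather than a statement of the exact original notation.

```latex
\begin{align}
x_t^{(l)} &= \mathrm{LayerNorm}\big(z_t^{(l-1)}\big) \\
q_t^{(l)},\; k_t^{(l)},\; v_t^{(l)} &= Q x_t^{(l)},\; K x_t^{(l)},\; V x_t^{(l)} \\
h_t^{(l)} &= \big(h_{t-1}^{(l)},\; (k_t^{(l)}, v_t^{(l)})\big) \\
y_t^{(l)} &= z_t^{(l-1)} + W_0\, \mathrm{SelfAttention}\big(h_t^{(l)}, q_t^{(l)}\big) \\
m_t^{(l)} &= \mathrm{LayerNorm}\big(y_t^{(l)}\big) \\
z_t^{(l)} &= y_t^{(l)} + W_2\, \mathrm{Activation}\big(W_1 m_t^{(l)}\big)
\end{align}
```

In this sketch, $z_t^{(l-1)}$ is the input of layer $l$ at position $t$, $h_t^{(l)}$ collects the key and value vectors of all positions up to $t$, and the two additive terms are the residual connections.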
The input of the network consists of the sum of the token embedding (word or BPE in this work) and the sinusoidal positional encoding as specified for the original Transformer. The key and value vectors computed at all predecessor positions can be seen as states of the Transformer model (whose size, as opposed to the standard RNN states, grows linearly as we progress along the position dimension). During inference, these states are stored to avoid redundant computation. During training, the computation along the position dimension is parallelized for speed-up. (In principle, we could also consider an autoregressive self-attention model which updates the states at all predecessor positions for each new input, but this would be much more computationally inefficient.)
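The role of the stored states during inference can be illustrated with a minimal sketch. It assumes a single attention head, omits layer normalization and the residual connection, and uses random toy weights; the function names (`step`, `recompute`, `demo`) are hypothetical and not taken from any toolkit. It checks that extending the stored keys and values by one entry per new token gives the same output as recomputing everything from the raw inputs.

```python
import math
import random

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention of one query over the given keys/values.
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale
                       for key in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def step(x_t, state, Wq, Wk, Wv):
    """Inference-style update: extend the state by one key/value pair."""
    keys, values = state
    keys.append(matvec(Wk, x_t))
    values.append(matvec(Wv, x_t))
    return attend(matvec(Wq, x_t), keys, values)

def recompute(xs, t, Wq, Wk, Wv):
    """Reference: recompute all keys/values for position t from the inputs."""
    keys = [matvec(Wk, x) for x in xs[: t + 1]]
    values = [matvec(Wv, x) for x in xs[: t + 1]]
    return attend(matvec(Wq, xs[t]), keys, values)

def demo(seq_len=5, dim=4, seed=0):
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

    Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
    state = ([], [])  # stored keys and values; grows by one entry per token
    cached = [step(x, state, Wq, Wk, Wv) for x in xs]
    reference = [recompute(xs, t, Wq, Wk, Wv) for t in range(seq_len)]
    return cached, reference, state
```

Calling `demo()` returns the outputs of both variants together with the final state, whose length equals the number of processed tokens. This is the linear growth of the state size mentioned above.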
4 LibriSpeech Dataset
4.1 Language Modeling Data Descriptions
The LibriSpeech data for language modeling consist of 800M words of text-only data and 960 hours of audio transcriptions, which correspond to 10M words of text. Based on analysis of count model perplexities, we observe that the audio transcription part does not contain any special domain signal matching the development set. Therefore, we simply merge the two datasets to form one training dataset for language model training. The average sentence length in the resulting training data is 21 words, with a maximum length of 600 words. The development and test sets each have two parts: dev-clean and dev-other, and test-clean and test-other. This separation is based on audio-level characteristics and therefore has no special meaning for language modeling. In the experimental section, we denote by “Dev” and “Test” the concatenation of the clean and other parts of the respective data. Both datasets consist of about 110K running words with an average of 20 words per sentence. The word-level vocabulary contains 200K words.
4.2 4-gram count and LSTM-RNN Baselines
We use the official 4-gram count language model provided with the LibriSpeech dataset. No improvement in perplexity is observed when going up to 5-grams. For LSTM-RNN language models, we first train our base configuration: the model has 2 LSTM-RNN layers with dimension 2048 and an input projection layer of 128, where dropout with a rate of 0.2 is applied between each layer. Since we observe that this model underfits the LibriSpeech training set, we further train two models: the same model without dropout, and one with 4 stacked LSTM layers without dropout. Both changes give better perplexity, though our current experiments did not show improvements from simply stacking more LSTM layers. The perplexities of these models are summarized in Table 1. We observe that a relative improvement of more than 57% is obtained by the LSTM language models over the 4-gram model.
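For reference, the two quantities used in this comparison can be computed as in the following minimal sketch. The helper names are hypothetical, and the numbers in the usage note are purely illustrative, not values from Table 1.

```python
import math

def perplexity(log_probs):
    """Perplexity from the natural-log probabilities of each running word."""
    return math.exp(-sum(log_probs) / len(log_probs))

def relative_improvement(baseline_ppl, new_ppl):
    """Relative perplexity reduction over the baseline, in percent."""
    return 100.0 * (baseline_ppl - new_ppl) / baseline_ppl
```

For example, a model that assigns probability 0.25 to every word has a perplexity of 4, and going from a hypothetical perplexity of 150 to 60 corresponds to `relative_improvement(150.0, 60.0)`, that is, a 60% relative improvement.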
5 Text based Experiments
We carry out experiments for both word-level and BPE-level language modeling. We first focus on the word-level one.
5.1 Hyper-parameters in Transformers
The Transformer architecture presents a new search space odyssey. The complete set of model hyper-parameters for the Transformer language models specified in Sec. 3 consists of the number of layers and the dimension of the residual connection, and for each layer the number of attention heads, the dimension of the key and query, the dimension of the value, and the dimension of the feed-forward layer.
In our experiments, we use the same dimension for key, query and value, as well as the residual connection. We use the same dimensionality across all layers. Therefore, our models can be fully specified by the tuple (number of layers, feed-forward dimension, residual dimension, number of heads). We do not apply any regularization method including dropout. We train all models using plain stochastic gradient descent and new-bob learning rate tuning on a single GPU. We define our training sub-epoch (for new-bob) as one tenth of the full training data. All our implementations are based on the TensorFlow-based open-source toolkit RETURNN (training config files and models will be made available at https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers).
5.2 Hyper-parameter Tuning
Given the amount of LibriSpeech training data (810M words), it is unreasonable to train all model variants until full convergence. The earlier stage of training already consistently indicates the performance of the models. Therefore, we carry out comparisons between models with different configurations at an equal number of updates that is large enough yet reasonable.
The first set of comparisons investigates the effect of depth and width. The perplexity results can be found in Table 2. All models in the table use 8 attention heads. Other parameters are specified in the table. The table is organized in two parts. The upper part of Table 2 shows the effect of the number of layers; we observe that increasing the number of layers (and therefore the number of parameters) from 1 to 42 gradually improves the perplexity. In the lower part of Table 2, we vary the number of layers, the feed-forward dimension, and the residual dimension. First of all, the 12-layer model outperforms the 6-layer model while having a similar number of parameters, which seems to indicate that depth effectively benefits Transformer language models. We also train an extreme model which has only 2 layers with wide dimensions. The number of parameters in fact blows up because of the large residual dimension, which results in a large matrix in the output softmax layer with the 200K vocabulary (we note that this is also the reason why the number of parameters of our baseline LSTM language models in Table 1 is relatively high). We observe that such wide but shallow models do not perform well.
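The effect of the residual dimension on the parameter count can be made concrete with a rough sketch. It counts weight matrices only (biases and layer normalization parameters are ignored) and assumes an untied input embedding whose dimension defaults to the residual dimension; these simplifications, the function name, and the wide 2-layer dimensions in the usage note are all assumptions for illustration. The number of heads does not appear because, with equal key, query, value and residual dimensions, the heads only split the same matrices.

```python
def transformer_lm_params(num_layers, d_ff, d_res, vocab_size, d_emb=None):
    """Rough count of weight-matrix parameters of a Transformer LM."""
    d_emb = d_res if d_emb is None else d_emb
    attention = 4 * d_res * d_res      # query, key, value, output projections
    feed_forward = 2 * d_res * d_ff    # two linear maps of the feed-forward module
    embedding = vocab_size * d_emb     # input embedding table
    if d_emb != d_res:
        embedding += d_emb * d_res     # projection from embedding to residual size
    softmax = d_res * vocab_size       # output layer over the vocabulary
    return {
        "layers": num_layers * (attention + feed_forward),
        "embedding": embedding,
        "softmax": softmax,
    }
```

Under these assumptions, `transformer_lm_params(12, 2048, 512, 200_000)` gives about 38M parameters in the layers and 102M in the softmax layer, while a hypothetical wide model `transformer_lm_params(2, 8192, 2048, 200_000)` gives about 101M in the layers but 410M in the softmax layer alone.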
However, the softmax bottleneck dimension for language modeling typically needs to be large for the best performance. In Transformers, the bottleneck dimension corresponds to the residual connection dimension, which is typically kept rather small (typically 512 or 1024). As a control experiment, we also train a model in which we insert an additional projection layer with a large dimension before the softmax layer to give larger bottleneck capacity. Table 3 shows the comparison conducted on the (12, 2048, 512, 8) model. We observe that simply enlarging the bottleneck dimension does not improve Transformer models.
A further comparison shows the effect of the number of attention heads: 16 heads, the largest number we try in this setup, give the best performance. In addition, we examine the type of activation function (Table 5). As opposed to previous work on feed-forward language models using GLUs [26, 32], we do not observe faster convergence. As the impact of the choice of activation function on the perplexity is overall limited, all our other models use the standard ReLU. As reported for the original Transformer, we confirm that both layer normalization and residual connections are needed for stable training of these models. (We tried to train multiple models without either residual connections or layer normalization. Also, following previous work, we tried reorganizing the feed-forward module to insert one additional pre-activation layer normalization and one more activation function. However, we did not observe any improvement. The original Transformers in any case do not have any activation on the residual path throughout the whole network.)
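For readers unfamiliar with the activation functions compared here, a minimal sketch of their standard definitions follows: the rectifier (ReLU), the Gaussian error linear unit (GELU) in its exact erf form, and the gated linear unit (GLU). How exactly they are wired into the feed-forward module of the models in this work is not shown; in particular, the sketch assumes that the GLU splits its input into two halves, so the preceding linear map would need twice the output width.

```python
import math

def relu(x):
    """Rectifier: max(0, v) for each component."""
    return [max(0.0, v) for v in x]

def gelu(x):
    """Gaussian error linear unit: v * Phi(v), with Phi the normal CDF."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]

def glu(x):
    """Gated linear unit: first half gated by the sigmoid of the second half."""
    half = len(x) // 2
    return [a / (1.0 + math.exp(-b)) for a, b in zip(x[:half], x[half:])]
```

Note that `glu` returns a vector of half the input length, whereas `relu` and `gelu` keep the dimension unchanged.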
5.3 Parameter Tying
Dehghani et al. report that Universal Transformers perform particularly well for language modeling. This motivates us to experiment with Transformer models which share the parameters across layers. For a Universal Transformer to have a number of parameters comparable to that of the standard deep Transformers, the dimensions in each layer must be increased, which results in slower training; here we simply investigate the effect of the number of recurrences. Table 7 shows the perplexity results. First of all, we observe that the model performance is behind that of the standard Transformer. (We note that the direct comparison here is not as straightforward as between the standard Transformers. In fact, we observe that the training hyperparameters tuned for the standard Transformers cannot be directly applied to Universal Transformers; specifically, we find it crucial to reduce the gradient norm clipping threshold from 1 to 0.1, which potentially slows down convergence.) However, we clearly observe that increasing the number of layers from 3 to 6 consistently improves the perplexity. This improvement without additional parameters motivates future work to investigate further parameter sharing strategies for Transformers.
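The idea of sharing parameters across layers can be illustrated with a toy sketch. The "layer" here is a deliberately trivial residual map with a single scalar parameter, which is an assumption made purely for illustration and bears no relation to the actual model; the point is only that applying one parameter set several times increases depth without increasing the number of parameters.

```python
def make_layer(weight):
    # Toy "layer" holding a single scalar parameter.
    return {"w": weight}

def apply_layer(layer, x):
    # Toy residual update: x + w * relu(x), applied component-wise.
    return [v + layer["w"] * max(0.0, v) for v in x]

def run_stack(layers, x):
    for layer in layers:
        x = apply_layer(layer, x)
    return x

def count_parameter_sets(layers):
    # Layers that are the same object share their parameters.
    return len({id(layer) for layer in layers})

standard = [make_layer(0.1) for _ in range(6)]  # six distinct parameter sets
shared = make_layer(0.1)
tied = [shared] * 6                             # one parameter set, applied six times
```

Here `count_parameter_sets(standard)` is 6 and `count_parameter_sets(tied)` is 1, although both stacks perform six layer applications.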
6 ASR Experiments
6.1 Lattice Rescoring Results
We apply our word-level Transformer language models to conventional hybrid speech recognition by lattice rescoring. The standard push-forward lattice rescoring algorithm for long-span language models can be directly applied to self-attention based models. The only modification from the RNN version is to define the “state” as all hidden states (Sec. 3) in all layers from all predecessor positions, together with the current position (needed for the positional encoding). Table 8 shows the WERs and perplexities (PPL). Our baseline acoustic model is based on a multi-layer bi-directional LSTM; further descriptions can be found in the corresponding references. We obtain consistent improvements in WER over the LSTM baselines.
6.2 End-to-End ASR Shallow Fusion Results
We train 10K BPE-level Transformer language models to be combined with an attention-based encoder-decoder speech model by shallow fusion [21, 22]. The 10K BPE-level training data has a longer average length of 24 tokens per sentence, with a longest sentence length of 1343, which is still manageable without any truncation for self-attention. We use the Transformer architecture (24, 2048, 512, 8). The two-layer LSTM architecture is the same as described in Sec. 4.2, without dropout. We refer to our previous work for the description of the baseline attention model; the baseline WERs, which are better than in our previous work, are obtained by improved curriculum learning and longer training. Table 9 shows both perplexities and WERs. (Following previous work, we introduced an end-of-sentence penalty and a larger beam size for shallow fusion, which gave direct improvements in WERs; we expect further improvements from tuning these parameters.) Again, we obtain consistent improvements over the LSTM baseline. These results are slightly better than the previously reported best WERs [37, 38, 39] for end-to-end models without data augmentation.
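Shallow fusion combines the two models log-linearly at each decoding step. The following minimal sketch shows the basic score combination for a single step; the function names, the language model weight of 0.3, and the toy probabilities in the usage note are assumptions for illustration, and the end-of-sentence penalty and beam search mentioned above are not included.

```python
import math

def fused_score(asr_log_prob, lm_log_prob, lm_weight):
    """Log-linear combination of speech model and language model scores."""
    return asr_log_prob + lm_weight * lm_log_prob

def pick_next_token(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Return the best token and all fused scores for one decoding step."""
    scores = {
        token: fused_score(asr_log_probs[token], lm_log_probs[token], lm_weight)
        for token in asr_log_probs
    }
    return max(scores, key=scores.get), scores
```

With toy distributions in which the speech model slightly prefers token "a" (0.5 vs. 0.4) but the language model strongly prefers "b" (0.6 vs. 0.1), the fused decision with weight 0.3 is "b", whereas with weight 0 it remains "a".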
7 Analysis
Compared with hidden states in RNNs, attention weights are easier to visualize, which provides an opportunity for analysis. In particular, we focus on the comparison of the Transformer language models with and without positional encoding.
7.1 Transformer LM without positional encoding
In the autoregressive problem, where a new token is provided to the model at each time step, the amount of information the model has access to strictly increases from left to right at the lowest level of the network; the deeper layers should be able to recognize this structure, which should provide the model with some positional information on its own. To check this hypothesis, we train models without any positional encoding. First, we observe that they give better perplexities than the models with sinusoidal positional encoding (Table 10).
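One simple way to see how positional information can arise without any positional encoding is sketched below. It is a toy construction, not a claim about what the trained models actually compute: if a head attends uniformly over all visible positions, and only the first token (e.g. a sentence-begin token) carries a feature value of 1, then the attention output at position t is 1/(t+1), from which the position can be read off.

```python
def causal_uniform_average(values):
    """Average of all values up to and including each position."""
    outputs, running = [], 0.0
    for t, v in enumerate(values):
        running += v
        outputs.append(running / (t + 1))
    return outputs

# Toy feature: 1.0 for the first (sentence-begin) token, 0.0 for all others.
features = [1.0, 0.0, 0.0, 0.0, 0.0]
averages = causal_uniform_average(features)          # 1, 1/2, 1/3, 1/4, 1/5
positions = [round(1.0 / a) - 1 for a in averages]   # recovered positions
```

The recovered `positions` equal the true indices 0 to 4, which shows that the growing left context alone is enough, in principle, to distinguish positions.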
7.2 First layer
The attention in the first layer is the most straightforward to interpret because the feature at each position exactly corresponds to the word at that position (while deeper layers can potentially shuffle the feature content). The attention weights in the first layer of 24-layer Transformer language models with and without positional encodings are visualized in Figure 1. We observe that the first layer of the model with positional encoding (Figure 1(a)) learns to create n-gram features (roughly 2- or 3-gram), which indicates that the positional information is directly used. In contrast, the first layer of the model without positional encoding learns to focus on the new input token, as can be seen from the diagonal in Figure 1(b) (interestingly, we also see that it ignores some function words such as “the”, “and”, “to”, which might be modeled by some offset values, therefore attending to the beginning-of-sentence token instead), which demonstrates that the model is aware of the position of the new input.
7.3 Other layers
We observe that the behavior of the other layers is rather similar for Transformer models with and without positional encoding. We find 3 categories of layers among the other 23 layers. The second and third layers are “blur” layers, as shown in Figure 1(c), which seem to roughly average over all positions (while we can also see that some heads focus on difficult words, here “verandah”). Layers 4 to 9 are window layers which focus on the local n-gram. A representative example is shown in Figure 1(d). Finally, we find the top layers 10 to 24 to be more structured, attending to some specific patterns; an example is shown in Figure 1(e).
8 Conclusion
We apply deep Transformer language models to speech recognition. We show that such models outperform the shallow stack of LSTM-RNNs on both word-level and BPE-level modeling. Future work will investigate the application of crucial components of deep Transformers (such as layer normalization) to deeper LSTM models, e.g., the RNMT+ decoder architecture for language modeling. Furthermore, we do not apply any regularization to the models for the large LibriSpeech task, as no overfitting is observed in the range of model sizes we experimented with (for the word-level models). We can possibly still improve our models simply by scaling up their size and using regularization.
9 Acknowledgements
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694537, project “SEQCLAS”) and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains. We thank Liuhui Deng for contributing to our lattice rescoring code based on the Tensorflow C++ API, Arne Nix and Julian Schamper for sharing their base Transformer configs, and Eugen Beck and Christoph Lüscher for help with generating lattices.
References
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 5998–6008.
-  J. Cheng, L. Dong, and M. Lapata, “Long short-term memory-networks for machine reading,” in Proc. Conf. on Empirical Methods in Nat. Lang. Processing (EMNLP), Austin, TX, USA, Nov. 2016, pp. 551–561.
-  Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” Int. Conf. on Learning Representations (ICLR), Apr. 2017.
-  A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” in Proc. Conf. on Empirical Methods in Nat. Lang. Processing (EMNLP), Austin, TX, USA, Nov. 2016, pp. 2249–2255.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. on Computer Vision and Patt. Recog. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proc. ICML, Sydney, Australia, Aug. 2017, pp. 1243–1252.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. North American Chap. of the Assoc. for Comput. Ling. on Human Lang. Tech. (NAACL-HLT), Minneapolis, USA, Jun. 2019.
-  P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Ł. Kaiser, and N. Shazeer, “Generating wikipedia by summarizing long sequences,” in Int. Conf. on Learning Representations (ICLR), Vancouver, Canada, Apr. 2018.
-  Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
-  R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones, “Character-level language modeling with deeper self-attention,” in Proc. AAAI Conf. on Artif. Int., Honolulu, HI, USA, Jan. 2019.
-  A. Baevski and M. Auli, “Adaptive input representations for neural language modeling,” in ICLR, New Orleans, LA, USA, May 2019.
-  A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” [Online]. Available: https://blog.openai.com/language-unsupervised/, 2018.
-  A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” [Online]. Available: https://blog.openai.com/better-language-models/, 2019.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, Apr. 2015, pp. 5206–5210.
-  R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proc. ACL, Berlin, Germany, Aug. 2016, pp. 1715–1725.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: a neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, Shanghai, China, Mar. 2016, pp. 4960–4964.
-  A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” in Proc. Interspeech, Hyderabad, India, Sep. 2018, pp. 7–11.
-  M. Sundermeyer, Z. Tüske, R. Schlüter, and H. Ney, “Lattice decoding and rescoring with long-span neural network language models,” in Interspeech, Singapore, Sep. 2014, pp. 661–665.
-  Ç. Gülçehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” Computer Speech & Language, vol. 45, pp. 137–148, Sep. 2017.
-  S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, and K. Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. SLT, Athens, Greece, Dec. 2018.
-  P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proc. NAACL, New Orleans, LA, USA, Jun. 2018, pp. 464–468.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. Int. Conf. on Machine Learning (ICML), Haifa, Israel, Jun. 2010, pp. 807–814.
-  D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” arXiv preprint arXiv:1606.08415, 2018.
-  Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, Sydney, Australia, Aug. 2017, pp. 933–941.
-  M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Proc. Interspeech, Portland, OR, USA, Sep. 2012, pp. 194–197.
-  K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, 2017.
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin et al., “Tensorflow: A system for large-scale machine learning,” in Proc. USENIX Sympo. on Operating Systems Design and Impl. (OSDI 16), Savannah, GA, USA, Nov. 2016, pp. 265–283.
-  A. Zeyer, T. Alkhouli, and H. Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in Proc. Assoc. for Computational Linguistics (ACL), Melbourne, Australia, Jul. 2018.
-  Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen, “Breaking the softmax bottleneck: A high-rank RNN language model,” in ICLR, Vancouver, Canada, Apr. 2018.
-  K. Irie, Z. Lei, R. Schlüter, and H. Ney, “Prediction of LSTM-RNN full context states as a subtask for N-gram feedforward language models,” in ICASSP, Calgary, Canada, Apr. 2018, pp. 6104–6108.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. European Conf. on Computer Vision (ECCV), Amsterdam, Netherlands, Oct. 2016, pp. 630–645.
-  M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, “Universal Transformers,” in Int. Conf. on Learning Representations (ICLR), New Orleans, LA, USA, May 2019.
-  A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter, and H. Ney, “A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, Mar. 2017, pp. 2462–2466.
-  C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, “RWTH ASR systems for LibriSpeech: Hybrid vs Attention,” submitted to Interspeech 2019, Graz, Austria, Sep. 2019.
-  A. Hannun, A. Lee, Q. Xu, and R. Collobert, “Sequence-to-sequence speech recognition with time-depth separable convolutions,” arXiv preprint arXiv:1904.02619, 2019.
-  N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, “Fully convolutional speech recognition,” arXiv preprint arXiv:1812.06864, 2018.
-  K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach, and P. Nguyen, “Model unit exploration for sequence-to-sequence speech recognition,” arXiv preprint arXiv:1902.01955, 2019.
-  M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes, “The best of both worlds: Combining recent advances in neural machine translation,” in ACL, Melbourne, Australia, Jul. 2018, pp. 76–86.