Language Modeling with Deep Transformers

05/10/2019
by Kazuki Irie, et al.

We explore multi-layer autoregressive Transformer models for language modeling in speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well-configured Transformer models outperform our baseline models based on a shallow stack of LSTM recurrent neural network layers. We carry out experiments on the open-source LibriSpeech 960hr task, for both 200K-vocabulary word-level and 10K byte-pair encoding subword-level language modeling. We apply our word-level models to conventional hybrid speech recognition by lattice rescoring, and the subword-level models to attention-based encoder-decoder models by shallow fusion. Second, we show that deep Transformer language models do not require positional encoding. Positional encoding is an essential augmentation for the self-attention mechanism, which is otherwise invariant to sequence ordering. However, in the autoregressive setup, as is the case for language modeling, the amount of available context grows along the position dimension, which is itself a positional signal. Analysis of the attention weights shows that deep autoregressive self-attention models can automatically make use of this positional information. We find that removing the positional encoding even slightly improves the performance of these models.
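To make the point about positional encoding concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a decoder-only, autoregressive Transformer language model that adds no positional encoding to the token embeddings; the only ordering information comes from the causal attention mask. The hyperparameters and the toy 10K vocabulary are illustrative assumptions.

# Minimal sketch: decoder-only Transformer LM without positional encoding.
# Hyperparameters (d_model, n_heads, n_layers, d_ff) are illustrative, not the paper's.
import torch
import torch.nn as nn

class CausalTransformerLM(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, n_layers=6, d_ff=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, batch_first=True
        )
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        seq_len = tokens.size(1)
        # Causal (autoregressive) mask: position i may only attend to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.embed(tokens)        # note: no positional encoding is added here
        x = self.layers(x, mask=mask)
        return self.out(x)            # next-token logits

# Usage: next-token logits for a toy batch over a 10K-subword vocabulary.
model = CausalTransformerLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 16)))  # shape (2, 16, 10000)

In this setup, the number of positions each token can attend to grows with its index, which is exactly the implicit positional signal the abstract argues deep autoregressive models can exploit on their own.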


