
Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

by Ching-Feng Yeh, et al.

We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference and 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieves word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is the input sequence length.



1 Introduction

There has been significant progress on automatic speech recognition (ASR) technologies over the past few years due to the adoption of deep neural networks [1]. Conventionally, speech recognition systems involve individual components for explicit modeling on different levels of signal transformation: acoustic models for audio to acoustic units, a pronunciation model for acoustic units to words and a language model for words to sentences. This framework is often referred to as the "traditional" hybrid system. Individual components in the hybrid system can be optimized separately. For example, CD-DNN-HMM [1] focuses on maximizing the likelihood between acoustic signals and acoustic models with frame-level alignments. For language modeling, both statistical n-gram models [2] and, more recently, neural-network-based models [3] aim to model purely the connection between word tokens.

Hybrid systems achieved significant success [4] but also present challenges. For example, hybrid systems require more human intervention in the building process, including the design of acoustic units, the vocabulary, the pronunciation model and more. In addition, an accurate hybrid system often comes at the cost of higher computational complexity and memory consumption, which increases the difficulty of deploying hybrid systems in resource-limited scenarios such as on-device speech recognition. Given these challenges, interest in end-to-end approaches for speech recognition has surged recently [5, 6, 7, 8, 9, 10, 11, 12]. Different from hybrid systems, end-to-end approaches aim to model the transformation from audio signal to word tokens directly, so the model becomes simpler and requires less human intervention. In addition to the simplicity of the training process, end-to-end systems have also demonstrated promising recognition accuracy [11]. Among the many end-to-end approaches, the recurrent neural network transducer (RNN-T) [5, 6] shows promising potential in footprint, accuracy and efficiency. In this work, we explore options for further improvements based on RNN-T.

Recurrent neural networks (RNNs) such as long short-term memory (LSTM) [13] networks are good at sequence modeling and widely adopted for speech recognition. RNNs rely on the recurrent connection from the previous state $h_{t-1}$ to the current state $h_t$ to propagate contextual information. This recurrent connection is effective but also presents challenges. For example, since $h_t$ depends on $h_{t-1}$, RNNs are difficult to compute in parallel. In addition, $h_t$ is usually of fixed dimensions, which means all historical information is condensed into a fixed-length vector, making long contexts difficult to capture. The attention mechanism [14, 15] was introduced recently as an alternative for sequence modeling. Compared with RNNs, the attention mechanism is non-recurrent and can be computed in parallel easily. In addition, the attention mechanism can "attend" to longer contexts explicitly. With the attention mechanism, the Transformer model [14] achieved state-of-the-art performance in many sequence-to-sequence tasks [15, 16].

In this paper, we explore options to apply Transformer networks in the neural transducer framework. VGG networks [17] with causal convolution are adopted to incorporate contextual information into the Transformer networks and reduce the frame rate for efficient inference. In addition, we use truncated self-attention to enable streaming inference and reduce computational complexity.

2 Neural Transducer (RNN-T)

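By nature, speech recognition is a sequence-to-sequence (audio-to-text) task in which the lengths of input and output sequences can vary. As an end-to-end approach, connectionist temporal classification (CTC) [9] was introduced before RNN-T to model such sequence-to-sequence transformation. Given input sequence $x = (x_1, \ldots, x_T)$, where $x_t \in \mathbb{R}^d$ and $T$ is the input sequence length, and output sequence $y = (y_1, \ldots, y_U)$, where $y_u \in \mathcal{V}$ represent output symbols and $U$ is the output sequence length, CTC introduces an additional "blank" label $\varnothing$ and models the posterior probability of $y$ given $x$ by:

$$P_{\text{ctc}}(y \mid x) = \sum_{\hat{y} \in \mathcal{A}_{\text{ctc}}(x, y)} \prod_{i=1}^{T} P(\hat{y}_i \mid x)$$

where $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_T) \in \mathcal{A}_{\text{ctc}}(x, y) \subset (\mathcal{V} \cup \{\varnothing\})^T$ correspond to any possible paths such that merging repeated consecutive symbols of $\hat{y}$ and then removing $\varnothing$ yields $y$.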
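The mapping from a frame-level CTC path to an output sequence can be illustrated with a minimal Python sketch. The blank token string `"<b>"` and the function name `ctc_collapse` are hypothetical names chosen for this example, not taken from the paper.

```python
from itertools import groupby

BLANK = "<b>"  # hypothetical blank label used only in this sketch


def ctc_collapse(path):
    """Map a frame-level CTC path to its output symbol sequence."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(path)]
    # Step 2: drop every blank label.
    return [label for label in merged if label != BLANK]


# A blank between the two "a" labels keeps them as separate symbols.
path = ["<b>", "c", "c", "<b>", "a", "a", "<b>", "a", "t"]
print(ctc_collapse(path))
```

Many different paths collapse to the same output sequence, which is why the CTC posterior sums over all of them.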

The formulation of CTC assumes that symbols in the output sequence are conditionally independent of one another given the input sequence. The RNN-T model improves upon CTC by making the output symbol distribution at each step dependent on the input sequence and previous non-blank output symbols in the history:

$$P_{\text{rnnt}}(y \mid x) = \sum_{\hat{y} \in \mathcal{A}_{\text{rnnt}}(x, y)} \prod_{i=1}^{T+U} P(\hat{y}_i \mid x, y_0, \ldots, y_{u_{i-1}})$$

where $\hat{y} = (\hat{y}_1, \ldots, \hat{y}_{T+U}) \in \mathcal{A}_{\text{rnnt}}(x, y) \subset (\mathcal{V} \cup \{\varnothing\})^{T+U}$ correspond to any possible paths such that removing the blank label $\varnothing$ from $\hat{y}$ yields $y$, $u_{i-1}$ is the number of non-blank symbols among $\hat{y}_1, \ldots, \hat{y}_{i-1}$, and $y_0$ denotes the start of the sequence. By explicitly conditioning the current output on the history, RNN-T outperforms CTC when no external language model is present [6, 7]. RNN-T can be implemented in the encoder-decoder framework, as illustrated in Fig. 1. The encoder encodes the input acoustic sequence $x = (x_1, \ldots, x_T)$ to $h = (h_1, \ldots, h_{T'})$ with potential subsampling $T' \leq T$. The decoder contains a predictor that encodes the previous non-blank output symbol $y_{u-1}$ for the logits $z_{t,u}$ to condition on. It is worth noting that the input to the predictor is updated only when the most probable symbol is non-blank, so the conditioning encoding changes only when non-blank output symbols are observed. From the illustration, we see that RNN-T incorporates a language model of output symbols internally in the decoder.
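The rule that the predictor input changes only on non-blank outputs can be sketched with a greedy decoding loop. This is a simplification for illustration: the experiments in this paper use beam search, not greedy search. The function names, the blank index `0`, the `max_symbols_per_frame` cap and the toy predictor and joiner below are all assumptions made for this sketch.

```python
BLANK = 0        # hypothetical index of the blank label
VOCAB_SIZE = 4   # hypothetical vocabulary size, including blank


def greedy_transducer_decode(enc_out, predictor, joiner, max_symbols_per_frame=5):
    """Greedy search in which the predictor advances only on non-blank output."""
    hyp = []
    pred_out, state = predictor(None, None)  # encoding of the empty history
    for h_t in enc_out:
        for _ in range(max_symbols_per_frame):
            logits = joiner(h_t, pred_out)
            k = max(range(len(logits)), key=logits.__getitem__)
            if k == BLANK:
                break  # go to the next frame; predictor input stays the same
            hyp.append(k)
            pred_out, state = predictor(k, state)  # update on non-blank only
    return hyp


def toy_predictor(prev_symbol, state):
    # Toy stand-in: the "encoding" is just the previous non-blank symbol.
    return prev_symbol, state


def toy_joiner(h_t, pred_out):
    # Toy stand-in: emit the symbol stored in h_t once, then emit blank.
    logits = [0.0] * VOCAB_SIZE
    target = h_t if (h_t != BLANK and h_t != pred_out) else BLANK
    logits[target] = 1.0
    return logits


# Each toy encoder frame holds the symbol it should trigger (0 means none).
print(greedy_transducer_decode([0, 2, 0, 3], toy_predictor, toy_joiner))
```

In a real system the predictor and joiner would be neural networks; the control flow around them is the part this sketch is meant to show.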

Figure 1: Neural Transducer.

There are many architectures that can be used as encoders and predictors. The function of these blocks is to take a sequence and find a higher-order representation. Recurrent neural networks (RNNs) such as LSTM [13] have been used successfully for this purpose. In this paper, we explore Transformer [14, 15] as an alternative for sequence encoding in RNN-T. Since Transformer is not recurrent in nature, we refer to the architecture illustrated in Fig. 1 simply as "neural transducer" [18] for the rest of the paper.

3 Transformer

The attention mechanism [19] is one of the core ideas of Transformer [15]. It was proposed to model correlation between contextual signals and produced state-of-the-art performance in many domains including machine translation [15] and natural language processing [14]. Similar to RNNs, the attention mechanism aims to encode the input sequence to a higher-level representation by formulating the encoding function as the relationship between queries $Q$, keys $K$ and values $V$ and describing the similarities between them with:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $Q \in \mathbb{R}^{T \times d_k}$, $K \in \mathbb{R}^{T \times d_k}$ and $V \in \mathbb{R}^{T \times d_v}$. This mechanism becomes "self-attention" when $Q = K = V = x$. A self-attention block encodes the input $x$ to a higher-level representation $h$, just like RNNs but without recurrence. Compared with RNNs where $h_t$ depends on $h_{t-1}$, self-attention has no recurrent connections between time steps in the encoding $h$, so it can generate the encoding efficiently in parallel. In addition, compared with RNNs where contexts are condensed into fixed-length states for the next time step to condition on, self-attention "pays attention" to all available contexts to better model the context within the input sequence.
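As a rough illustration, self-attention can be written in a few lines of plain Python, assuming the standard scaled dot-product form in which similarity scores are divided by the square root of the key dimension before the softmax. The helper names and the toy input are chosen for this sketch only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = attention(x, x, x)  # self-attention: queries, keys and values are all x
for row in h:
    print([round(v, 3) for v in row])
```

Every output row is computed independently of the others, which is what allows the encoding to be produced in parallel, unlike a recurrent state update.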

3.1 Multi-Head Self-Attention

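The attention mechanism can be further extended to multi-head attention, in which 1) dimensions of input sequences are split into multiple chunks with multiple projections, 2) each chunk goes through an independent attention mechanism and 3) encodings from each chunk are concatenated and then projected to produce the output encodings, as described with:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_H)\, W^O, \qquad \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $H$ is the number of heads, $d_{\text{model}}$ is the dimension of the input sequence, $Q, K, V \in \mathbb{R}^{T \times d_{\text{model}}}$, $\text{head}_i$ is the encoding generated by head $i$, $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{H d_v \times d_{\text{model}}}$. Multi-head attention integrates encodings generated from multiple subspaces to higher-dimensional representations [15].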

3.2 Transformer Encoder

The Transformer [14] is also a sequence-to-sequence model. The architecture of the Transformer encoder contains three main blocks: 1) attention block, 2) feed-forward block and 3) layer norm [20], as shown in Fig. 2(a). The attention block contains the core multi-head self-attention component. The feed-forward block projects the input dimension $d_{\text{model}}$ to another feature space of dimension $d_{\text{ff}}$ and then back to $d_{\text{model}}$ (usually $d_{\text{ff}} > d_{\text{model}}$) for learning feature representations. The final layer normalization and other additional components, including layer norm and dropout in the first two blocks, are added to stabilize model training and prevent overfitting. Furthermore, we use VGGNets to incorporate positional information into the Transformer, as illustrated in Fig. 2(b). More details are given in section 4.1.

Figure 2: (a) Transformer Encoder (b) VGG-Transformer.

4 Transformer-Transducer

Given the success of the Transformer, we explore the options of applying Transformer in neural transducer. For further improvement, we propose 1) using causal convolution for context modeling and frame rate reduction and 2) using truncated self-attention to reduce the computational complexity and enable streaming for Transformer.

4.1 Context Modeling with Causal Convolution

Transformer relies on multi-head self-attention to model the contextual information. However, attention mechanism is non-recurrent and non-convolutive, therefore risks losing the order or positional information in the input sequence [21, 15], which could harm the performance especially for the case of language modeling. To incorporate the positional information into Transformer, a simple way is adding positional encoding [15] but convolutional approaches [8] demonstrated superior performance. In this paper we adopt the convolutional approach in [8] with modification.

Figure 3: (a) Causal Convolution (b) VGGNet

Convolution networks model contexts by using kernels to convolve blocks of features. If we treat the input sequence (for example, acoustic features) as a two-dimensional image $x \in \mathbb{R}^{T \times d}$, in common practice a kernel spanning $k$ frames in time (with $k$ odd) is centered on the current frame, so the convolution covers frames $x_{t-(k-1)/2}$ to $x_{t+(k-1)/2}$ to produce the convolved output $y_t$. The convolution therefore needs "future" information to generate the encoding for the current time step. For acoustic modeling this introduces additional look-ahead and latency, while for language modeling using future information is impractical since the next symbol is unknown during inference.

To prevent future information from leaking into the computation at the current time step, we use causal convolution in which all contexts required are pushed to the history, as illustrated in Fig. 3(a). With causal convolution, a kernel spanning $k$ frames in time covers frames $x_{t-k+1}$ to $x_t$ to produce the convolved output $y_t$, ensuring the convolution is purely "causal". Similar to [8], we also adopt the VGGNet [17] structure, as illustrated in Fig. 3(b), where two two-dimensional convolution layers are stacked sequentially followed by a two-dimensional max-pooling layer. We use layers of the causal VGGNet to incorporate positional information and propagate it to the succeeding Transformer encoder layers. We refer to this network as "VGG-Transformer" and illustrate the architecture used for the encoder in neural transducer in Fig. 2(b), where the first two VGGNet layers are used to incorporate positional information and reduce the frame rate for efficient inference, followed by a linear layer for dimension reduction and multiple Transformer encoder layers for generating higher-level representations.
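The difference between centered and causal convolution can be seen in a one-dimensional Python sketch along the time axis. This is a simplification: the VGGNet described here uses two-dimensional kernels, and the zero padding, the kernel weights and the function names below are assumptions made for illustration.

```python
def conv1d_centered(x, kernel):
    """Centered convolution: output t sees frames on both sides of t."""
    k = len(kernel)
    half = k // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def conv1d_causal(x, kernel):
    """Causal convolution: output t sees only frames t-k+1 .. t."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # all padding goes on the history side
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


kernel = [1.0, 1.0, 1.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_future_changed = [1.0, 2.0, 3.0, 100.0, 5.0]  # only frame 3 differs

# Outputs for frames 0..2 must not depend on frame 3 if the layer is causal.
print(conv1d_causal(x, kernel)[:3], conv1d_causal(x_future_changed, kernel)[:3])
print(conv1d_centered(x, kernel)[:3], conv1d_centered(x_future_changed, kernel)[:3])
```

Changing a later frame leaves the earlier causal outputs untouched, while the centered output at frame 2 changes because its window reaches one frame into the future.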

4.2 Truncated Self-Attention

Unlimited self-attention attends to the whole input sequence and poses two issues: 1) streaming inference is disabled and 2) computational complexity is high. As illustrated in Fig. 4(a), for unlimited self-attention, the output $h_t$ at time step $t$ depends on the entire input sequence $x = (x_1, \ldots, x_T)$, meaning inference can only begin after the final length $T$ is known. In addition, $h_t$ depends on the $T$ similarity pairs $(x_t, x_1), \ldots, (x_t, x_T)$, giving $O(T^2)$ complexity for computing $h = (h_1, \ldots, h_T)$. These issues are critical for self-attention to work in scenarios demanding low latency and low computation, such as on-device speech recognition [6].

Figure 4: Self-Attention: (a) Unlimited (b) Truncated.

To reduce both the latency and computational cost, we replace unlimited self-attention with truncated self-attention, as illustrated in Fig. 4(b). Similar to the time-delay neural network (TDNN) [22, 23], we limit the contexts available for self-attention so that the output $h_t$ at time $t$ only depends on $(x_{t-L}, \ldots, x_{t+R})$, where $L$ and $R$ are the left and right contexts. Compared with unlimited self-attention, truncated self-attention is both streamable and computationally efficient. The look-ahead is the right context $R$ and the computational complexity reduces from $O(T^2)$ to $O(T)$. However, it also comes with potential performance degradation, which is investigated further in the experiments.
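The change in complexity can be checked by counting how many similarity pairs each scheme evaluates. The sketch below uses 0-based positions and hypothetical function names; the context sizes of 32 frames to the left and 4 to the right are taken from one of the configurations evaluated later in the paper.

```python
def attended_positions(t, T, left=None, right=None):
    """Positions that output t may attend to; None means unlimited context."""
    lo = 0 if left is None else max(0, t - left)
    hi = T - 1 if right is None else min(T - 1, t + right)
    return range(lo, hi + 1)


def num_similarity_pairs(T, left=None, right=None):
    """Total number of (query, key) pairs evaluated for a length-T input."""
    return sum(len(attended_positions(t, T, left, right)) for t in range(T))


for T in (100, 200, 400):
    unlimited = num_similarity_pairs(T)
    truncated = num_similarity_pairs(T, left=32, right=4)
    print(T, unlimited, truncated)
```

Doubling the input length quadruples the unlimited count but only roughly doubles the truncated count, since each position attends to at most 32 + 4 + 1 = 37 frames.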

5 Experiments

5.1 Corpus and Setup

We use the publicly available, widely used LibriSpeech corpus [24] for experiments. LibriSpeech comes with 960 hours of read speech data for training, and 4 sets {dev, test}-{clean,other} for fine-tuning and evaluation. The clean sets contain high-quality utterances whereas the other sets are more acoustically challenging. We use the dev-{clean,other} sets to fine-tune parameters for beam search and report results on the test-{clean,other} sets. We extract 80-dimensional log Mel-filter bank features every 10ms as acoustic features and normalize them with the global mean computed from the training set. We also apply SpecAugment [25] with policy "LD" for data distortion. A sentence piece model [26] with 256 symbols is trained from transcriptions of the training set and serves as the output symbols. For each model, we use a learnable embedding layer to convert symbols to 128-dimensional vectors just before the predictor. The experiments are done using PyTorch [27] and Fairseq [28]. All models are trained on 32 GPUs with the distributed data parallel (DDP) mechanism. We use standard beam search with a beam size of 10 for decoding. The decoded sentence pieces are then concatenated into hypotheses to be compared with the ground truth transcription for word error rate (WER) evaluation.

5.2 Model Architectures and Details

We compare architectures with roughly the same total number of parameters. For the encoder in neural transducer, we evaluate options including 1) BLSTM 4x640: bidirectional LSTM with 4 layers of 640 hidden units in each direction, 2) LSTM 5x1024: LSTM with 5 layers of 1024 hidden units and 3) Transformer 12x: VGG-Transformer with 2 layers of VGGNets and 12 Transformer encoder layers. Each VGGNet layer contains 2 layers of two-dimensional convolution with 64 kernels of size 3x3. Each Transformer encoder layer takes 512-dimensional inputs, with 8 heads for multi-head self-attention and 2048 as the feed-forward dimension. For efficient inference, all encoders generate output encodings every 60ms (10ms x 3 x 2). For LSTM/BLSTM this is achieved with low frame rate [29], in which every three consecutive frames are stacked and subsampled to form a new frame, combined with subsampling by a factor of 2 applied to the output of the second LSTM/BLSTM layer [6]. For VGG-Transformer we set the max-pooling on the time dimension to 3 for the 1st VGGNet and 2 for the 2nd VGGNet, as illustrated in Fig. 2(b).

For the predictor in neural transducer, we evaluate options including 1) LSTM 2x700: LSTM with 2 layers of 700 hidden units and 2) Transformer 6x: VGG-Transformer with 1 layer of VGGNet and 6 Transformer encoder layers. Both the VGGNet layer and the Transformer encoder layers share the same configuration as in the encoder case, with the exception that max-pooling is removed in the VGGNet. In addition, the right context for these Transformer encoders is 0 to prevent future information leakage.

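For the joiner in neural transducer, outputs from the encoder and the predictor are joined with:

$$z_{t,u} = W_o \, \phi(W_e h_t + W_p g_u)$$

where $W_e$ and $W_p$ project the encoder output $h_t$ and the predictor output $g_u$ to a common feature space of dimension $d_j$, $\phi$ is an activation function and $W_o$ generates the logits $z_{t,u}$. The same joiner configuration is used consistently for all experiments.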
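A joiner of this kind can be sketched in plain Python. The sketch assumes that the two projected vectors are combined by addition, that biases are omitted and that the activation is tanh; none of these choices is confirmed by the text here, and all names and toy dimensions are hypothetical.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def joiner(h_t, g_u, W_enc, W_pred, W_out, phi=math.tanh):
    """Project both inputs to a common space, combine, and produce logits."""
    enc_proj = matvec(W_enc, h_t)    # encoder output -> joint space
    pred_proj = matvec(W_pred, g_u)  # predictor output -> joint space
    joint = [phi(a + b) for a, b in zip(enc_proj, pred_proj)]  # assumed additive
    return matvec(W_out, joint)      # joint space -> one logit per symbol


# Toy sizes: encoder dim 3, predictor dim 2, joint dim 2, 4 output symbols.
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_pred = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

z = joiner([1.0, 0.0, 0.0], [0.0, 0.0], W_enc, W_pred, W_out)
print(z)
```

The logits are computed for one pair of an encoder frame and a predictor state at a time, so a full lattice of logits has one vector per such pair.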

5.3 Results on Transformer/LSTM Combinations

We experimented with combinations of Transformer and LSTM networks for neural transducer. The results are summarized in Table 1. For the encoder, we use LSTM 5x1024 as the streamable baseline, BLSTM 4x640 as the non-streamable baseline and Transformer 12x as the novel replacement for the two. For the predictor, we use LSTM 2x700 and Transformer 6x described in section 5.2 as the two options.

| | Encoder | Predictor | # Params | WER test-clean (%) | WER test-other (%) |
|---|---|---|---|---|---|
| (1) | LSTM 5x1024 | LSTM 2x700 | 50.5 M | 12.31 | 23.16 |
| (2) | BLSTM 4x640 | LSTM 2x700 | 48.3 M | 6.85 | 16.90 |
| (3) | Transformer 12x | LSTM 2x700 | 45.7 M | 6.08 | 13.89 |
| (4) | LSTM 5x1024 | Transformer 6x | 67.1 M | 15.76 | 26.67 |
| (5) | BLSTM 4x640 | Transformer 6x | 64.9 M | 7.20 | 16.67 |
| (6) | Transformer 12x | Transformer 6x | 62.3 M | 7.11 | 15.62 |

Table 1: Neural Transducer with (B)LSTM / Transformer.

From Table 1, given the same configuration for the predictor, we see that it is difficult for the LSTM network as the encoder to perform well under the constraint on the number of parameters. The bidirectional LSTM (BLSTM) network, however, recovers the performance and remains compact in size at the cost of being non-streamable. The VGG-Transformer with unlimited self-attention outperforms BLSTM significantly as the encoder and is also non-streamable. For the predictor, for all encoder configurations we see the LSTM network still gives better results than the VGG-Transformer and is smaller in size. As a result we keep LSTM 2x700 as the predictor for the experiments in section 5.4. It is worth noting that the VGG-Transformer loses the advantage of parallel computation as the predictor, since during beam search each hypothesis is extended by only one token per search step.

5.4 Results on Truncated Self-Attention

We evaluated the impact of the contexts in truncated self-attention on recognition accuracy for the VGG-Transformer. As summarized in section 5.3, we find the VGG-Transformer performs well as the encoder but not as the predictor. Therefore we keep LSTM 2x700 as the predictor for the experiments on truncated self-attention. The results are summarized in Table 2, where the left context $L$ and right context $R$ are applied for truncated self-attention in each layer of the VGG-Transformer and aggregate through layers.

| | Model Architecture | L | R | WER test-clean (%) | WER test-other (%) |
|---|---|---|---|---|---|
| (1) | LSTM 5x1024 + LSTM 2x700 | inf | 0 | 12.31 | 23.16 |
| (2) | BLSTM 4x640 + LSTM 2x700 | inf | inf | 6.85 | 16.90 |
| (3) | Transformer 12x + LSTM 2x700 | inf | inf | 6.08 | 13.89 |
| (4) | Transformer 12x + LSTM 2x700 | inf | 0 | 12.32 | 23.08 |
| (5) | Transformer 12x + LSTM 2x700 | inf | 1 | 6.99 | 16.88 |
| (6) | Transformer 12x + LSTM 2x700 | inf | 2 | 6.47 | 15.79 |
| (7) | Transformer 12x + LSTM 2x700 | inf | 4 | 6.14 | 14.86 |
| (8) | Transformer 12x + LSTM 2x700 | inf | 8 | 5.99 | 14.17 |
| (9) | Transformer 12x + LSTM 2x700 | 4 | 4 | 6.84 | 17.38 |
| (10) | Transformer 12x + LSTM 2x700 | 8 | 4 | 6.69 | 16.79 |
| (11) | Transformer 12x + LSTM 2x700 | 16 | 4 | 6.57 | 15.92 |
| (12) | Transformer 12x + LSTM 2x700 | 32 | 4 | 6.37 | 15.30 |

Table 2: Transformer with Truncated Self-Attention.

Since the right context introduces algorithmic latency and has a major impact on recognition accuracy, to find optimal parameters for truncated self-attention we search for the right context $R$ first while keeping the left context $L$ unlimited, and then reduce the left context $L$ given the selected right context $R$. From Table 2 we see both $L$ and $R$ have a significant impact on performance, especially when $R = 0$, where the VGG-Transformer becomes purely causal. However, as $R$ increases, the WERs gradually recover and come close to the case of unlimited self-attention when $R = 8$. With limited right context $R$, the VGG-Transformer becomes streamable but is still $O(T^2)$ in computational complexity due to the unlimited left context $L$. To keep reasonable performance while minimizing latency, we selected right context $R = 4$ and evaluated different left contexts $L$. Similar to the right context $R$, the WER is also sensitive to the left context $L$. With $L \geq 8$, the VGG-Transformer with truncated self-attention gives better WER than both LSTM/BLSTM baselines. With $L = 32$ we only lose 4.7% on test-clean and 10.1% on test-other relatively compared with the case of unlimited self-attention, but the system becomes streamable and efficient with computational complexity $O(T)$.
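The relative degradation quoted above can be recomputed from the WERs in Table 2, as in the short Python sketch below. The look-ahead estimate at the end is not stated in the paper: it assumes that the per-layer right context simply adds up across the 12 Transformer encoder layers, that each step is one 60ms encoder frame, and that the VGGNet front end contributes nothing.

```python
def relative_increase_pct(new_wer, ref_wer):
    """Relative WER increase of new_wer over ref_wer, in percent."""
    return 100.0 * (new_wer - ref_wer) / ref_wer


# WERs (%) from Table 2: row (3) is unlimited, row (12) has L=32 and R=4.
unlimited = {"test-clean": 6.08, "test-other": 13.89}
truncated = {"test-clean": 6.37, "test-other": 15.30}

relative = {
    name: relative_increase_pct(truncated[name], unlimited[name])
    for name in unlimited
}
for name, value in relative.items():
    print(f"{name}: +{value:.2f}% relative WER")

# Assumed aggregation: right context per layer x layers x frame duration.
RIGHT_CONTEXT_PER_LAYER = 4
NUM_TRANSFORMER_LAYERS = 12
ENCODER_FRAME_MS = 60
lookahead_ms = RIGHT_CONTEXT_PER_LAYER * NUM_TRANSFORMER_LAYERS * ENCODER_FRAME_MS
print(f"approximate aggregated look-ahead: {lookahead_ms} ms")
```

The recomputed values come out close to the quoted 4.7% and 10.1%, and the estimate suggests that the aggregated look-ahead, not the per-layer value, is what determines end-to-end latency.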

6 Conclusion

In this paper, we explored options for using Transformer networks in neural transducer for end-to-end speech recognition. The Transformer network uses self-attention for sequence modeling and can be computed in parallel. With causal convolution and truncated self-attention, the neural transducer with the proposed VGG-Transformer as the encoder achieved a WER of 6.37% on the test-clean set and 15.30% on the test-other set of the public LibriSpeech corpus, with a small footprint of 45.7M parameters for the entire system. The proposed Transformer-Transducer is accurate, streamable, compact and efficient, and is therefore a promising option for resource-limited scenarios such as on-device speech recognition.