Exploring Neural Transducers for End-to-End Speech Recognition

07/24/2017 ∙ by Eric Battenberg, et al.

In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5'00 benchmark. On our internal diverse dataset, these trends continue: RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models, both when all encoder layers are forward-only and when encoders downsample the input representation aggressively.




1 Introduction

In recent years, deep neural networks have advanced the state-of-the-art on large scale automatic speech recognition (ASR) tasks [24, 28, 2]. Deep neural networks can not only extract acoustic features, which are used as inputs to traditional ASR models like Hidden Markov Models (HMM) [24, 28], but also act as sequence transducers, which results in end-to-end neural ASR systems [2, 6].

One major challenge of sequence transduction is that the input and output sequences differ in length, and both lengths are variable. As a result, a speech transducer has to learn both the alignment and the mapping between acoustic inputs and linguistic outputs simultaneously. Several neural network-based speech models have been proposed in recent years to solve this challenge. In this work, we focus on understanding the differences between these transduction mechanisms. Specifically, we compare three transduction models: Connectionist Temporal Classification (CTC) [12], RNN-Transducer [11], and sequence-to-sequence (Seq2Seq) with attention [5, 3]. For the ASR task, these models differ mainly in the assumptions they make along three axes:

  • Conditional independence between predictions at different time steps, given audio. This is not a reasonable assumption for the ASR task. CTC makes this assumption, but RNN-Transducers and Attention models do not.

  • The alignment between input and output units is monotonic. This is a reasonable assumption for the ASR task, which enables models to do streaming transcription. CTC and RNN-Transducers make this assumption, but Attention models do not. (Here we focus on the vanilla Seq2Seq models with full attention [6, 3]; there have been some recent efforts to enforce local and monotonic attention, but they typically result in a loss in performance.)

  • Hard vs Soft alignments. CTC and RNN-Transducer models explicitly treat alignment between input and output as a latent variable and marginalize over all possible hard alignments while the attention mechanism models a soft alignment between each output step and every input step. It is unclear if this matters to the ASR task.

There are no conclusive studies comparing these architectures at scale. In this work, we train all three models on the same datasets using the same methodology, in order to perform a fair comparison. Models which do not assume conditional independence between predictions given the full input (viz, RNN-Transducers, Attention) are able to learn an implicit language model from the training corpus and optimize WER more directly than other models. We find that they therefore perform quite competitively, even outperforming CTC + LM models without the use of an external language model. Among them, RNN-Transducers have the simplest decoding procedure and fewer hyper-parameters to tune.

In the following sections, we will first revisit the three models, and describe interesting specific details of our implementations. Then, in section 3, we present our results on the Hub5’00 benchmark (which uses hours of training data), and our own internal dataset (of hours). In section 4 we study how well they train when using only forward-only layers, and when we do excessive pooling in the encoder layers on the WSJ dataset by controlling the number of parameters in each model. Section 6 presents related work and Section 7 summarizes the key takeaways and presents the scope of future work.

Figure 4: (a) CTC, (b) RNN-Transducer, (c) Attention. Illustration of probability transitions of three transducers on an utterance of length 5 and labelled as “CAT”. The node at $(t, u)$, with $t$ on the horizontal axis and $u$ on the vertical axis, represents the probability of having output the first $u$ elements of the output sequence by point $t$ in the transcription sequence. The vertical arrow represents predicting multiple characters at one time step (not allowed for CTC). The horizontal arrow represents predicting repeating characters (for CTC) or predicting nothing (for RNN-Transducer). The solid arrows represent hard alignments (for CTC and RNN-Transducer) and soft ones (for Attention). Note that in CTC and RNN-Transducer, states can only move towards the top right, one step at a time, while in Attention, all input frames could potentially be attended to in any decoding step.

2 Neural Speech Transducers

A speech transducer is typically composed of an encoder (also known as an acoustic model), which transforms the acoustic inputs into high level representations, and a decoder, which produces linguistic outputs (i.e., characters or words) from the encoded representations. The challenge is that the input and output sequences have variable (and different) lengths, and alignments between them are usually unavailable. So neural transducers have to learn both the classification from acoustic features to linguistic predictions and the alignment between them. Transducer models differ in the formulations of the classifier and the aligner.

More formally, given the input sequence $x = (x_1, \ldots, x_T)$ of length $T$, and the output sequence $y = (y_1, \ldots, y_U)$ of length $U$, with each $y_u$ being a $V$-dimensional one-hot vector, transducers model the conditional distribution $p(y \mid x)$. The encoder maps the input $x$ into a high level representation $h = (h_1, \ldots, h_{T'})$, which can be shorter than the input ($T' \le T$) with time-scale downsampling. The encoder can be built with feed-forward neural networks (DNNs), recurrent neural networks (RNNs), or convolutional neural networks (CNNs) [10]. The decoder defines the alignment(s) and the mapping from $h$ to $y$.

2.1 CTC

CTC [12, 2] computes the conditional probability by marginalizing over all possible alignments, and it assumes conditional independence between output predictions at different time steps given the aligned inputs. An extra ‘blank’ label, which can be interpreted as no label, is introduced to map the encoder output $h$ and the output sequence $y$ to the same length, i.e., an alignment (path) $a$ is obtained by inserting ($T' - U$) blanks into $y$, where $T'$ and $U$ are the lengths of $h$ and $y$. A mapping $\mathcal{B}: a \mapsto y$ is defined between $a$ and $y$, which can be done by removing all blanks and repeating letters in $a$. The conditional probability can be efficiently calculated using a forward-backward dynamic-programming algorithm, as detailed in [12]. Note that the alignments are both local and monotonic.

$$p(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} p(a \mid h) = \sum_{a \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T'} p(a_t \mid h_t), \qquad p(a_t \mid h_t) = \mathrm{softmax}(a_t, h_t)$$

where we use the conventional definition of softmax. The CTC output could be decoded by greedily picking the most likely label at each time step (strictly speaking, this finds the most likely alignment $a$, not the most likely output $y$, but we find that for a fully trained model $p(y \mid x)$ is dominated by a single alignment). To make beam search effective, the conditional independence assumption is artificially broken by the inclusion of a language model, and decoding is then the task of finding the output sequence that maximizes the combination of the CTC score and the language model score.

This decoding is approximate, and is performed using beam search, typically with a large beam or lattice [15, 19]. The decoding objective introduces a discrepancy between how these models are trained and tested. To address this, models could be further fine-tuned with a loss function that also incorporates language model information, like sMBR [24], but the principal issue is still the absence of dependence between predictions.
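The collapsing map and greedy decoding described above can be illustrated with a minimal sketch. The blank symbol `"-"`, the function names, and the toy probabilities are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank label


def collapse(alignment):
    """Map an alignment (path) to an output sequence.

    Repeated labels are merged first, then blanks are dropped, so a blank
    between two identical letters keeps both of them.
    """
    merged = [label for label, _ in groupby(alignment)]
    return "".join(label for label in merged if label != BLANK)


def greedy_decode(frame_probs, labels):
    """Pick the most likely label at each time step, then collapse the path."""
    path = [labels[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    return collapse(path)


# Toy posteriors over ["-", "C", "A", "T"] for an utterance of 5 frames.
labels = [BLANK, "C", "A", "T"]
frame_probs = [
    [0.10, 0.80, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]
decoded = greedy_decode(frame_probs, labels)  # path "C-AAT" collapses to "CAT"
```

Because each frame is decoded independently, this procedure returns the single best path rather than the best output sequence, which is the caveat noted in the text.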

2.2 RNN-Transducer

The RNN-Transducer [11, 14] also marginalizes over all possible alignments, as CTC does, while extending CTC by additionally modeling the dependencies between outputs at different time steps. More specifically, the prediction of $a_t$ at time step $t$ depends not only on the aligned input $h_t$ but also on the previous predictions $y_{1:u_{t-1}}$:

$$p(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T'} p(a_t \mid h_t, y_{1:u_{t-1}})$$

where $u_t$ denotes the output time step aligned to the input time step $t$. An extra recurrent network is used to help determine $a_t$ by predicting decoder logits from the previous predictions, and the conditional distribution at time $t$ is computed by normalizing the sum of the encoder logits and the decoder logits. The function involved could be any parametric function; we use the same choice as in [11]. As in CTC, the marginalized alignments are local and monotonic, and the likelihood of the label sequence can be calculated efficiently using dynamic programming. Decoding uses beam search as in [11], but we do not use length normalization as originally suggested, since we do not find it necessary.
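The dynamic program mentioned above can be sketched as a forward recursion over the (input step, output step) lattice, in the style of the transducer formulation of [11]. This is a minimal pure-Python sketch under stated assumptions: the function names, the choice of blank index 0, and the convention that `g[u]` holds the decoder logits after `u` labels have been emitted are all illustrative, not the paper's code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def logaddexp(a, b):
    """log(exp(a) + exp(b)) that tolerates -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_log_likelihood(f, g, target, blank=0):
    """Log-likelihood of `target` summed over all monotonic alignments.

    f: T encoder logit vectors, one per input step.
    g: U+1 decoder logit vectors; g[u] is conditioned on the first u labels.
    target: list of U label ids (none equal to `blank`).
    """
    T, U = len(f), len(target)
    # log p(k | t, u): normalize the sum of encoder and decoder logits.
    logp = [[log_softmax([a + b for a, b in zip(f[t], g[u])])
             for u in range(U + 1)] for t in range(T)]
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # reached by emitting blank at (t-1, u): consume one input step
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t - 1][u] + logp[t - 1][u][blank])
            if u > 0:  # reached by emitting target[u-1] at (t, u-1): stay on the same input step
                alpha[t][u] = logaddexp(alpha[t][u], alpha[t][u - 1] + logp[t][u - 1][target[u - 1]])
    # A final blank terminates the sequence from the top-right node.
    return alpha[T - 1][U] + logp[T - 1][U][blank]
```

The second branch of the recursion is what lets the model emit several labels for one input step, which CTC cannot do. With all logits set to zero, every alignment is equally likely, so the result reduces to the number of lattice paths times the per-path probability, which gives a quick sanity check.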

2.3 Attention Model

The attention model [8, 3, 5] aligns the inputs and outputs using the attention mechanism. Like the RNN-Transducer, the attention model removes the conditional independence assumption in the label sequence that CTC makes. Unlike CTC and the RNN-Transducer, however, it does not assume monotonic alignment, nor does it explicitly marginalize over alignments. It computes $p(y \mid x)$ by picking a soft alignment between each output step and every input step:

$$p(y \mid x) = \prod_{u=1}^{U} p(y_u \mid c_u, y_{1:u-1})$$

where $c_u$ is the context for decoding time step $u$, which is computed as the sum of the entire encoder output $h$ weighted by $\alpha$ (known as attention):

$$c_u = \sum_{t=1}^{T'} \alpha_{u,t} \, h_t$$

The attention weights depend on the hidden state of the decoder at each decoding step. There exist different ways [6, 3] to compute $\alpha$. We used a location-aware hybrid attention mechanism in our experiments.

The attention mechanism allows the model to attend anywhere in the input sequence at each time, and thus the alignments can be non-local and non-monotonic. However, this excessive generality comes with a more complicated decoding for the ASR task, since these models can both terminate prematurely and never terminate by repeatedly attending over the same encoding steps. Therefore, the decoding task finds the argmax of a score (Equation 16) that combines the length-normalized log-likelihood of the output with a coverage term, where the length normalization is controlled by a hyperparameter [27]. The coverage term “cov” encourages the model to attend over all encoder time steps, and stops rewarding repeated attendance over the same time steps. The coverage term addresses both short and infinitely long decoding.
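The soft alignment for a single decoding step can be sketched as a softmax over per-frame scores followed by a weighted sum of encoder states. The scoring function itself is left out: the `scores` argument is assumed to come from whatever attention mechanism is in use, and all names here are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attention_context(scores, encoder_states):
    """Return (attention weights, context vector) for one decoding step.

    scores: one unnormalized score per encoder time step.
    encoder_states: one vector (list of floats) per encoder time step.
    """
    alpha = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alpha, encoder_states))
               for d in range(dim)]
    return alpha, context


# Equal scores give a uniform soft alignment over both encoder steps.
alpha, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Nothing in this computation constrains where the weights fall from one decoding step to the next, which is why the alignment can be non-monotonic and why decoding needs the extra safeguards discussed above.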

3 Performance at Scale

In this section, we compare the performance of the models on a public benchmark as well as our own internal dataset.

The promise of end-to-end models for ASR was the simplification of the training and inference pipelines of speech systems. End-to-end CTC models only simplified the training process, but inference still involves decoding with massive language models, which often requires teams to build and maintain complicated decoders. Since attention and RNN-Transducers implicitly learn a language model from the speech training corpus, rescoring or decoding using language models trained solely from the text of the speech corpus does not contribute to improvements in WER (Table 1). When an external LM trained on more data is available, simply rescoring the final beam (typically small, between 32 and 256) recovers all the performance difference (Table 3). The decoding and beam search are therefore simplified: they can be expressed as neural network operations and need not support massive language models. This trend is already seen in neural machine translation, where state-of-the-art NMT systems do not typically use an external language model.
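Rescoring the final beam amounts to re-ranking a short list of candidates, which can be sketched as follows. The interpolation weight `lm_weight`, the function name, and the toy scores are assumptions for illustration; the paper does not state how the two scores are combined.

```python
def rescore_beam(beam, lm_log_prob, lm_weight=0.5):
    """Re-rank beam candidates with an external language model.

    beam: list of (text, model_log_prob) pairs from beam search.
    lm_log_prob: callable returning the LM log-probability of a text.
    lm_weight: hypothetical interpolation weight, to be tuned on a dev set.
    """
    scored = [(text, model_lp + lm_weight * lm_log_prob(text))
              for text, model_lp in beam]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up scores: the acoustic model slightly prefers the wrong spelling,
# while the toy LM prefers the correct phrase.
beam = [
    ("play the black eye piece songs", -1.0),
    ("play the black eyed peas songs", -1.2),
]
toy_lm = {
    "play the black eye piece songs": -9.0,
    "play the black eyed peas songs": -6.0,
}
best_text, best_score = rescore_beam(beam, toy_lm.__getitem__)[0]
```

Since only the few dozen candidates left in the beam are scored, the language model is queried once per candidate and stays entirely outside the search loop.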


3.1 Hub5’00 results

The performance of the models on the Hub5’00 benchmark is presented in Table 1 along with other published results on in-domain data. All of the models in Table 1 use the standard language model that is paired with the dataset, except for the rows marked “NO LM”. Without using any language model, both the attention and RNN-Transducer models outperform the CTC model trained on the same corpus, and are highly competitive with the best results on this dataset. Since the LM is also trained on the same training corpus, rescoring with the LM has little effect on attention and RNN-Transducer models.

We found that beam search in attention worked best when using only length normalization (see Equation 16). However, as the distribution of errors in Table 2 shows, the RNN-Transducer has no obvious problems with premature termination, as the number of deletions is very small even though there is no length normalization. Attention and RNN-Transducer both use a beam width of 32.
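The error distribution reported in Table 2 splits WER into substitutions, insertions and deletions, and each row sums to the WER (for example, 5.5 + 2.5 + 1.0 = 9.0 for CTC). A minimal sketch of how such a breakdown can be computed for one utterance with word-level edit distance is given below. The function name and tie-breaking rule are illustrative assumptions, and this is not the scoring tool used in the paper.

```python
def wer_breakdown(reference, hypothesis):
    """Word-level edit distance with counts of each error type.

    Returns a dict with subs, ins, dels and WER (percent of reference words).
    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (edits, subs, ins, dels) for ref[:i] against hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)  # all reference words deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)  # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                candidates = [(e, s, n, d)]          # match, no new error
            else:
                candidates = [(e + 1, s + 1, n, d)]  # substitution
            e, s, n, d = cost[i][j - 1]
            candidates.append((e + 1, s, n + 1, d))  # insertion
            e, s, n, d = cost[i - 1][j]
            candidates.append((e + 1, s, n, d + 1))  # deletion
            cost[i][j] = min(candidates)  # fewest edits; ties broken by tuple order
    edits, subs, ins, dels = cost[len(ref)][len(hyp)]
    return {"subs": subs, "ins": ins, "dels": dels,
            "wer": 100.0 * edits / len(ref)}


result = wer_breakdown("play the black eyed peas songs",
                       "play the black eye piece songs")
```

A hypothesis that stops early shows up as deletions, so a low deletion count is the signal used above to argue that premature termination is not a problem for the RNN-Transducer.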

| Source | Architecture | SWBD | CH |
|---|---|---|---|
| Published | Iterated-CTC [29] | 11.3 | 18.7 |
| Published | BLSTM + LF MMI [21] | 8.5 | 15.3 |
| Published | LACE + LF MMI [28] (see note) | 8.3 | 14.8 |
| Published | Dilated convolutions [25] | 7.7 | 14.5 |
| Published | CTC + Gram-CTC [17] | 7.3 | 14.7 |
| Published | BLSTM + Feature fusion [23] | 7.2 | 12.7 |
| Ours | CTC [17] | 9.0 | 17.7 |
| Ours | RNN-Transducer, Beam Search NO LM | 8.5 | 16.4 |
| Ours | RNN-Transducer, Beam Search + LM | 8.1 | 17.5 |
| Ours | Attention, Beam Search NO LM | 8.6 | 17.8 |
| Ours | Attention, Beam Search + LM | 8.6 | 17.8 |

Note: An unreported result using an RNN-LM trained on in-domain text could be better than the LACE + LF MMI result.

Table 1: WER comparison against previously published results on the Fisher-Switchboard Hub5’00 benchmark using in-domain data. We only list results using single models here. All the previous works reported WER using language models. We don’t leverage any speaker information in our models, though it has been shown to reduce WER in previous works [28, 25].

| Model | WER | Subs | Ins | Dels |
|---|---|---|---|---|
| CTC | 9.0 | 5.5 | 2.5 | 1.0 |
| RNN-Transducer | 8.1 | 4.7 | 2.6 | 0.8 |
| Attention | 8.6 | 5.4 | 1.2 | 2.0 |

Table 2: Error distribution for the SWBD slice in Hub5’00.
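
| Model | Decoding | Dev | Test |
|---|---|---|---|
| CTC [4] | Greedy decoding | 23.03 | - |
| CTC [4] | Beam search + LM (beam=2000) | 15.9 | 16.44 |
| RNN-Transducer | Greedy decoding | 18.99 | - |
| RNN-Transducer | Beam search (beam=32) | 17.41 | - |
| RNN-Transducer | + LM rescoring | 15.6 | 16.50 |
| Attention | Greedy decoding | 22.67 | - |
| Attention | Beam search (beam=256) | 18.71 | - |
| Attention | + Length-norm weight | 19.5 | - |
| Attention | + Coverage cost | 18.9 | - |
| Attention | + LM rescoring | 16.0 | 16.48 |

Table 3: Comparison of WER obtained by different transduction models on the DeepSpeech dataset, which has a mismatch between training and test distributions.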

| Model | Prediction |
|---|---|
| Ground Truth | SILENCE |
| RNN-Transducer | SILENCE |
| Attention | i want to get to get to get to get to get to get to get to get to do that |
| Ground Truth | play the black eyed peas songs |
| CTC + Greedy | lading to black irpen songs |
| CTC + Beam Search + LM | leading to black european songs |
| RNN-Transducer + Greedy | play the black eye piece songs |
| RNN-Transducer + Beam Search | play the black eye piece songs |
| RNN-Transducer + LM rescore | play the black eyed peas songs |
| Attention + Greedy | play the black eyed pea songs |
| Attention + Beam Search | play the black eyed pea songs |
| Attention + LM rescore | play the black eyed peas songs |

Table 4: Samples from decoding the same utterance across different models on the DeepSpeech dev set. We find that a big reason for the relatively worse WER of the attention model could be attributed to a few utterances like the first one, which contribute a lot to the edit distance. The first example shows only greedy decoding cases for all the models; the second set shows how the prediction evolves through various stages of decoding.

3.2 DeepSpeech corpus

The DeepSpeech corpus contains about hours of speech in a diverse set of scenarios, such as far-field speech, background noise, accents, etc. Additionally, the train and target sets are drawn from different distributions since we don’t have access to large volumes of data from the target distribution. We rely on external language models trained on a significantly larger corpus of text to close the gap between train and test distributions. This setting therefore provides us the best opportunity to study the impact of language models on attention and RNN-Transducers.

On the development set, note that the RNN-Transducer model matches the performance of the best CTC model within 1.5 WER without any language model, and completely closes the gap by rescoring the resulting beam of only 32 candidates. Surprisingly, attention models start from a WER similar to that of CTC models after greedy decoding, but the two architectures make very different errors. CTC models have a poorer WER mainly because of misspellings, but the relatively higher WER of attention models could be largely attributed to noisy utterances. In these cases, the attention models act like a language model and arbitrarily output characters while repeatedly attending over the same encoder time steps. While the coverage term in Equation 16 helps address this issue during beam search, the greedy decoding cannot be improved. An example of this situation is shown in Table 4. The monotonic left-to-right decoding of CTC and RNN-Transducers naturally avoids these issues. Further, the coverage term only helps keep the correct answers in the beam, and language model rescoring of the final beam is still required to bring the correct answers back to the top.

3.3 Experimental details

Data specification. Throughout the paper, all audio data is sampled at 16kHz and normalized to a constant power. Log-Linear or Log-Mel spectrograms (the specific type of featurization is a hyper-parameter we tune over) are extracted with a hop size of 10ms and window size of 20ms, and then globally normalized so that each input spectrogram bin has zero mean and unit variance. We do not use speaker information in any of our models. Every epoch, 40% of the utterances are randomly selected to have background noise added to them.
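The framing arithmetic and the global normalization described above can be sketched as follows. The helper names, the no-padding framing convention, and the small `eps` added to the standard deviation are assumptions for illustration; the paper does not specify these details.

```python
import math

SAMPLE_RATE = 16000          # Hz, as stated in the text
HOP_MS, WINDOW_MS = 10, 20   # hop and window sizes, as stated in the text

HOP = SAMPLE_RATE * HOP_MS // 1000        # 160 samples
WINDOW = SAMPLE_RATE * WINDOW_MS // 1000  # 320 samples


def num_frames(num_samples):
    """Number of full windows that fit, assuming no padding at the edges."""
    if num_samples < WINDOW:
        return 0
    return 1 + (num_samples - WINDOW) // HOP


def global_bin_stats(spectrograms):
    """Per-bin mean and standard deviation over every frame of every utterance.

    spectrograms: list of utterances, each a list of frames (lists of floats).
    """
    frames = [frame for spec in spectrograms for frame in spec]
    n, num_bins = len(frames), len(frames[0])
    means = [sum(f[b] for f in frames) / n for b in range(num_bins)]
    stds = [math.sqrt(sum((f[b] - means[b]) ** 2 for f in frames) / n)
            for b in range(num_bins)]
    return means, stds


def normalize(spec, means, stds, eps=1e-8):
    """Shift and scale each bin to zero mean and unit variance."""
    return [[(v - m) / (s + eps) for v, m, s in zip(frame, means, stds)]
            for frame in spec]


corpus = [[[1.0, 10.0], [3.0, 30.0]]]  # one toy utterance, two frames, two bins
means, stds = global_bin_stats(corpus)
normalized = normalize(corpus[0], means, stds)
```

With a 10ms hop, one second of audio yields roughly 100 frames, which is the input frame rate that the later discussion of encoder downsampling starts from.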

All models in Table 1 were trained on the standard Fisher-Swbd dataset comprising the LDC corpora (97S62, 2004S13, 2004T19, 2005S13, 2005T19). We use a portion of the RT02 corpus (2004S11) for hyper-parameter tuning. The language model used for decoding the CTC model, as well as for rescoring the other models, is the same 4-gram LM available for this benchmark from the Kaldi recipe [20]. The language model used by all models in Table 3 is built from a sample of the common crawl dataset [26].

Model specification. All models in Tables 1 and 3 are tuned independently of each other: we perform a random search over encoder and decoder sizes, amount of pooling, minibatch size, choice of optimizer, and learning and annealing rates. Further, no constraints are placed on any model in terms of number of parameters, wall clock time, or otherwise.

The training procedure mainly follows [2] and uses SortaGrad. All models use bi-directional ReLU GRU encoders with batch-normalization through depth, and may use a convolutional front-end. (We also find that these encoder layers could be replaced with LSTM layers with tanh activation, weight noise, and no batch normalization. In most cases, only 512 LSTM cells with weight noise can match the performance of large un-regularized GRU cells with batch-normalization.) In shorthand, [2x2D-Conv (2), 3x2560 GRU] represents a stack of 2 layers of 2D-convolution followed by a stack of 3 bidirectional ReLU GRU layers, where “(2)” indicates that the layer downsamples the input by 2 along the time dimension. In this notation, the best CTC model is [2x2D-Conv (2), 3x2560 GRU], and the best RNN-Transducer’s encoder is [2x2D-Conv (2), 4x2048 GRU] with a decoder of [3x1024 Fwd-GRU]. The best attention model works best without a convolutional front-end; its encoder is [4x2560 GRU (4)] and its decoder is [1x512 Fwd-GRU]. All models therefore have about 120M parameters. All models were trained with a minibatch of 512 on 16 M40 GPUs using synchronous SGD, and typically converge to the final solution within 70k iterations.

4 Impact of encoder architecture

In this section, we use the standard WSJ dataset to understand how the models perform with different encoding choices. Since encoder layers are far away from the loss functions we are evaluating, one would expect that an encoder that works well on CTC would also perform well on attention and RNN-Transducer. However, different training targets allow for different kinds of encoders. In particular: 1) the amount of downsampling in the encoder is an important factor that impacts both the training wall clock time and the accuracy of the model; 2) encoders with forward-only layers also allow for streaming decoding, so we also explore that aspect. We believe that these results on the smaller and more uniform dataset should still hold at scale, and therefore focus on the trends rather than optimizing for WER.

We control all the models in this section to have 4 layers of 256 bidirectional LSTM cells in the encoder, with weight noise. We perform random search over pooling in the encoder, whether to use a convolutional front-end, data augmentation, weight noise and optimization hyper-parameters. We report the best numbers within the first 60k iterations of training. (Better results are observed for all models if they are trained for 400k iterations, e.g., a WER of 15.72 for the Attention model after beam search on the WSJ dev’93 set, but the conclusions of the comparison remain unchanged.) This search over hyper-parameter space has allowed us to match previously published results. The attention model in Table 5 has a WER of 17.4 after beam search on the WSJ dev’93 set, which matches the previously published results (17.9) in [9]. Similarly, the CTC model has better results than reported in [13]. We therefore believe that this provides a good baseline to explore the trade-offs in modeling choices.

4.1 Forward-only encoders

Streaming transcription is an important requirement for ASR models. The first step towards deploying these models in this setting is to replace the bidirectional layers with forward-only recurrent layers. Note that while this immediately makes CTC and RNN-Transducer models deployable, attention models still need to be able to process the entire utterance before outputting the first character. Alternatives have been proposed to circumvent this issue [22, 1] and build attention models with monotonic attention and streaming decoders, but none of them are able to completely match the performance of the full attention models. Nevertheless, we believe a comparison with models with full attention is important for us to find out if full attention over the entire audio provides additional performance or improves training. In our experiment, we replace every layer of 256 bidirectional LSTM cells in the encoder with a layer of 512 forward-only LSTM cells.

| Model | Bidirectional, Greedy, No LM | Bidirectional, Beam, + LM | Forward-only, Beam, + LM |
|---|---|---|---|
| CTC | 15.73 | 10.08 | 13.78 |
| RNN-Transducer | 15.29 | 14.05 | 22.38 |
| Attention | 14.99 | 14.07 | 19.19 |

Table 5: WER of baseline models on the WSJ eval’92 set. On smaller datasets, RNN-Transducers and Attention models do not have enough data to learn a good implicit language model and therefore perform worse than CTC even after rescoring with an external LM (RNN-Transducers and Attention models learn a better implicit language model at scale, as shown in Tables 1 and 3).

From Table 5, we find that CTC models are significantly more stable, easier to train and perform better in the forward only setting. Also, since the attention models are quite a bit better than RNN-Transducer models, the full attention over all encoder time steps seems to be valuable.

4.2 Downsampling in the encoder

Figure 5: Effect of increasing the frame-rate on WER

One effective way to control both the memory usage and the training time of these models is to compress along the time dimension in the encoder, so that the recurrent layers are unrolled over fewer time steps. Previous results have shown that CTC models work best at 50 steps per second of audio [2] (a 2x reduction, since spectrograms are often made at 100 steps per second of audio), and attention models work best at about 12 steps per second of audio [6]. So given the same encoder architecture, the final encoder layer of an attention model with 3 layers of pyramidal pooling requires less compute than that of a CTC model. This is important since the attention then only needs to be computed over a small number of encoder time steps.

Figure 6: Visualization of learned alignments for the same utterance using CTC (left), RNN-Transducer (middle), and Attention (right). The alignments are between ground-truth text (y-axis) and audio features fed into the decoder (x-axis). Note that Attention performs two more time-scale downsampling steps, which results in shorter sequences (x-axis) compared to the other two.

Since RNN-Transducers and attention models can output multiple characters for the same encoder time step, we expect RNN-Transducers to be as robust as attention models as we increase the amount of pooling in the encoder. While Figure 5 shows that they are fairly robust compared to the CTC models, we find that attention models are significantly more robust. In addition, we have successfully trained attention models with up to 5 layers of pooling, a 32x reduction in the encoder, which forces the model to compress one second of audio into only 3 encoder steps.
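The frame rates quoted in this section follow from simple arithmetic, sketched below under the assumption that each pooling layer halves the time resolution and that the input spectrogram has 100 steps per second. The function name is illustrative.

```python
def encoder_steps_per_second(input_rate_hz, pooling_layers, factor=2):
    """Encoder output rate after stacking `pooling_layers` time-pooling layers.

    Assumes every pooling layer reduces the time resolution by `factor`.
    """
    return input_rate_hz / factor ** pooling_layers


ctc_rate = encoder_steps_per_second(100, 1)        # 50.0 steps per second
attention_rate = encoder_steps_per_second(100, 3)  # 12.5 steps per second
extreme_rate = encoder_steps_per_second(100, 5)    # 3.125 steps per second
```

The last case corresponds to a 32x reduction, which is where the figure of about 3 encoder steps per second of audio comes from.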

5 Alignment Visualization

The three transduction models formulate the alignments between input and output in different ways. CTC and RNN-Transducer models explicitly treat alignment as a latent variable and marginalize over all possible hard alignments while attention models a soft alignment between each output step and every input step. In addition, RNN-Transducer and Attention models allow for producing multiple characters by reading the same input locations while CTC can only produce one.

Here, we visualize the alignments learned by the three models to understand the formulations made by each model. Figure 6 plots the alignment for one utterance from the WSJ dev set. Since the alignment is computed based on ground-truth text (instead of predictions), all three models produce reasonable alignments; notably, the Attention alignment is monotonic. Several notable observations are listed below:

  • We can see the small jumps along x-axis in the left subfigure, as CTC inserts blanks into output labels in order to align with inputs.

  • Multiple attending (producing characters) along the same input (the same column) can be found in RNN-Transducer (middle) and Attention (right) models.

  • The alignments computed by CTC and RNN-Transducer are more concentrated (or peaky) compared to that of Attention. In addition, Attention model produces diffused distributions at the beginning of the audio.

6 Related Work

Segmental RNNs [18] provide another way to model the ASR task. Segmental RNNs model the output distribution using a zeroth-order CRF. While global normalization helps address the label bias issues in CTC, we believe that the bigger issue is still the conditional independence assumptions made by both CTC and Segmental RNNs.

[5, 8, 3] directly compare the WERs of attention models with those of CTC and RNN-Transducer listed in the original papers, without any control over either acoustic models or optimization methodology. [7] did an initial controlled comparison over several speech transduction models, but only presents results on a small dataset, TIMIT.

There has also been some recent effort [22, 1] to introduce local and monotonic constraints into attention models, especially for online applications. These efforts will in theory bridge the modelling assumptions between attention and RNN-Transducer models. With these constraints, the fitting capability of attention models would be limited, but they might be more robust to noisy test data in return. In other words, such attention models could work without extra tricks during beam search decoding, e.g., a coverage penalty.

7 Conclusion and Future Work

We present a thorough comparison of three popular models for the end-to-end ASR task at scale, and find that in the bidirectional setting, all three models perform roughly the same. However, these models differ in the simplicity of their training and decoding pipelines. Notably, end-to-end models trained with the CTC loss simplify the training process but still need to be decoded with large language models. RNN-Transducers and Attention models also simplify the decoding process and require language models to be introduced only in a post-processing stage to be equally, if not more, effective. Between these two, RNN-Transducers have the simplest decoding process, with no extra hyper-parameter tuning for decoding, which leads us to believe that RNN-Transducers represent the next generation of end-to-end speech models. In attempting to train RNN-Transducer models with the streaming constraint, and in reducing computation in encoder layers, we find that CTC and attention models still have strengths that we aim to leverage in our future work with RNN-Transducers.

8 Acknowledgements

We would like to thank Xiangang Li of the Baidu Speech Technology Group for feedback about the work and for helping improve the draft.